IBM / tensorflow-large-model-support

Large Model Support in Tensorflow
Apache License 2.0

Build LMS without conda #53

Closed guanwg closed 3 years ago

guanwg commented 3 years ago

I'm trying to build TensorFlow Large Model Support on a system where the use of conda is strongly discouraged. I wonder if I can avoid conda by building from source.

After patching TF 2.1.0 with tensorflow-large-model-support/patches/tensorflow_v2.1.0_large_model_support.patch, I simply ran "./configure" in the directory, which fails. Can someone help me? I can give a full list of the steps I've taken and all the output messages from the build process if needed. Thank you.

Weiguang Guan

jayfurmanek commented 3 years ago

can give a full list of steps I've taken and all the output message from the building process if needed.

I think that would be best.

guanwg commented 3 years ago

Here is a full list of commands:

    mkdir tflms
    cd tflms/
    git clone https://github.com/tensorflow/tensorflow
    git clone https://github.com/IBM/tensorflow-large-model-support.git
    cd tensorflow
    git pull --tags
    git checkout v2.1.0
    git am ../tensorflow-large-model-support/patches/tensorflow_v2.1.0_large_model_support.patch
    module load bazel/3.6.0
    module load cuda
    module avail cudnn
    module load cudnn
    module avail python
    module load python
    module load gcc/9.3.0
    ./configure

I need to point out that our system stores some commonly used shared libraries in different directories (/cvmfs/soft.computecanada.ca/gentoo/2020/usr/lib64, etc.) than the standard ones (/lib64, etc.). We have a script, "setrpath.sh", that makes sure binaries use the correct interpreter and search for their dynamically linked libraries in the correct directories. I might need to use setrpath.sh to correct some binaries after patching TF 2.1.

After the command "./configure", I got:

    [guanw@gra-login3 tensorflow]$ ./configure
    WARNING: Output base '/home/guanw/.cache/bazel/_bazel_guanw/f3d7f3f9c1db2581901d1d1887bd0510' is on NFS. This may lead to surprising failures and undetermined behavior.
    Extracting Bazel installation...
    JNI initialization failed: /home/guanw/.cache/bazel/_bazel_guanw/install/13141c49c5f6c9dce5711c2abd7bb258/libunix.so: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/guanw/.cache/bazel/_bazel_guanw/install/13141c49c5f6c9dce5711c2abd7bb258/libunix.so). Possibly your installation has been corrupted.
    java.lang.UnsatisfiedLinkError: /home/guanw/.cache/bazel/_bazel_guanw/install/13141c49c5f6c9dce5711c2abd7bb258/libunix.so: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/guanw/.cache/bazel/_bazel_guanw/install/13141c49c5f6c9dce5711c2abd7bb258/libunix.so)
        at java.base/java.lang.ClassLoader$NativeLibrary.load0(Native Method)
        at java.base/java.lang.ClassLoader$NativeLibrary.load(ClassLoader.java:2430)
        at java.base/java.lang.ClassLoader$NativeLibrary.loadLibrary(ClassLoader.java:2487)
        at java.base/java.lang.ClassLoader.loadLibrary0(ClassLoader.java:2684)
        at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2649)
        at java.base/java.lang.Runtime.loadLibrary0(Runtime.java:829)
        at java.base/java.lang.System.loadLibrary(System.java:1867)
        at com.google.devtools.build.lib.unix.jni.UnixJniLoader.loadJni(UnixJniLoader.java:22)
        at com.google.devtools.build.lib.unix.ProcessUtils.<clinit>(ProcessUtils.java:27)
        at com.google.devtools.build.lib.util.ProcessUtils.getpid(ProcessUtils.java:51)
        at com.google.devtools.build.lib.runtime.BlazeRuntime.getPidUsingJNI(BlazeRuntime.java:1400)
        at com.google.devtools.build.lib.runtime.BlazeRuntime.maybeForceJNIByGettingPid(BlazeRuntime.java:1389)
        at com.google.devtools.build.lib.runtime.BlazeRuntime.maybeGetPidString(BlazeRuntime.java:1382)
        at com.google.devtools.build.lib.runtime.BlazeRuntime.batchMain(BlazeRuntime.java:941)
        at com.google.devtools.build.lib.runtime.BlazeRuntime.main(BlazeRuntime.java:771)
        at com.google.devtools.build.lib.bazel.Bazel.main(Bazel.java:85)
    ERROR: crash in async thread: java.lang.UnsatisfiedLinkError: /home/guanw/.cache/bazel/_bazel_guanw/install/13141c49c5f6c9dce5711c2abd7bb258/libunix.so: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/guanw/.cache/bazel/_bazel_guanw/install/13141c49c5f6c9dce5711c2abd7bb258/libunix.so)
        at java.base/java.lang.ClassLoader$NativeLibrary.load0(Native Method)
        at java.base/java.lang.ClassLoader$NativeLibrary.load(ClassLoader.java:2430)
        at java.base/java.lang.ClassLoader$NativeLibrary.loadLibrary(ClassLoader.java:2487)
        at java.base/java.lang.ClassLoader.loadLibrary0(ClassLoader.java:2684)
        at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2649)
        at java.base/java.lang.Runtime.loadLibrary0(Runtime.java:829)
        at java.base/java.lang.System.loadLibrary(System.java:1867)
        at com.google.devtools.build.lib.unix.jni.UnixJniLoader.loadJni(UnixJniLoader.java:22)
        at com.google.devtools.build.lib.unix.ProcessUtils.<clinit>(ProcessUtils.java:27)
        at com.google.devtools.build.lib.util.ProcessUtils.getpid(ProcessUtils.java:51)
        at com.google.devtools.build.lib.runtime.BlazeRuntime.getPidUsingJNI(BlazeRuntime.java:1400)
        at com.google.devtools.build.lib.runtime.BlazeRuntime.maybeForceJNIByGettingPid(BlazeRuntime.java:1389)
        at com.google.devtools.build.lib.runtime.BlazeRuntime.maybeGetPidString(BlazeRuntime.java:1382)
        at com.google.devtools.build.lib.runtime.BlazeRuntime.batchMain(BlazeRuntime.java:941)
        at com.google.devtools.build.lib.runtime.BlazeRuntime.main(BlazeRuntime.java:771)
        at com.google.devtools.build.lib.bazel.Bazel.main(Bazel.java:85)
    Traceback (most recent call last):
      File "./configure.py", line 1549, in <module>
        main()
      File "./configure.py", line 1364, in main
        current_bazel_version = check_bazel_version(_TF_MIN_BAZEL_VERSION,
      File "./configure.py", line 477, in check_bazel_version
        curr_version = run_shell(
      File "./configure.py", line 156, in run_shell
        output = subprocess.check_output(cmd)
      File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.2/lib/python3.8/subprocess.py", line 411, in check_output
        return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
      File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.2/lib/python3.8/subprocess.py", line 512, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['bazel', '--batch', '--bazelrc=/dev/null', 'version']' returned non-zero exit status 37.
    [guanw@gra-login3 tensorflow]$

Please advise on what to do. Thank you very much.

Weiguang

jayfurmanek commented 3 years ago

Wow, ok. I'm guessing the bazel you are using was built on a different system/glibc than you are running on your system.

Bazel is easy enough to build (make sure to grab the -dist package), so I'd recommend building yourself a copy that works for your system. Since this doesn't really have anything to do with LMS, I'm closing the issue, though feel free to reopen it if you get past your bazel problem.
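For anyone hitting the same error, a quick way to confirm the mismatch is to list which GLIBCXX symbol versions a given libstdc++ actually provides and see whether the one named in the error (GLIBCXX_3.4.21 above) is among them. The following is a minimal sketch, assuming GNU grep is available; the function names glibcxx_versions and has_glibcxx are made up for illustration and are not part of bazel or TensorFlow.

```shell
# Hypothetical helpers (not part of any tool): inspect a libstdc++ binary
# for the GLIBCXX symbol versions it advertises.

# Print every distinct GLIBCXX_<number> version string found in the file.
# -a treats the binary as text, -o prints only the matching part.
glibcxx_versions() {
    grep -a -o 'GLIBCXX_[0-9][0-9.]*' "$1" | sort -u
}

# Succeed if the library advertises exactly the requested version.
has_glibcxx() {
    glibcxx_versions "$1" | grep -x -F "GLIBCXX_$2" >/dev/null
}

# Example use (paths taken from the report above):
#   has_glibcxx /lib64/libstdc++.so.6 3.4.21 || echo "too old for this bazel"
#   glibcxx_versions /cvmfs/soft.computecanada.ca/gentoo/2020/usr/lib64/libstdc++.so.6
```

If the system copy in /lib64 lacks the version but the one under the site-specific directory has it, the problem is which library gets picked up at load time rather than bazel itself being corrupted.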

guanwg commented 3 years ago

I have a bazel version that was built on our system. It is located at /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/bazel/3.6.0/bin/bazel, which is on PATH. But I don't know how to tell TensorFlow's configure script to use it.
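Judging from the traceback above, configure.py invokes a bare `bazel` command ('bazel', '--batch', '--bazelrc=/dev/null', 'version'), so it appears to use whichever bazel comes first on PATH. A minimal sketch for checking that and for forcing a particular directory to the front of PATH for a single command follows; the helper names resolve_bazel and with_bazel_dir are hypothetical, and the assumption that configure has no other way to select bazel is not confirmed anywhere in this thread.

```shell
# Hypothetical helpers (not part of TensorFlow or bazel).

# Show which executable a bare `bazel` command currently resolves to.
resolve_bazel() {
    command -v bazel
}

# Run a command with the given directory placed first on PATH,
# without changing PATH for the rest of the shell session.
with_bazel_dir() {
    bazel_dir=$1
    shift
    env PATH="$bazel_dir:$PATH" "$@"
}

# Example use (directory taken from the comment above):
#   resolve_bazel
#   with_bazel_dir /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/bazel/3.6.0/bin ./configure
```

If resolve_bazel already prints the locally built copy, then PATH is not the issue and the JNI error more likely comes from which libstdc++ is found when bazel's extracted libunix.so is loaded.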