Closed guanwg closed 3 years ago
I can give a full list of steps I've taken and all the output messages from the build process if needed.
I think that would be best.
Here is a full list of commands:
49927 2021-04-09 11:12:53 mkdir tflms
49928 2021-04-09 11:12:56 cd tflms/
49929 2021-04-09 11:13:16 git clone https://github.com/tensorflow/tensorflow
49932 2021-04-09 11:16:52 git clone https://github.com/IBM/tensorflow-large-model-support.git
49935 2021-04-09 11:17:19 cd tensorflow
49936 2021-04-09 11:17:39 git pull --tags
49937 2021-04-09 11:17:52 git checkout v2.1.0
49938 2021-04-09 11:19:44 git am ../tensorflow-large-model-support/patches/tensorflow_v2.1.0_large_model_support.patch
49942 2021-04-09 11:26:38 module load bazel/3.6.0
49945 2021-04-09 11:27:08 module load cuda
49946 2021-04-09 11:27:16 module avail cudnn
49947 2021-04-09 11:27:29 module load cudnn
49949 2021-04-09 11:27:49 module avail python
49950 2021-04-09 11:28:10 module load python
49964 2021-04-09 11:48:30 module load gcc/9.3.0
49965 2021-04-09 11:48:42 ./configure
I need to point out that our system stores some commonly used shared libraries in different directories (/cvmfs/soft.computecanada.ca/gentoo/2020/usr/lib64, etc.) than the standard ones (/lib64, etc.). We have a script, "setrpath.sh", that makes sure binaries use the correct interpreter and search for their dynamically linked libraries in the correct directories. I might need to run setrpath.sh to fix some binaries after patching TF 2.1.
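To see which binaries might need that kind of fix, one option is to filter the output of the standard `ldd` tool for libraries the loader cannot resolve. This is only a sketch: the function name `unresolved_libs` is made up for illustration, and it says nothing about how setrpath.sh itself works.

```shell
# Read `ldd` output on stdin and print the names of shared libraries
# that the dynamic loader reported as "not found".
# (unresolved_libs is a hypothetical helper name, not part of any tool.)
unresolved_libs() {
    awk '/=> not found/ { print $1 }'
}

# Example invocation (the binary path is a placeholder):
#   ldd /path/to/some/binary | unresolved_libs
```

An empty result means the loader found every library the binary links against directly; it does not check symbol versions.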
After the command "./configure", I got
[guanw@gra-login3 tensorflow]$ ./configure
WARNING: Output base '/home/guanw/.cache/bazel/_bazel_guanw/f3d7f3f9c1db2581901d1d1887bd0510' is on NFS. This may lead to surprising failures and undetermined behavior.
Extracting Bazel installation...
JNI initialization failed: /home/guanw/.cache/bazel/_bazel_guanw/install/13141c49c5f6c9dce5711c2abd7bb258/libunix.so: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/guanw/.cache/bazel/_bazel_guanw/install/13141c49c5f6c9dce5711c2abd7bb258/libunix.so). Possibly your installation has been corrupted.
java.lang.UnsatisfiedLinkError: /home/guanw/.cache/bazel/_bazel_guanw/install/13141c49c5f6c9dce5711c2abd7bb258/libunix.so: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/guanw/.cache/bazel/_bazel_guanw/install/13141c49c5f6c9dce5711c2abd7bb258/libunix.so)
at java.base/java.lang.ClassLoader$NativeLibrary.load0(Native Method)
at java.base/java.lang.ClassLoader$NativeLibrary.load(ClassLoader.java:2430)
at java.base/java.lang.ClassLoader$NativeLibrary.loadLibrary(ClassLoader.java:2487)
at java.base/java.lang.ClassLoader.loadLibrary0(ClassLoader.java:2684)
at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2649)
at java.base/java.lang.Runtime.loadLibrary0(Runtime.java:829)
at java.base/java.lang.System.loadLibrary(System.java:1867)
at com.google.devtools.build.lib.unix.jni.UnixJniLoader.loadJni(UnixJniLoader.java:22)
at com.google.devtools.build.lib.unix.ProcessUtils.
Please advise on what to do. Thank you very much.
Weiguang
Wow, ok. I'm guessing the bazel you are using was built on a different system/glibc than you are running on your system.
Bazel is easy enough to build (make sure to grab the -dist package). So I'd recommend building yourself a copy that works for your system.
Since this doesn't really have anything to do with LMS, I'm closing the issue, though feel free to reopen it if you get past your bazel problem.
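One way to check the guess about a mismatched C++ runtime is to list the GLIBCXX symbol versions that a given libstdc++.so.6 actually provides and see whether the version named in the error is among them. The sketch below assumes a POSIX shell with `tr`, `grep` and `sort`; the helper names `glibcxx_versions` and `has_glibcxx` are invented for illustration.

```shell
# Print the distinct GLIBCXX_* version strings embedded in a library file.
# Non-printable bytes are turned into newlines so grep can scan the binary.
glibcxx_versions() {
    tr -c '[:print:]' '\n' < "$1" \
        | grep -oE 'GLIBCXX_[0-9]+(\.[0-9]+)*' \
        | sort -u
}

# Succeed if the library file ($1) provides exactly the version tag in $2.
has_glibcxx() {
    glibcxx_versions "$1" | grep -qxF "$2"
}

# Example invocation, using the paths from the error message above:
#   has_glibcxx /lib64/libstdc++.so.6 GLIBCXX_3.4.21 || echo "too old"
#   has_glibcxx /cvmfs/soft.computecanada.ca/gentoo/2020/usr/lib64/libstdc++.so.6 GLIBCXX_3.4.21
```

If the library under /lib64 lacks the version but the one under the /cvmfs tree has it, the binary is being resolved against the wrong libstdc++ rather than being corrupted.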
I have a bazel version that was built on our system. It is located at /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/bazel/3.6.0/bin/bazel, which is on PATH. But I don't know how to specify it so that TensorFlow's configure script uses it.
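A quick sanity check is to confirm which `bazel` the shell resolves first, and to put the site-built one at the front of PATH if another copy shadows it. This sketch assumes that configure simply runs whatever `bazel` is first on PATH, which is not confirmed in this thread; `prepend_path` and `resolve_cmd` are hypothetical helper names.

```shell
# Put a directory at the front of PATH unless it is already the first entry,
# then clear the shell's cache of command locations.
prepend_path() {
    case "$PATH" in
        "$1":*|"$1") ;;              # already first, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
    hash -r 2>/dev/null || true
}

# Print the full path of the executable the shell would run for a name.
resolve_cmd() {
    command -v "$1"
}

# Example invocation, using the install location mentioned above:
#   prepend_path /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/bazel/3.6.0/bin
#   resolve_cmd bazel
```

If `resolve_cmd bazel` already prints the /cvmfs path, PATH is not the problem and the failure lies in the libraries that the extracted Bazel files load at run time.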
I'm trying to build TensorFlow Large Model Support on a system where using conda is strongly discouraged. I wonder if I can avoid conda by building from source.
After patching TF 2.1.0 with tensorflow-large-model-support/patches/tensorflow_v2.1.0_large_model_support.patch, I simply ran "./configure" in the source directory, which doesn't work at all. Can someone help me? I can give a full list of steps I've taken and all the output messages from the build process if needed. Thank you.
Weiguang Guan