Closed bela127 closed 4 years ago
It was a dependency issue; it's solved.
Hi, do you know the solution to that issue? I'm still puzzled after reading #27
@kunwuz The large model support source code included in the patches directory of this repository is built into the TensorFlow provided by Watson Machine Learning Community Edition 1.7.0 and later. In order for the tf.config.experimental.set_lms_enabled(True) line to work, you need to either install this version of TensorFlow or apply the source code patch file and build TensorFlow from source.
The install instructions are here: https://github.com/IBM/tensorflow-large-model-support#installing-tensorflow-large-model-support
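To illustrate the point above, here is a minimal sketch of how a script could check for the LMS API before calling it, so that a stock TensorFlow build fails with a clear message instead of an AttributeError. The helper names lms_available and enable_lms_if_available are hypothetical and not part of TensorFlow or this repository; only set_lms_enabled comes from the thread.

```python
def lms_available(tf_module):
    """Return True if the given TensorFlow module exposes set_lms_enabled."""
    config = getattr(tf_module, "config", None)
    experimental = getattr(config, "experimental", None)
    return hasattr(experimental, "set_lms_enabled")


def enable_lms_if_available(tf_module):
    """Enable large model support when the build provides it.

    Returns True if LMS was enabled, False if this TensorFlow build
    does not include the LMS patch.
    """
    if lms_available(tf_module):
        tf_module.config.experimental.set_lms_enabled(True)
        return True
    return False
```

In a training script this would be used as `import tensorflow as tf` followed by `enable_lms_if_available(tf)`. A False result suggests that the TensorFlow in the active environment is not the WML CE build (or a patched source build).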
In the case of this issue, bela127 stated:
it was a dependecy issue its solved
so I suspect there was a dependency issue, conda channel issue, or other conda environment issue that prevented the correct level of TensorFlow from being installed from the WML CE channel.
This is exactly what happened. Because I had the wrong Python version and/or did not specifically choose "Watson Machine Learning Community Edition 1.7.0", an old TensorFlow version got installed.
The solution was to explicitly specify "Watson Machine Learning Community Edition 1.7.0" before the Python version at the conda install step.
Thanks guys for the reply. I installed tensorflow/tensorflow-gpu 2.1.0 from the channel with the command:
conda install tensorflow-gpu --channel https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/
It is installed in a newly created conda py3.7 environment. Is that correct?
I tried your command in my environment and it was going to install TensorFlow 1.15. After changing it to
conda install tensorflow-gpu=2.1.0 --channel https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/
it was going to install the right version.
You can check that you have the 2.1.0 tensorflow-gpu version installed from the channel by looking at the output of this command:
conda list | grep tensorflow
File "NGCF.py", line 500, in <module>
tf.config.experimental.set_lms_enabled(True)
AttributeError: module 'tensorflow_core._api.v2.config.experimental' has no attribute 'set_lms_enabled'
(zyjtf2) zhengyujia@omnisky:~/ngcf-wang-v2/NGCF$ conda list | grep tensorflow
tensorflow 2.1.0 gpu_py37_915.g4f6e601 https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
tensorflow-base 2.1.0 gpu_py37h6c5654b_0
tensorflow-estimator 2.1.0 pyhd54b08b_0
tensorflow-gpu 2.1.0 915.g4f6e601 https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
tensorflow-probability 0.9.0 pypi_0 pypi
That's pretty weird :( I'm using tf 1.x code in tf2.1, so I also tried the instruction in readme (config setting). And the feedback is:
'Experimental' object has no attribute 'lms_enabled'
Note the version of tensorflow-base:
tensorflow-base 2.1.0 gpu_py37h6c5654b_0
The bulk of the TensorFlow function is delivered in the tensorflow-base package. Your tensorflow-base was not installed from the WML CE conda channel.
I would suggest following the instructions for setup and install here, https://www.ibm.com/support/knowledgecenter/SS5SF7_1.7.0/navigation/wmlce_install.html, rather than using the --channel parameter on the conda install command. You will likely have better success.
Note, you may also need to set "strict channel priority", which is also covered on that page.
It works haha, thanks so much!
The reason is that I didn't set "strict channel priority", so tf-base from another channel was automatically installed when installing tf-gpu from the WML CE channel.
Again, I appreciate your kind help!
One more question: Do you have any idea how to install cudatoolkit=10.1 while keeping TFLMS working? The driver on the server doesn't support CUDA 10.2, and there are always conflicts between WML CE TF-GPU 2.1, WML CE TF-base 2.1 and cudatoolkit 10.1 in the conda environment.
It's always good to keep the Nvidia driver up to date. Newer ones will always work with older CUDA releases.
For scenarios where that is not possible, CUDA does offer limited forward compatibility.
In WML CE, the compat libraries are in the cudatoolkit-dev package, so:
conda install cudatoolkit-dev
Then you have to set your LD_LIBRARY_PATH to the compat directory so it picks up the compat libcuda instead of the one from the driver package. There is a shortcut to that directory: $CONDA_PREFIX/compat
In the next version of WML CE, we're going to offer those compat libs in a separate package to make it easier.
Note that the compat libraries only work back one level, so it likely won't work if you have a driver older than the one matched with CUDA 10.1 (which was the 418 driver)
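As a rough sketch of the steps above, the snippet below builds an environment with $CONDA_PREFIX/compat placed first on LD_LIBRARY_PATH, for launching the training script in a child process. The helper names env_with_cuda_compat and driver_meets_compat_floor are hypothetical, and the 418 floor is taken only from the note above. The path is set for a child process because changing LD_LIBRARY_PATH inside an already running Python process generally does not affect which libcuda gets loaded.

```python
import os


def env_with_cuda_compat(base_env):
    """Return a copy of base_env with $CONDA_PREFIX/compat first on LD_LIBRARY_PATH."""
    env = dict(base_env)
    prefix = env.get("CONDA_PREFIX")
    if not prefix:
        raise ValueError("CONDA_PREFIX is not set; activate the conda environment first")
    compat_dir = os.path.join(prefix, "compat")
    # Keep any existing entries, but drop duplicates of the compat directory.
    existing = [
        p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep)
        if p and p != compat_dir
    ]
    env["LD_LIBRARY_PATH"] = os.pathsep.join([compat_dir] + existing)
    return env


def driver_meets_compat_floor(driver_version, floor_major=418):
    """Return True if the driver's major version is at least floor_major.

    Assumption: the floor of 418 comes from the note in this thread
    about the driver matched with CUDA 10.1.
    """
    return int(driver_version.split(".")[0]) >= floor_major
```

A possible use would be `subprocess.run(["python", "NGCF.py"], env=env_with_cuda_compat(os.environ))`, after checking the driver version reported by nvidia-smi with `driver_meets_compat_floor`.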
(tf2) zh@om:~/ngcf-wang-v2/NGCF$ echo $LD_LIBRARY_PATH
/home/zh/anaconda3/envs/tf2/compat
And these are my NVIDIA/CUDA settings:
NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.2
cudatoolkit 10.2.89 680.g0f7a43a https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
cudatoolkit-dev 10.2.89 680.g0f7a43a https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
And I'm using a new env following the instructions (powerai).
But I still cannot make it work. I'm not sure whether I understand the settings correctly.
2020-04-11 03:12:18.377123: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-11 03:12:18.379353: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-11 03:12:18.380527: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-11 03:12:18.380578: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-11 03:12:18.383277: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Adding visible gpu devices: 0
2020-04-11 03:12:18.383320: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
Traceback (most recent call last):
File "NGCF.py", line 558, in
:
The LMS cannot be used with TensorFlow 2.1 from conda IBM WML CE
tf.config.experimental.set_lms_enabled(True)
produces:
module 'tensorflow_core._api.v2.config.experimental' has no attribute 'set_lms_enabled'