IBM / tensorflow-large-model-support

Large Model Support in Tensorflow
Apache License 2.0

Not integrated in TensorFlow 2 from IBM WML CE #29

Closed bela127 closed 4 years ago

bela127 commented 4 years ago

The LMS cannot be used with TensorFlow 2.1 installed via conda from IBM WML CE.

tf.config.experimental.set_lms_enabled(True)

produces:

module 'tensorflow_core._api.v2.config.experimental' has no attribute 'set_lms_enabled'
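
For reference, a minimal sketch of how the call is meant to be used, assuming a TensorFlow build that carries the WML CE Large Model Support patch (the attribute does not exist in stock TensorFlow 2.1, which is exactly the AttributeError shown above):

```python
import tensorflow as tf

# set_lms_enabled() is only present in TensorFlow builds that include the
# Large Model Support patch (e.g. the WML CE 1.7.0 conda packages).
# Guard the call so a stock build fails with a clear message instead of
# the bare AttributeError quoted above.
if hasattr(tf.config.experimental, "set_lms_enabled"):
    tf.config.experimental.set_lms_enabled(True)  # enable LMS before building the model
else:
    raise RuntimeError(
        "This TensorFlow build has no Large Model Support; install the "
        "WML CE TensorFlow or build from source with the LMS patch applied."
    )
```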

bela127 commented 4 years ago

It was a dependency issue; it's solved.

kunwuz commented 4 years ago

Hi, do you know the solution to that issue? I'm still puzzled after reading #27

smatzek commented 4 years ago

@kunwuz The large model support source code included in the patches directory of this repository is built into the TensorFlow provided by Watson Machine Learning Community Edition 1.7.0 and later. In order for the tf.config.experimental.set_lms_enabled(True) line to work you either need to install this version of TensorFlow or apply the source code patch file and build TensorFlow from source.

The install instructions are here: https://github.com/IBM/tensorflow-large-model-support#installing-tensorflow-large-model-support

In the case of this issue, bela127 stated:

It was a dependency issue; it's solved.

so I suspect there was a dependency issue, conda channel issue, or other conda environment issue that prevented the correct level of TensorFlow from being installed from the WML CE channel.

bela127 commented 4 years ago

so I suspect there was a dependency issue, conda channel issue, or other conda environment issue that prevented the correct level of TensorFlow from being installed from the WML CE channel.

This is exactly what happened. Because the wrong Python version was used and/or "Watson Machine Learning Community Edition 1.7.0" was not chosen explicitly, an old TensorFlow version got installed.

The solution was to explicitly specify "Watson Machine Learning Community Edition 1.7.0" before the Python version in the conda install step.

kunwuz commented 4 years ago

Thanks guys for the reply. I installed tensorflow/tensorflow-gpu 2.1.0 from the channel with the command: conda install tensorflow-gpu --channel https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/ and it is installed in a newly created conda Python 3.7 environment. Is that correct?

smatzek commented 4 years ago

I tried your command in my environment and it was going to install TensorFlow 1.15. After changing it to conda install tensorflow-gpu=2.1.0 --channel https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/ it was going to install the right version.

You can check that you have the 2.1.0 tensorflow-gpu version installed from the channel by looking at the output of this command: conda list | grep tensorflow
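
As a complement to the conda check, a small sketch that verifies the active interpreter actually sees the LMS-enabled build (it assumes only that the WML CE build exposes set_lms_enabled on tf.config.experimental):

```python
import tensorflow as tf

# Should print 2.1.0 and True when the WML CE 1.7.0 TensorFlow is installed;
# False means the import resolved to a TensorFlow build without the LMS patch.
print("TensorFlow version:", tf.version.VERSION)
print("LMS API available:", hasattr(tf.config.experimental, "set_lms_enabled"))
```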

kunwuz commented 4 years ago
  File "NGCF.py", line 500, in <module>
    tf.config.experimental.set_lms_enabled(True)
AttributeError: module 'tensorflow_core._api.v2.config.experimental' has no attribute 'set_lms_enabled'
(zyjtf2) zhengyujia@omnisky:~/ngcf-wang-v2/NGCF$ conda list | grep tensorflow
tensorflow                2.1.0           gpu_py37_915.g4f6e601    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
tensorflow-base           2.1.0           gpu_py37h6c5654b_0
tensorflow-estimator      2.1.0              pyhd54b08b_0
tensorflow-gpu            2.1.0              915.g4f6e601    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
tensorflow-probability    0.9.0                    pypi_0    pypi

That's pretty weird :( I'm using TF 1.x code on TF 2.1, so I also tried the instructions in the README (the config setting), and the feedback is: 'Experimental' object has no attribute 'lms_enabled'
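
For reference, a sketch of the TF 1.x-style path referred to here; the lms_enabled field on GPUOptions.Experimental is inferred from the README reference and the error message above, and only exists in the patched WML CE build:

```python
import tensorflow as tf

# TF 1.x-style enablement: the LMS patch adds an lms_enabled field to
# GPUOptions.Experimental (field name inferred from the error above).
# On an unpatched tensorflow-base this assignment raises exactly
# "'Experimental' object has no attribute 'lms_enabled'".
config = tf.compat.v1.ConfigProto()
config.gpu_options.experimental.lms_enabled = True
sess = tf.compat.v1.Session(config=config)
```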

smatzek commented 4 years ago

Note the version of tensorflow-base: tensorflow-base 2.1.0 gpu_py37h6c5654b_0. The bulk of the TensorFlow function is delivered in the tensorflow-base package. Your tensorflow-base was not installed from the WML CE conda channel.

I would suggest following the instructions for setup and install here, https://www.ibm.com/support/knowledgecenter/SS5SF7_1.7.0/navigation/wmlce_install.html, rather than using the --channel parameter on the conda install. You will likely have better success.

Note, you may also need to set "strict channel priority" which is also covered on that page.

kunwuz commented 4 years ago

Note the version of tensorflow-base: tensorflow-base 2.1.0 gpu_py37h6c5654b_0. The bulk of the TensorFlow function is delivered in the tensorflow-base package. Your tensorflow-base was not installed from the WML CE conda channel.

I would suggest following the instructions for setup and install here, https://www.ibm.com/support/knowledgecenter/SS5SF7_1.7.0/navigation/wmlce_install.html, rather than using the --channel parameter on the conda install. You will likely have better success.

Note, you may also need to set "strict channel priority" which is also covered on that page.

It works haha, thanks so much!

The reason is that I didn't set "strict channel priority", so tensorflow-base from another channel was automatically installed when installing tensorflow-gpu from the WML CE channel.

Again, I appreciate your kind help!

kunwuz commented 4 years ago

One more question: do you have any idea how to install cudatoolkit=10.1 while keeping TFLMS working? The server's driver doesn't support CUDA 10.2, and there are always conflicts between WML CE tensorflow-gpu 2.1, WML CE tensorflow-base 2.1, and cudatoolkit 10.1 in the conda environment.

jayfurmanek commented 4 years ago

It's always good to keep the NVIDIA driver up to date; newer ones will always work with older CUDA releases. For scenarios where that is not possible, CUDA does offer limited forward compatibility. In WML CE, the compat libraries are in the cudatoolkit-dev package.

So conda install cudatoolkit-dev

Then you have to set your LD_LIBRARY_PATH to the compat directory so it picks up the compat libcuda instead of the one from the driver package. There is a shortcut to that directory: $CONDA_PREFIX/compat.

In the next version of WML CE, we're going to offer those compat libs in a separate package to make it easier.

Note that the compat libraries only work back one level, so it likely won't work if you have a driver older than the one matched with CUDA 10.1 (which was the 418 driver)
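
A small sanity-check sketch for this setup (assumptions: cudatoolkit-dev puts the forward-compat libraries, including libcuda.so, directly under $CONDA_PREFIX/compat, and LD_LIBRARY_PATH must point there before Python starts):

```python
import os

compat_dir = os.path.join(os.environ.get("CONDA_PREFIX", ""), "compat")
ld_path = os.environ.get("LD_LIBRARY_PATH", "")

# The compat libcuda must exist and its directory must already be on
# LD_LIBRARY_PATH when the process starts; otherwise TensorFlow still loads
# the driver's libcuda and fails with "driver version is insufficient".
has_compat_dir = os.path.isdir(compat_dir)
has_libcuda = has_compat_dir and any(
    name.startswith("libcuda.so") for name in os.listdir(compat_dir)
)
print("compat dir exists:", has_compat_dir)
print("compat libcuda present:", has_libcuda)
print("compat dir on LD_LIBRARY_PATH:", compat_dir in ld_path.split(":"))
```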

kunwuz commented 4 years ago
(tf2) zh@om:~/ngcf-wang-v2/NGCF$ echo $LD_LIBRARY_PATH
/home/zh/anaconda3/envs/tf2/compat

And that's my NVIDIA/CUDA setup:

NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.2

cudatoolkit               10.2.89            680.g0f7a43a    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
cudatoolkit-dev           10.2.89            680.g0f7a43a    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda

And I'm using a new environment created following the instructions (powerai).

But I still cannot make it work. I'm not sure whether I understand the setup correctly or not.

2020-04-11 03:12:18.377123: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-11 03:12:18.379353: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-11 03:12:18.380527: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-11 03:12:18.380578: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-11 03:12:18.383277: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Adding visible gpu devices: 0
2020-04-11 03:12:18.383320: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
Traceback (most recent call last):
  File "NGCF.py", line 558, in <module>
    sess = tf.compat.v1.Session(config=config)
  File "/home/zhengyujia/anaconda3/envs/zyjtf2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1587, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/zhengyujia/anaconda3/envs/zyjtf2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 703, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
