IBM / tensorflow-large-model-support

Large Model Support in Tensorflow
Apache License 2.0

After installing this package model is running on CPU #14

Closed siddas27 closed 5 years ago

siddas27 commented 5 years ago

My model was running fine on the GPU, but after installing this package as described in the README, it now runs on the CPU only. When I check the available devices, only the CPU shows up. How can I fix this?
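For readers who want to repeat the device check described above, here is a minimal sketch. It assumes TensorFlow's `device_lib.list_local_devices()` helper, which returns records with `name` and `device_type` fields; the function names `gpu_device_names` and `local_devices` and the `FakeDevice` stand-in are illustrative only and are not part of the package or the original report.

```python
from collections import namedtuple


def gpu_device_names(devices):
    """Return the names of all entries whose device_type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


def local_devices():
    """Ask TensorFlow for its visible devices (TensorFlow is imported lazily)."""
    from tensorflow.python.client import device_lib
    return device_lib.list_local_devices()


# Stand-in records, used only to illustrate the filter without TensorFlow.
FakeDevice = namedtuple("FakeDevice", ["name", "device_type"])
cpu_only = [FakeDevice("/device:CPU:0", "CPU")]
print(gpu_device_names(cpu_only))  # an empty list means no GPU is visible
```

In a real environment, `gpu_device_names(local_devices())` gives the same answer for the installed TensorFlow; an empty list corresponds to the "only CPU shows up" symptom.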

smatzek commented 5 years ago

Thanks for filing this issue.

With a pip-installed TensorFlow with GPU support (tensorflow-gpu), installing this module switched the implementation to tensorflow, which is CPU only. To fix your environment, you will need to fix up the pip modules so that tensorflow-gpu is present and picked up rather than the CPU-only version of TensorFlow that was pulled in by the install.
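Before fixing up the pip modules, it can help to see which TensorFlow distributions pip actually knows about. The following is a minimal sketch, assuming Python 3.8 or later for `importlib.metadata`; the name `installed_versions` and its `lookup` parameter are hypothetical helpers for illustration.

```python
from importlib import metadata  # standard library from Python 3.8 on

TF_NAMES = ("tensorflow", "tensorflow-gpu")


def installed_versions(names=TF_NAMES, lookup=metadata.version):
    """Map each installed distribution name to its version string."""
    found = {}
    for name in names:
        try:
            found[name] = lookup(name)
        except metadata.PackageNotFoundError:
            continue  # this distribution is not installed here
    return found


print(installed_versions())
```

In a pip-only environment, a result that contains `tensorflow` but not `tensorflow-gpu` matches the situation described above. In a conda environment the names alone are not conclusive, because conda's `tensorflow` package can itself be a GPU build.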

I have fixed this issue by changing the prerequisites to call out tensorflow-gpu. I did not hit this issue in my testing because I was using a conda-installed TensorFlow. The issue does not reproduce in that environment:

$ conda install tensorflow-gpu=1.13.1
# packages installed:
$ conda list | grep tensor
tensorboard               1.13.1           py36hf484d3e_0  
tensorflow                1.13.1          gpu_py36h3991807_0  
tensorflow-base           1.13.1          gpu_py36h8d69cac_0  
tensorflow-estimator      1.13.0                     py_0  
tensorflow-gpu            1.13.1               h0d30ee6_0  

When tensorflow-large-model-support is pip installed into a conda environment like this, it does not mess up tensorflow-gpu.

Jingnan-Jia commented 4 years ago

Thank you for your answer, @smatzek. I created an environment with conda and installed TensorFlow 1.15.0 with conda as well. However, after I ran pip install ./tensorflow-large-model-support, my TensorFlow packages were replaced by newer versions.

$ pip ./tensorflow-large-model-support/
ERROR: unknown command "./tensorflow-large-model-support/"
(py37) jjia@res-hpc-lo98:/exports/lkeb-hpc/jjia/project/e2e_new$ pip install ./tensorflow-large-model-support/
Processing ./tensorflow-large-model-support
Collecting tensorflow-gpu>=1.5
  Downloading https://files.pythonhosted.org/packages/a1/eb/bc0784af18f612838f90419cf4805c37c20ddb957f5ffe0c42144562dcfa/tensorflow_gpu-2.0.0-cp37-cp37m-manylinux2010_x86_64.whl (380.8MB)
     |████████████████████████████████| 380.8MB 19kB/s
Collecting toposort>=1.5
  Using cached https://files.pythonhosted.org/packages/e9/8a/321cd8ea5f4a22a06e3ba30ef31ec33bea11a3443eeb1d89807640ee6ed4/toposort-1.5-py2.py3-none-any.whl
Requirement already satisfied: grpcio>=1.8.6 in /exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages (from tensorflow-gpu>=1.5->tensorflow-large-model-support==0.1.0) (1.16.1)
Requirement already satisfied: astor>=0.6.0 in /exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages (from tensorflow-gpu>=1.5->tensorflow-large-model-support==0.1.0) (0.8.0)
Requirement already satisfied: six>=1.10.0 in /exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages (from tensorflow-gpu>=1.5->tensorflow-large-model-support==0.1.0) (1.13.0)
Requirement already satisfied: keras-applications>=1.0.8 in /exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages (from tensorflow-gpu>=1.5->tensorflow-large-model-support==0.1.0) (1.0.8)
Requirement already satisfied: numpy<2.0,>=1.16.0 in /exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages (from tensorflow-gpu>=1.5->tensorflow-large-model-support==0.1.0) (1.17.4)
Requirement already satisfied: protobuf>=3.6.1 in /exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages (from tensorflow-gpu>=1.5->tensorflow-large-model-support==0.1.0) (3.11.2)
Collecting tensorflow-estimator<2.1.0,>=2.0.0
  Using cached https://files.pythonhosted.org/packages/fc/08/8b927337b7019c374719145d1dceba21a8bb909b93b1ad6f8fb7d22c1ca1/tensorflow_estimator-2.0.1-py2.py3-none-any.whl
Requirement already satisfied: wrapt>=1.11.1 in /exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages (from tensorflow-gpu>=1.5->tensorflow-large-model-support==0.1.0) (1.11.2)
Requirement already satisfied: opt-einsum>=2.3.2 in /exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages (from tensorflow-gpu>=1.5->tensorflow-large-model-support==0.1.0) (3.1.0)
Requirement already satisfied: google-pasta>=0.1.6 in /exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages (from tensorflow-gpu>=1.5->tensorflow-large-model-support==0.1.0) (0.1.8)
Requirement already satisfied: absl-py>=0.7.0 in /exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages (from tensorflow-gpu>=1.5->tensorflow-large-model-support==0.1.0) (0.8.1)
Requirement already satisfied: wheel>=0.26 in /exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages (from tensorflow-gpu>=1.5->tensorflow-large-model-support==0.1.0) (0.33.6)
Requirement already satisfied: keras-preprocessing>=1.0.5 in /exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages (from tensorflow-gpu>=1.5->tensorflow-large-model-support==0.1.0) (1.1.0)
Requirement already satisfied: gast==0.2.2 in /exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages (from tensorflow-gpu>=1.5->tensorflow-large-model-support==0.1.0) (0.2.2)
Requirement already satisfied: termcolor>=1.1.0 in /exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages (from tensorflow-gpu>=1.5->tensorflow-large-model-support==0.1.0) (1.1.0)
Collecting tensorboard<2.1.0,>=2.0.0
  Downloading https://files.pythonhosted.org/packages/76/54/99b9d5d52d5cb732f099baaaf7740403e83fe6b0cedde940fabd2b13d75a/tensorboard-2.0.2-py3-none-any.whl (3.8MB)
     |████████████████████████████████| 3.8MB 29.1MB/s
Requirement already satisfied: h5py in /exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages (from keras-applications>=1.0.8->tensorflow-gpu>=1.5->tensorflow-large-model-support==0.1.0) (2.9.0)
Requirement already satisfied: setuptools in /exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages (from protobuf>=3.6.1->tensorflow-gpu>=1.5->tensorflow-large-model-support==0.1.0) (42.0.2.post20191203)
Collecting requests<3,>=2.21.0
  Using cached https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl
Collecting google-auth-oauthlib<0.5,>=0.4.1
  Using cached https://files.pythonhosted.org/packages/7b/b8/88def36e74bee9fce511c9519571f4e485e890093ab7442284f4ffaef60b/google_auth_oauthlib-0.4.1-py2.py3-none-any.whl
Collecting google-auth<2,>=1.6.3
  Using cached https://files.pythonhosted.org/packages/36/f8/84b5771faec3eba9fe0c91c8c5896364a8ba08852c0dea5ad2025026dd95/google_auth-1.10.0-py2.py3-none-any.whl
Requirement already satisfied: werkzeug>=0.11.15 in /exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow-gpu>=1.5->tensorflow-large-model-support==0.1.0) (0.16.0)
Requirement already satisfied: markdown>=2.6.8 in /exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow-gpu>=1.5->tensorflow-large-model-support==0.1.0) (3.1.1)
Collecting idna<2.9,>=2.5
  Using cached https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl
Requirement already satisfied: certifi>=2017.4.17 in /exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu>=1.5->tensorflow-large-model-support==0.1.0) (2019.11.28)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Using cached https://files.pythonhosted.org/packages/b4/40/a9837291310ee1ccc242ceb6ebfd9eb21539649f193a7c8c86ba15b98539/urllib3-1.25.7-py2.py3-none-any.whl
Collecting chardet<3.1.0,>=3.0.2
  Using cached https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl
Collecting requests-oauthlib>=0.7.0
  Using cached https://files.pythonhosted.org/packages/a3/12/b92740d845ab62ea4edf04d2f4164d82532b5a0b03836d4d4e71c6f3d379/requests_oauthlib-1.3.0-py2.py3-none-any.whl
Collecting rsa<4.1,>=3.1.4
  Using cached https://files.pythonhosted.org/packages/02/e5/38518af393f7c214357079ce67a317307936896e961e35450b70fad2a9cf/rsa-4.0-py2.py3-none-any.whl
Collecting cachetools<5.0,>=2.0.0
  Downloading https://files.pythonhosted.org/packages/08/6a/abf83cb951617793fd49c98cb9456860f5df66ff89883c8660aa0672d425/cachetools-4.0.0-py3-none-any.whl
Collecting pyasn1-modules>=0.2.1
  Using cached https://files.pythonhosted.org/packages/52/50/bb4cefca37da63a0c52218ba2cb1b1c36110d84dcbae8aa48cd67c5e95c2/pyasn1_modules-0.2.7-py2.py3-none-any.whl
Collecting oauthlib>=3.0.0
  Using cached https://files.pythonhosted.org/packages/05/57/ce2e7a8fa7c0afb54a0581b14a65b56e62b5759dbc98e80627142b8a3704/oauthlib-3.1.0-py2.py3-none-any.whl
Collecting pyasn1>=0.1.3
  Using cached https://files.pythonhosted.org/packages/62/1e/a94a8d635fa3ce4cfc7f506003548d0a2447ae76fd5ca53932970fe3053f/pyasn1-0.4.8-py2.py3-none-any.whl
Building wheels for collected packages: tensorflow-large-model-support
  Building wheel for tensorflow-large-model-support (setup.py) ... done
  Created wheel for tensorflow-large-model-support: filename=tensorflow_large_model_support-0.1.0-cp37-none-any.whl size=17270 sha256=75a236618f321f6b8b3b0d44593c725b52e3fbcdef78ca10ed33664ca7b8e20f
  Stored in directory: /home/jjia/.cache/pip/wheels/69/41/8c/b952f45ccd8fa39a5d75be005bc14f5d32d37cb57fc5c85513
Successfully built tensorflow-large-model-support
ERROR: tensorflow 1.15.0 has requirement tensorboard<1.16.0,>=1.15.0, but you'll have tensorboard 2.0.2 which is incompatible.
ERROR: tensorflow 1.15.0 has requirement tensorflow-estimator==1.15.1, but you'll have tensorflow-estimator 2.0.1 which is incompatible.
ERROR: tensorboard 2.0.2 has requirement grpcio>=1.24.3, but you'll have grpcio 1.16.1 which is incompatible.
Installing collected packages: tensorflow-estimator, idna, urllib3, chardet, requests, oauthlib, requests-oauthlib, pyasn1, rsa, cachetools, pyasn1-modules, google-auth, google-auth-oauthlib, tensorboard, tensorflow-gpu, toposort, tensorflow-large-model-support
  Found existing installation: tensorflow-estimator 1.15.1
    Uninstalling tensorflow-estimator-1.15.1:
      Successfully uninstalled tensorflow-estimator-1.15.1
  Found existing installation: tensorboard 1.15.0
    Uninstalling tensorboard-1.15.0:
      Successfully uninstalled tensorboard-1.15.0
Successfully installed cachetools-4.0.0 chardet-3.0.4 google-auth-1.10.0 google-auth-oauthlib-0.4.1 idna-2.8 oauthlib-3.1.0 pyasn1-0.4.8 pyasn1-modules-0.2.7 requests-2.22.0 requests-oauthlib-1.3.0 rsa-4.0 tensorboard-2.0.2 tensorflow-estimator-2.0.1 tensorflow-gpu-2.0.0 tensorflow-large-model-support-0.1.0 toposort-1.5 urllib3-1.25.7

After I installed the tensorflow-large-model-support package, 'conda list tensorflow' showed:

tensorboard                     2.0.2    pypi_0              pypi
tensorflow                      1.15.0   gpu_py37h0f0df58_0
tensorflow-base                 1.15.0   gpu_py37h9dcbed7_0
tensorflow-estimator            2.0.1    pypi_0              pypi
tensorflow-gpu                  2.0.0    pypi_0              pypi
tensorflow-large-model-support  0.1.0    pypi_0              pypi

You can see that my conda-installed tensorflow-gpu, tensorflow-estimator, and tensorboard were replaced by newer pip-installed ones.
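The same reading of the `conda list` output can be done mechanically. This is a small sketch that assumes the usual `name version build channel` column layout, where pip-installed rows carry `pypi` in the channel column; the function name `pip_installed` is hypothetical, and the sample rows are copied from the listing in this comment.

```python
def pip_installed(conda_list_text):
    """Return {name: version} for rows whose channel column is 'pypi'."""
    result = {}
    for line in conda_list_text.splitlines():
        fields = line.split()
        # Skip blank lines, header comments, and rows without a channel column.
        if len(fields) < 4 or fields[0].startswith("#"):
            continue
        if fields[-1] == "pypi":
            result[fields[0]] = fields[1]
    return result


sample = """\
tensorboard 2.0.2 pypi_0 pypi
tensorflow 1.15.0 gpu_py37h0f0df58_0
tensorflow-base 1.15.0 gpu_py37h9dcbed7_0
tensorflow-estimator 2.0.1 pypi_0 pypi
tensorflow-gpu 2.0.0 pypi_0 pypi
tensorflow-large-model-support 0.1.0 pypi_0 pypi
"""

for name, version in pip_installed(sample).items():
    print(name, version)
```

The rows it reports are the ones that pip brought in; the two rows with `gpu_py37...` build strings are still the conda-installed 1.15.0 packages.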

So I had to downgrade those packages to 1.15.0 again.
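One possible way to avoid the overwrite in the first place, not suggested anywhere in this thread and offered only as an assumption, is pip's `--no-deps` option, which installs a package without resolving its requirements. The sketch below only builds the command; the helper name `pip_install_no_deps` is hypothetical.

```python
import sys


def pip_install_no_deps(path):
    """Build a pip command that installs `path` without its dependencies."""
    return [sys.executable, "-m", "pip", "install", "--no-deps", path]


cmd = pip_install_no_deps("./tensorflow-large-model-support")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

With `--no-deps`, the other requirement that appears in the pip log (`toposort>=1.5`) would have to be installed separately. This only concerns the package replacement; it says nothing about the GPU memory behaviour described next.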

Apart from the overwritten-package problem, the most important problem is that, even after I downgraded those packages and ran my code with tflms, a bigger input size still exhausts GPU memory, just as if I were not using tflms. (I use U-Net to train on 3D lung CT with input size 192×192×112.)