IBM / tensorflow-large-model-support

Large Model Support in Tensorflow
Apache License 2.0

No operations were found with optimizer scope training/RMSprop/gradients. #6

Closed aohang111 closed 5 years ago

aohang111 commented 5 years ago

It's my first time using TensorFlow LMS. I tried to run the tf.keras-based training script (Keras_ResNet50.py) with the command: `python3.4 Keras_ResNet50.py --image_size 3900 --lms`

It failed with the error below:

```
INFO:tensorflow:[LMS][0] Editing model for LMS
INFO:tensorflow:[LMS][0] n_tensors: all tensors
INFO:tensorflow:[LMS][0] lb: 1
Traceback (most recent call last):
  File "Keras_ResNet50.py", line 195, in <module>
    run_model(args)
  File "Keras_ResNet50.py", line 141, in run_model
    epochs=args.epochs, callbacks=get_callbacks(args))
  File "/usr/lib/python3.4/site-packages/tensorflow/python/keras/engine/training.py", line 1779, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/lib/python3.4/site-packages/tensorflow/python/keras/engine/training_generator.py", line 98, in fit_generator
    callbacks.set_model(callback_model)
  File "/usr/lib/python3.4/site-packages/tensorflow/python/keras/callbacks.py", line 71, in set_model
    callback.set_model(model)
  File "/usr/lib/python3.4/site-packages/tensorflow_large_model_support/lms.py", line 957, in set_model
    lmsMod.run()
  File "/usr/lib/python3.4/site-packages/tensorflow_large_model_support/lms.py", line 307, in run
    self._build_gradient_ops()
  File "/usr/lib/python3.4/site-packages/tensorflow_large_model_support/lms.py", line 169, in _build_gradient_ops
    'scope {}.'.format(scope))
ValueError: No operations were found with optimizer scope training/RMSprop/gradients.
```

Something seems wrong with the name scope? Any suggestions?

environment: Architecture: ppc64le

smatzek commented 5 years ago

I believe this is a TensorFlow version issue. As noted in the Keras_ResNet50.py example it was tested with TensorFlow 1.12, and specifically the TensorFlow 1.12 that comes in IBM PowerAI. What version of TensorFlow are you using?

One note on the 3900 image size: that is the maximum I've observed while incrementing the size by 100. It will occasionally fail at that resolution on the system setup noted in the script, likely due to GPU garbage collection timing windows. While attempting to recreate your issue this morning, I hit an OOM on one of the four attempts I tried.

Through its development TFLMS has been tested on TF releases as old as 1.5, but the Keras callback was added later and only tested back as far as 1.8 or 1.10.

The LMSKerasCallback has an option to override the automatic setting of the gradient name scope: https://github.com/IBM/tensorflow-large-model-support/blob/master/tensorflow_large_model_support/lms.py#L929

The name scope is currently set to a default value by this code: https://github.com/IBM/tensorflow-large-model-support/blob/master/tensorflow_large_model_support/lms.py#L952

Which follows the name scope that is used in Keras: https://github.com/tensorflow/tensorflow/blob/v1.12.0/tensorflow/python/keras/engine/training.py#L696-L697
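For illustration, here is a minimal sketch (plain Python; the optimizer class name `'RMSprop'` is taken from the error message, standing in for `type(model.optimizer).__name__`) of how that default gradient name scope string is composed:

```python
# Sketch: the default optimizer scope guessed by LMSKerasCallback follows the
# tf.keras 1.12 convention of 'training/<OptimizerClassName>/gradients'.
optimizer_name = 'RMSprop'  # stands in for type(model.optimizer).__name__
default_scope = 'training/' + optimizer_name + '/gradients'
print(default_scope)  # training/RMSprop/gradients
```

If the graph was built without the `training/` prefix (as on some TensorFlow versions), this default will not match any operation, which produces the ValueError in the report above.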

So you could attempt to override the gradient name scope, possibly by removing the trailing /gradients, like this:

 diff --git a/examples/Keras_ResNet50.py b/examples/Keras_ResNet50.py
index aa56a6c..008a52a 100644
--- a/examples/Keras_ResNet50.py
+++ b/examples/Keras_ResNet50.py
@@ -114,7 +114,7 @@ def get_callbacks(args):
         # speeds up graph analysis time.
         starting_names = ['bn_conv1/cond/pred_id']
         lms = LMSKerasCallback(n_tensors=args.n_tensors, lb=args.lb,
-                               starting_op_names=starting_names)
+                               starting_op_names=starting_names, optimizer_scopes_override={'training/RMSprop'})
         callbacks.append(lms)

If that still doesn't work, you will need to figure out what the optimizer scope name is for the model in your version of tf.keras. When I have done this in the past, I added some code at the start of the LMS.run() method to get all ops and dump their names, like this:

```python
import tensorflow as tf

# Dump every node in the default graph; the 'name' field is what the LMS
# optimizer scope matching looks for.
for n in tf.get_default_graph().as_graph_def().node:
    print(n)
```
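If the full dump is too noisy, a small pure-Python filter over the dumped names can surface candidate optimizer scopes. This is only a sketch: the helper name and the sample list below are hypothetical, not part of the example script.

```python
def find_scope_candidates(op_names, optimizer_hint='RMSprop'):
    """Return candidate gradient scope prefixes of ops mentioning the optimizer."""
    return sorted({name.split('/gradients')[0] + '/gradients'
                   for name in op_names
                   if optimizer_hint in name and '/gradients' in name})

# Hypothetical op names, shaped like a TF 1.10 tf.keras graph dump.
sample = ['conv1/kernel',
          'RMSprop/gradients/conv1/Conv2D_grad/ShapeN',
          'RMSprop/lr']
print(find_scope_candidates(sample))  # ['RMSprop/gradients']
```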
aohang111 commented 5 years ago

@smatzek thanks for your response. My TF version is 1.10 and my Keras version is '2.1.6-tf'.

First I tried to override the gradient name scope with "training/RMSprop", but it didn't work.

Then I tried to figure out the optimizer scope name. I added the line

```python
print(tf.get_default_graph().as_graph_def())
```

before resnet50.fit_generator() in Keras_ResNet50.py, like this:

```
diff Keras_ResNet50.py Keras_ResNet50b.py
143a144
> print(tf.get_default_graph().as_graph_def())
```

I found the scope name: it is "RMSprop", and it worked.
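For reference, this is a sketch of what the working override looks like in get_callbacks(). Only the keyword arguments are built here, since constructing the real LMSKerasCallback requires tensorflow_large_model_support to be installed; the kwarg names come from the diff earlier in this thread.

```python
# Keyword arguments for LMSKerasCallback with the discovered scope name.
# 'bn_conv1/cond/pred_id' is the starting op already used by Keras_ResNet50.py.
callback_kwargs = dict(
    starting_op_names=['bn_conv1/cond/pred_id'],
    optimizer_scopes_override={'RMSprop'},  # not 'training/RMSprop/gradients' on TF 1.10
)
# With TFLMS installed: lms = LMSKerasCallback(**callback_kwargs)
```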

But I am still confused about the running result. I tried to run without LMS (Python version is 3.4):

`python Keras_ResNet50b.py --image_size 2300`

The result is:

```
Epoch 1/1
2019-01-10 01:45:06.074104: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.85GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
10/10 [==============================] - 35s 3s/step - loss: 14.9390
```

Without LMS at image_size 2350, the result is similar to 2300, but at image_size 2400 an OOM occurred.

Then I ran with LMS enabled:

`python Keras_ResNet50b.py --image_size 2400 --lms`

The result is:

```
2019-01-10 01:50:15.899743: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.85GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
10/10 [==============================] - 36s 4s/step - loss: 13.2160
```

It still shows the warning.

Then I ran with LMS enabled at image_size 2500:

`python Keras_ResNet50b.py --image_size 2500 --lms`

An OOM occurred, and the log shows `0 tensors will be swapped out(in) to(from) the host`.

This happens even though I set the tensor number:

`python Keras_ResNet50b.py --image_size 2500 --lms --n_tensors 20 --lb 30`

```
INFO:tensorflow:[LMS][0] Editing model for LMS
INFO:tensorflow:[LMS][0] n_tensors: 20
INFO:tensorflow:[LMS][0] lb: 30
INFO:tensorflow:[LMS][0] Edited model is valid and logically equivalent to the original one
INFO:tensorflow:[LMS][0] Added 0 ops into the model
INFO:tensorflow:[LMS][0] Editing model for LMS, took: 963.5982513427734 ms
INFO:tensorflow:[LMS][0] 0 tensors will be swapped out(in) to(from) the host
```

From the result, it seems like LMS doesn't help. Why does this happen? Is the model not suitable for LMS?

smatzek commented 5 years ago

I investigated the operation names in TF 1.10 by pip installing it onto my laptop (x86) and getting it to dump the operation names. I saw the same thing you do: the training/... prefix is not there. I then recalled a PR I worked on last year in TensorFlow.

If you are using community builds of TensorFlow 1.10 you will hit this problem. The following PR is needed for the Keras callback to work with LMS on TensorFlow < 1.12. I will push a PR to change the documentation to reflect this.

https://github.com/tensorflow/tensorflow/pull/21244

The PR above is in TF 1.12. So you have several options.

  1. Upgrade to 1.12
  2. Cherry-pick the changes from the PR on top of your TF 1.10.
  3. Since you are running on ppc64le, get IBM PowerAI. IBM PowerAI includes TensorFlow Large Model Support under the tensorflow.contrib.lms package path, and the PR above is included in the TensorFlow levels it contains.

smatzek commented 5 years ago

The `Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.85GiB.` messages you get are warnings, not errors. These are common when you approach 100% GPU memory usage and will occur near that limit whether or not you are using TFLMS.

aohang111 commented 5 years ago

@smatzek from the log, whether I enable LMS or not, no tensors seem to be swapped in or out in this model: `0 tensors will be swapped out(in) to(from) the host`

smatzek commented 5 years ago

Did you take one of the suggested options in my comment above to obtain PR 21244? Did you revert the changes that override the optimizer scopes?

If you run the unmodified example on ppc64le, with TensorFlow 1.12 as included in PowerAI, the output of the command is:

 python examples/Keras_ResNet50.py --epochs 2 --image_size 3900 --lms
2019-01-11 07:07:01.783871: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2019-01-11 07:07:02.219490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
totalMemory: 15.75GiB freeMemory: 15.35GiB
2019-01-11 07:07:02.592721: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2019-01-11 07:07:02.975144: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2019-01-11 07:07:02.975252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-01-11 07:07:04.083472: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-11 07:07:04.083535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1 2 3 
2019-01-11 07:07:04.083548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N Y Y Y 
2019-01-11 07:07:04.083560: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   Y N Y Y 
2019-01-11 07:07:04.083572: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2:   Y Y N Y 
2019-01-11 07:07:04.083583: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3:   Y Y Y N 
2019-01-11 07:07:04.085170: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14845 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0004:04:00.0, compute capability: 7.0)
2019-01-11 07:07:04.085584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14845 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0004:05:00.0, compute capability: 7.0)
2019-01-11 07:07:04.085907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 14845 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0035:03:00.0, compute capability: 7.0)
2019-01-11 07:07:04.086208: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 14848 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0035:04:00.0, compute capability: 7.0)
INFO:tensorflow:[LMS][0] Editing model for LMS
INFO:tensorflow:[LMS][0] n_tensors: all tensors
INFO:tensorflow:[LMS][0] lb: 1
INFO:tensorflow:[LMS][0] Edited model is valid and logically equivalent to the original one
INFO:tensorflow:[LMS][0] Added 909 ops into the model
INFO:tensorflow:[LMS][0] Editing model for LMS, took: 67726.54247283936 ms
INFO:tensorflow:[LMS][0] 427 tensors will be swapped out(in) to(from) the host
Epoch 1/2
10/10 [==============================] - 63s 6s/step - loss: 12.7381
Epoch 2/2
 6/10 [=================>............] - ETA: 11s - loss: 6.18352019-01-11 07:09:42.808173: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 25.00MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
10/10 [==============================] - 29s 3s/step - loss: 6.0284

This should be the same output you get with TF 1.12 from a community build as well.

aohang111 commented 5 years ago

Thanks for your help, I will try your suggestion.

smatzek commented 5 years ago

The documentation was updated to call out a requirement on TensorFlow 1.12, so I'm closing out this issue.

Jingnan-Jia commented 4 years ago

Sorry, I use TensorFlow 1.15 (Ubuntu, with a 1080 Ti GPU with 11 GB of RAM), but I still hit the same issue.

python Keras_ResNet50.py --image_size 2000 --lms

```
...
2019-12-29 14:18:06.258543: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
INFO:tensorflow:[LMS][0] Editing model for LMS
INFO:tensorflow:[LMS][0] n_tensors: all tensors
INFO:tensorflow:[LMS][0] lb: 1
INFO:tensorflow:[LMS][0] No operations were found with optimizer scope RMSprop/gradients.
Traceback (most recent call last):
  File "Keras_ResNet50.py", line 196, in <module>
    run_model(args)
  File "Keras_ResNet50.py", line 142, in run_model
    epochs=args.epochs, callbacks=get_callbacks(args))
  File "/exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 1296, in fit_generator
    steps_name='steps_per_epoch')
  File "/exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_generator.py", line 178, in model_iteration
    mode=mode)
  File "/exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow_core/python/keras/callbacks.py", line 105, in configure_callbacks
    callback_list.set_model(callback_model)
  File "/exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow_core/python/keras/callbacks.py", line 219, in set_model
    callback.set_model(model)
  File "/exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow_large_model_support/lms.py", line 977, in set_model
    lmsMod.run()
  File "/exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow_large_model_support/lms.py", line 312, in run
    seed_ops = self._get_seed_ops()
  File "/exports/lkeb-hpc/jjia/software/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow_large_model_support/lms.py", line 200, in _get_seed_ops
    'name {}.'.format(name))
ValueError: No starting operation was found with name bn_conv1/cond/pred_id.
```

smatzek commented 4 years ago

The operation names created by the model have likely changed between TensorFlow 1.12 / 1.13 and TensorFlow 1.15. That is likely the cause of the issue you are seeing.

However, both TensorFlow 1.14 and 1.15 contain changes to tf.keras and internal optimizers that break TFLMS functionality.

The TensorFlow 1.14 and 1.15 provided by Watson Machine Learning Community Edition (WMLCE) (formerly known as PowerAI), have the necessary fixes and also contain a major version change of the large model support module.

Since you are already using Anaconda and have starred the PowerAI git repository I would suggest installing TensorFlow from Watson Machine Learning Community Edition (a conda install) and installing the tensorflow-large-model-support module as well.

The large model support documentation in WMLCE is here: https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.2/navigation/wmlce_getstarted_tflmsv2.html

Its conda channel is here: https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/

The WMLCE install instructions are here: https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.2/navigation/wmlce_install.html

As far as getting this implementation of TFLMS working with TensorFlow 1.14 or 1.15, it would require applying file patches to an installed TensorFlow.

Jingnan-Jia commented 4 years ago

@smatzek Thanks very much for your response. I am trying your suggestions. By the way, I found that these examples import tensorflow.python.keras instead of tensorflow.keras. It is my first time using tensorflow.python.keras. Can I replace tensorflow.python.keras with tensorflow.keras? Most users recommend using tensorflow.keras instead of tensorflow.python.keras, as in this link.

smatzek commented 4 years ago

@Ordgod You should be able to make that replacement and tensorflow.keras is what is in the official API docs.

The use of .python.keras and other .python.* imports is a historical artifact.
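Since the swap is purely mechanical, a one-line string rewrite over a script's source covers it. This is a sketch of the text transformation only (the helper name is hypothetical, not something from the example repository):

```python
def to_public_keras_imports(source):
    """Rewrite tensorflow.python.keras imports to the public tensorflow.keras path."""
    return source.replace('tensorflow.python.keras', 'tensorflow.keras')

line = 'from tensorflow.python.keras.callbacks import Callback'
print(to_public_keras_imports(line))  # from tensorflow.keras.callbacks import Callback
```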

As you look more at WML CE you may find that the ResNet50 example has been updated to a ManyModel example and the code that was tested with the latest (TensorFlow 1.15) release of WML CE is here: https://github.com/IBM/powerai/blob/wmlce-1.6.2/examples/tensorflow_large_model_support/v2/Keras_ManyModel.py

Jingnan-Jia commented 4 years ago

@smatzek Thank you very much for your response again. I have installed PowerAI and installed tensorflow-gpu with tensorflow-large-model-support from the PowerAI packages. I successfully ran the test examples, including after replacing all tensorflow.python.keras imports with tensorflow.keras, and I successfully ran my own code with the LMS function on.

Update: I solved the OOM by setting mem_ratio=0.5 instead of 0.8.

smatzek commented 4 years ago

@Ordgod I'm glad you were able to get the example and your own code running.

TFLMSv2 (in WMLCE and PowerAI 1.6.0) uses an auto-tuning simulator to attempt to find the optimized tuning values. This simulator depends on accurately calculating your model's tensor sizes. It uses static graph analysis to do this and therefore cannot account for overhead that some of the operations used by your model may trigger. In addition, we have seen some models where the tensor size calculations done by the simulator do not work well, which then leads to cases like yours, where the auto-tuning determines that no LMS swapping nodes are required when in fact they are.

In these cases you can sometimes force the auto-tuning to work by setting the mem ratio lower, as you did. When that doesn't yield results, you need to follow the manual tuning instructions in the WML CE documentation rather than rely on automatic tuning.