kelayamatoz opened this issue 7 years ago
It seems that the ExponentialMovingAverage doesn't help with the training process. Disabling the related lines would solve this issue.
After resolving this issue, I also discovered a similar variable-reuse issue with the Adam optimizer. It seems that there is an implicit global variable scope that forces all variables, including the ones created by the optimizer, to be reusable. Adding an explicit variable scope declaration before the for-loop that creates the per-GPU models would solve the issue.
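To see why the per-GPU loop trips over reuse, here is a toy model of how TF1's `get_variable` behaves inside a variable scope. This is not real TensorFlow code, just a dict-based sketch of the semantics: a name cannot be created twice unless the scope is marked for reuse, and cannot be reused when it is not. The variable names below echo the error in this thread but are purely illustrative.

```python
# Toy sketch of TF1 variable_scope semantics (NOT real TensorFlow):
# get_variable() refuses to create a name that already exists unless
# reuse is on, and refuses to look up a missing name when reuse is on.

class VarStore:
    def __init__(self):
        self.vars = {}

    def get_variable(self, name, reuse=False):
        if name in self.vars:
            if not reuse:
                raise ValueError(
                    f"Variable {name} already exists; set reuse=True to share it")
            return self.vars[name]
        if reuse:
            raise ValueError(
                f"Variable {name} does not exist; did you mean to set reuse=None?")
        self.vars[name] = object()
        return self.vars[name]

store = VarStore()

# Without a per-GPU scope, the second tower's EMA/optimizer slot
# variables collide with the first tower's (names are illustrative):
store.get_variable("loss/ExponentialMovingAverage")
try:
    store.get_variable("loss/ExponentialMovingAverage")
except ValueError as e:
    print(e)  # the same class of error the multi-GPU loop produces

# Prefixing each tower's variables with a unique scope avoids the clash:
for gpu_idx in range(2):
    store.get_variable(f"model_{gpu_idx}/loss/ExponentialMovingAverage")
```

The fix described above amounts to making the second `get_variable` call land on a fresh, tower-specific name instead of the already-created one.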
Hello: I'm hitting the same problem, but I don't understand your solution. Which lines did you disable to solve this issue? Could you give more details? Thanks a lot!
Hey @distantJing:
Check this one:
https://github.com/kelayamatoz/bi-att-flow-lstm-extractor/blob/master/basic/model.py#L25-L36
Essentially you need to put the exponential smoothing variables into a different scope and make sure that each GPU gets a unique set of loss variables.
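The sharing pattern described here can be sketched with the same dict-based stand-in for variable scopes (again, not real TensorFlow, and the variable names are illustrative rather than the repo's actual names): the model weights are created on the first tower and shared by later towers, while each tower's loss and EMA shadow variables live under a unique scope prefix and so never collide.

```python
# Sketch of the per-tower sharing pattern, with a plain dict standing in
# for TF1 variable scopes. Names are illustrative, not the repo's.

variables = {}

def get_or_create(name, share=False):
    """Create `name`, or return the existing entry when sharing is intended."""
    if name in variables and not share:
        raise ValueError(f"duplicate variable: {name}")
    return variables.setdefault(name, object())

num_gpus = 2
for gpu_idx in range(num_gpus):
    first_tower = gpu_idx == 0
    # LSTM weights: created on tower 0, deliberately reused on later towers.
    get_or_create("prepro/u1/fw/basic_lstm_cell/weights", share=not first_tower)
    # Loss and its EMA shadow variable: unique per tower, so no reuse needed.
    get_or_create(f"model_{gpu_idx}/loss")
    get_or_create(f"model_{gpu_idx}/loss/ExponentialMovingAverage")
```

The loop runs cleanly for any number of towers because only the weights are ever looked up twice, and that lookup is explicitly marked as sharing.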
Dear Team,
I tried to train the model with 2 gpus using the following line:
/opt/python3/bin/python3 -m basic.cli --mode train --noload --debug --len_opt --batch_size 20 --num_gpus 2
Then I got an error:
ValueError: Attempt to have a second RNNCell use the weights of a variable scope that already has weights: 'prepro/u1/fw/basic_lstm_cell'; and the cell was not constructed as BasicLSTMCell(..., reuse=True). To share the weights of an RNNCell, simply reuse it in your second calculation, or create a new one with the argument reuse=True.
I know it has something to do with multi-GPU training, but I don't know how to revise the code.
Thanks a lot!
@kelayamatoz I found your repository https://github.com/kelayamatoz/BiDAF-MultiGPU-Fix and ran it on multiple GPUs. Thanks a lot!
@chiahsuan156 I ran the repository you referred to, https://github.com/kelayamatoz/BiDAF-MultiGPU-Fix, but I hit the same error as @dengyuning. Have you resolved it?
Dear Team,
I am running training with 2 Nvidia K80 GPUs, on the dev branch with TF 1.2.0 and Python 3.6.2, using the following line:
python3 -m basic.cli --mode train --noload --num_gpus 2 --batch_size 30
However, the program quits with the errors attached. We are a bit confused about how to track down what's causing the error, and we are wondering if we could get some help?
Here starts the log of the error:
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/dawn-benchmark/tensorflow-qa-orig/bi-att-flow/basic/cli.py", line 112, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/dawn-benchmark/tensorflow-qa-orig/bi-att-flow/basic/cli.py", line 109, in main
m(config)
File "/home/dawn-benchmark/tensorflow-qa-orig/bi-att-flow/basic/main.py", line 24, in main
_train(config)
File "/home/dawn-benchmark/tensorflow-qa-orig/bi-att-flow/basic/main.py", line 83, in _train
models = get_multi_gpu_models(config)
File "/home/dawn-benchmark/tensorflow-qa-orig/bi-att-flow/basic/model.py", line 21, in get_multi_gpu_models
model = Model(config, scope, rep=gpu_idx == 0)
File "/home/dawn-benchmark/tensorflow-qa-orig/bi-att-flow/basic/model.py", line 68, in __init__
self._build_ema()
File "/home/dawn-benchmark/tensorflow-qa-orig/bi-att-flow/basic/model.py", line 298, in _build_ema
ema_op = ema.apply(tensors)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/moving_averages.py", line 375, in apply
colocate_with_primary=(var.op.type in ["Variable", "VariableV2"]))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/slot_creator.py", line 174, in create_zeros_slot
colocate_with_primary=colocate_with_primary)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/slot_creator.py", line 149, in create_slot_with_initializer
dtype)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/slot_creator.py", line 66, in _create_slot_var
validate_shape=validate_shape)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 1065, in get_variable
use_resource=use_resource, custom_getter=custom_getter)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 962, in get_variable
use_resource=use_resource, custom_getter=custom_getter)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 367, in get_variable
validate_shape=validate_shape, use_resource=use_resource)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 352, in _true_getter
use_resource=use_resource)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 682, in _get_single_variable
"VarScope?" % name)
ValueError: Variable model_1/loss/ExponentialMovingAverage/ does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?