kelayamatoz opened this issue 7 years ago
It seems that the ExponentialMovingAverage doesn't help with the training process. Disabling the related lines would solve this issue.
After resolving this issue, I also discovered a similar variable-reuse issue with the Adam optimizer. It seems that there is an implicit global variable scope that forces all variables, including the ones created by the optimizer, to be reusable. Adding an explicit variable scope declaration before the for-loop that creates the per-GPU models would solve the issue.
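To see why the per-GPU loop trips over reuse, here is a toy model of how TF1's `get_variable` behaves inside a variable scope. This is not real TensorFlow code, just a dict-based sketch of the semantics: a name cannot be created twice unless the scope is marked for reuse, and cannot be reused when it is not. The variable names below echo the error in this thread but are purely illustrative.

```python
# Toy sketch of TF1 variable_scope semantics (NOT real TensorFlow):
# get_variable() refuses to create a name that already exists unless
# reuse is on, and refuses to look up a missing name when reuse is on.

class VarStore:
    def __init__(self):
        self.vars = {}

    def get_variable(self, name, reuse=False):
        if name in self.vars:
            if not reuse:
                raise ValueError(
                    f"Variable {name} already exists; set reuse=True to share it")
            return self.vars[name]
        if reuse:
            raise ValueError(
                f"Variable {name} does not exist; did you mean to set reuse=None?")
        self.vars[name] = object()
        return self.vars[name]

store = VarStore()

# Without a per-GPU scope, the second tower's EMA/optimizer slot
# variables collide with the first tower's (names are illustrative):
store.get_variable("loss/ExponentialMovingAverage")
try:
    store.get_variable("loss/ExponentialMovingAverage")
except ValueError as e:
    print(e)  # the same class of error the multi-GPU loop produces

# Prefixing each tower's variables with a unique scope avoids the clash:
for gpu_idx in range(2):
    store.get_variable(f"model_{gpu_idx}/loss/ExponentialMovingAverage")
```

The fix described above amounts to making the second `get_variable` call land on a fresh, tower-specific name instead of the already-created one.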
Hello: I'm hitting the same problem, but I don't understand your solution. Which lines did you disable to solve this issue? Could you give more details? Thanks a lot!
Hey @distantJing:
Check this one:
https://github.com/kelayamatoz/bi-att-flow-lstm-extractor/blob/master/basic/model.py#L25-L36
Essentially you need to put the exponential smoothing variables into a different scope and make sure that each GPU gets a unique set of loss variables.
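The sharing pattern described here can be sketched with the same dict-based stand-in for variable scopes (again, not real TensorFlow, and the variable names are illustrative rather than the repo's actual names): the model weights are created on the first tower and shared by later towers, while each tower's loss and EMA shadow variables live under a unique scope prefix and so never collide.

```python
# Sketch of the per-tower sharing pattern, with a plain dict standing in
# for TF1 variable scopes. Names are illustrative, not the repo's.

variables = {}

def get_or_create(name, share=False):
    """Create `name`, or return the existing entry when sharing is intended."""
    if name in variables and not share:
        raise ValueError(f"duplicate variable: {name}")
    return variables.setdefault(name, object())

num_gpus = 2
for gpu_idx in range(num_gpus):
    first_tower = gpu_idx == 0
    # LSTM weights: created on tower 0, deliberately reused on later towers.
    get_or_create("prepro/u1/fw/basic_lstm_cell/weights", share=not first_tower)
    # Loss and its EMA shadow variable: unique per tower, so no reuse needed.
    get_or_create(f"model_{gpu_idx}/loss")
    get_or_create(f"model_{gpu_idx}/loss/ExponentialMovingAverage")
```

The loop runs cleanly for any number of towers because only the weights are ever looked up twice, and that lookup is explicitly marked as sharing.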
Dear Team,
I tried to train the model with 2 gpus using the following line:
/opt/python3/bin/python3 -m basic.cli --mode train --noload --debug --len_opt --batch_size 20 --num_gpus 2
Then I got an error:
ValueError: Attempt to have a second RNNCell use the weights of a variable scope that already has weights: 'prepro/u1/fw/basic_lstm_cell'; and the cell was not constructed as BasicLSTMCell(..., reuse=True). To share the weights of an RNNCell, simply reuse it in your second calculation, or create a new one with the argument reuse=True.
I know it has something to do with multi-GPU training, but I don't know how to revise the code.
Thanks a lot!
@kelayamatoz I found your repository https://github.com/kelayamatoz/BiDAF-MultiGPU-Fix and ran it on multiple GPUs. Thanks a lot!
@chiahsuan156 I ran the repository you referred to, https://github.com/kelayamatoz/BiDAF-MultiGPU-Fix, but I hit the same error as @dengyuning. Have you resolved it?
Dear Team,
I am running training with 2 Nvidia K80 GPUs, on the dev branch with TF 1.2.0 and Python 3.6.2, using the following line:
python3 -m basic.cli --mode train --noload --num_gpus 2 --batch_size 30
However, the program quits with the errors attached. We are a bit confused about how to track down what's causing the error, and we are wondering if we could get some help?
Here starts the log of the error:
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/dawn-benchmark/tensorflow-qa-orig/bi-att-flow/basic/cli.py", line 112, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/dawn-benchmark/tensorflow-qa-orig/bi-att-flow/basic/cli.py", line 109, in main
m(config)
File "/home/dawn-benchmark/tensorflow-qa-orig/bi-att-flow/basic/main.py", line 24, in main
_train(config)
File "/home/dawn-benchmark/tensorflow-qa-orig/bi-att-flow/basic/main.py", line 83, in _train
models = get_multi_gpu_models(config)
File "/home/dawn-benchmark/tensorflow-qa-orig/bi-att-flow/basic/model.py", line 21, in get_multi_gpu_models
model = Model(config, scope, rep=gpu_idx == 0)
File "/home/dawn-benchmark/tensorflow-qa-orig/bi-att-flow/basic/model.py", line 68, in __init__
self._build_ema()
File "/home/dawn-benchmark/tensorflow-qa-orig/bi-att-flow/basic/model.py", line 298, in _build_ema
ema_op = ema.apply(tensors)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/moving_averages.py", line 375, in apply
colocate_with_primary=(var.op.type in ["Variable", "VariableV2"]))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/slot_creator.py", line 174, in create_zeros_slot
colocate_with_primary=colocate_with_primary)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/slot_creator.py", line 149, in create_slot_with_initializer
dtype)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/slot_creator.py", line 66, in _create_slot_var
validate_shape=validate_shape)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 1065, in get_variable
use_resource=use_resource, custom_getter=custom_getter)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 962, in get_variable
use_resource=use_resource, custom_getter=custom_getter)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 367, in get_variable
validate_shape=validate_shape, use_resource=use_resource)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 352, in _true_getter
use_resource=use_resource)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 682, in _get_single_variable
"VarScope?" % name)
ValueError: Variable model_1/loss/ExponentialMovingAverage/ does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?