Closed: ZDisket closed this issue 4 years ago.
@ZDisket if you use gradient_accumulation_steps: 1, the training behavior is the same as in the old version, so the bug shouldn't be caused by the gradient accumulator.
Here (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/trainers/base_trainer.py#L836) and here (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/trainers/base_trainer.py#L859), try replacing the arguments with:
zip(gradients, self._trainable_variables), 1.0
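In context, the suggestion is to pass 1.0 as the second argument of the apply_gradients calls at those two lines. A sketch of what the edited call would look like (the surrounding call is inferred from the traceback later in this thread; only the argument list above is quoted verbatim):

# base_trainer.py, around the two linked lines (a sketch, not necessarily the
# fix that later landed in master): the trailing 1.0 is presumably the clip_norm
# that the project's AdamWeightDecay.apply_gradients expects, so that
# tf.clip_by_global_norm no longer receives clip_norm=None.
self._optimizer.apply_gradients(
    zip(gradients, self._trainable_variables), 1.0
)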
@dathudeptrai
Then what's the correct value for gradient_accumulation_steps? Also, I'll try that solution.
@dathudeptrai
I tried that replacement; now I'm getting OOM, so I guess that works.
@ZDisket So try setting batch_size: 16 and gradient_accumulation_steps: 8 :D Then you can train with an effective batch size of 128.
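For reference, a sketch of the corresponding lines in the training YAML (key names as they appear in this thread; everything else in the config stays unchanged):

batch_size: 16                    # samples per forward/backward pass
gradient_accumulation_steps: 8    # 8 accumulated steps -> effective batch size 16 * 8 = 128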
@dathudeptrai Another error appears shortly into training:
[train]: 0% 0/150000 [00:00<?, ?it/s]2020-11-25 03:57:00.978225: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1345] No whitelist ops found, nothing to do
2020-11-25 03:57:00.983575: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1345] No whitelist ops found, nothing to do
2020-11-25 03:57:10.944365: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 2036 of 12445
2020-11-25 03:57:20.939860: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 4161 of 12445
2020-11-25 03:57:30.938864: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 6315 of 12445
2020-11-25 03:57:40.941579: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 8393 of 12445
2020-11-25 03:57:50.942838: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 10446 of 12445
2020-11-25 03:58:00.675132: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:221] Shuffle buffer filled.
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/indexed_slices.py:433: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2020-11-25 03:58:31.339181: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1924] Converted 2354/30897 nodes to float16 precision using 224 cast(s) to float16 (excluding Const and Variable casts)
2020-11-25 03:58:39.482413: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1924] Converted 0/23290 nodes to float16 precision using 0 cast(s) to float16 (excluding Const and Variable casts)
[train]: 0% 1/150000 [01:48<4517:25:13, 108.42s/it]2020-11-25 03:58:51.647473: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1924] Converted 1172/11808 nodes to float16 precision using 113 cast(s) to float16 (excluding Const and Variable casts)
2020-11-25 03:58:54.483251: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1924] Converted 0/9872 nodes to float16 precision using 0 cast(s) to float16 (excluding Const and Variable casts)
[train]: 0% 97/150000 [06:43<116:08:00, 2.79s/it]2020-11-25 04:03:45.665215: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at constant_op.cc:185 : Invalid argument: Dimension -2147483648 must be >= 0
Traceback (most recent call last):
File "/content/TensorflowTTS/ttsexamples/fastspeech2/train_fastspeech2.py", line 436, in <module>
main()
File "/content/TensorflowTTS/ttsexamples/fastspeech2/train_fastspeech2.py", line 428, in main
resume=args.resume,
File "/content/TensorflowTTS/tensorflow_tts/trainers/base_trainer.py", line 1002, in fit
self.run()
File "/content/TensorflowTTS/tensorflow_tts/trainers/base_trainer.py", line 103, in run
self._train_epoch()
File "/content/TensorflowTTS/tensorflow_tts/trainers/base_trainer.py", line 125, in _train_epoch
self._train_step(batch)
File "/content/TensorflowTTS/tensorflow_tts/trainers/base_trainer.py", line 780, in _train_step
self.one_step_forward(batch)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
result = self._call(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 807, in _call
return self._stateless_fn(*args, **kwds) # pylint: disable=not-callable
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2829, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
cancellation_manager=cancellation_manager)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 550, in call
ctx=ctx)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Dimension -2147483648 must be >= 0
[[{{node while/body/_1/while/tf_fast_speech2_1/length_regulator/zeros_1}}]]
[[Func/while/body/_1/output_control_node/_2498/_503]]
(1) Invalid argument: Dimension -2147483648 must be >= 0
[[{{node while/body/_1/while/tf_fast_speech2_1/length_regulator/zeros_1}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference__one_step_forward_46981]
Function call stack:
_one_step_forward -> _one_step_forward
[train]: 0% 97/150000 [06:45<174:05:03, 4.18s/it]
@ZDisket did you pull the newest code from master? It seems the bug comes from the dataloader.
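As an aside (not part of the original exchange): -2147483648 is exactly the smallest signed 32-bit integer, so an overflowed or otherwise invalid length reaching the length regulator would show up as this dimension error, which is consistent with the bug coming from the data pipeline. A one-liner to confirm the constant:

import tensorflow as tf

# The "Dimension -2147483648" in the traceback above equals tf.int32.min,
# i.e. an overflowed/invalid int32 length value.
print(tf.int32.min)  # -2147483648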
@dathudeptrai Hi, I have the same problem.
What should I do?
[train]: 0%| | 0/200000 [00:00<?, ?it/s]2020-11-25 05:12:33.624339: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1345] No whitelist ops found, nothing to do
2020-11-25 05:12:33.636275: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1345] No whitelist ops found, nothing to do
2020-11-25 05:12:43.546755: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 796 of 12209
2020-11-25 05:12:53.534750: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 1609 of 12209
2020-11-25 05:13:03.578700: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 2442 of 12209
2020-11-25 05:13:13.575282: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 3256 of 12209
2020-11-25 05:13:23.532050: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 4079 of 12209
2020-11-25 05:13:33.536464: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 4887 of 12209
2020-11-25 05:13:43.601250: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 5668 of 12209
2020-11-25 05:13:53.540969: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 6495 of 12209
2020-11-25 05:14:03.559303: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 7341 of 12209
2020-11-25 05:14:13.577042: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 8170 of 12209
2020-11-25 05:14:23.530453: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 9012 of 12209
2020-11-25 05:14:33.614216: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 9779 of 12209
2020-11-25 05:14:43.539679: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 10597 of 12209
2020-11-25 05:14:53.548933: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 11419 of 12209
2020-11-25 05:15:03.032730: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:221] Shuffle buffer filled.
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/framework/indexed_slices.py:432: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
warnings.warn(
Traceback (most recent call last):
File "examples/fastspeech2/train_fastspeech2.py", line 417, in
/root/anaconda3/lib/python3.8/site-packages/tensorflow_tts/trainers/base_trainer.py:788 _one_step_forward *
per_replica_losses = self._strategy.run(
/root/anaconda3/lib/python3.8/site-packages/tensorflow_tts/trainers/base_trainer.py:835 _one_step_forward_per_replica *
self._optimizer.apply_gradients(
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/keras/mixed_precision/experimental/loss_scale_optimizer.py:378 apply_gradients **
return distribution_strategy_context.get_replica_context().merge_call(
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py:2715 merge_call
return self._merge_call(merge_fn, args, kwargs)
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py:2722 _merge_call
return merge_fn(self._strategy, *args, **kwargs)
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/keras/mixed_precision/experimental/loss_scale_optimizer.py:408 _apply_gradients_cross_replica **
maybe_apply_op = smart_cond.smart_cond(should_apply_grads,
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/framework/smart_cond.py:58 smart_cond
return control_flow_ops.cond(pred, true_fn=true_fn, false_fn=false_fn,
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:201 wrapper
return target(*args, **kwargs)
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/util/deprecation.py:507 new_func
return func(*args, **kwargs)
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/ops/control_flow_ops.py:1180 cond
return cond_v2.cond_v2(pred, true_fn, false_fn, name)
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/ops/cond_v2.py:79 cond_v2
true_graph = func_graph_module.func_graph_from_py_func(
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py:986 func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/keras/mixed_precision/experimental/loss_scale_optimizer.py:394 apply_fn
return distribution.extended.call_for_each_replica(
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py:2585 call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/distribute/one_device_strategy.py:367 _call_for_each_replica
return fn(*args, **kwargs)
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/keras/mixed_precision/experimental/loss_scale_optimizer.py:418 _apply_gradients
return self._optimizer.apply_gradients(
/root/anaconda3/lib/python3.8/site-packages/tensorflow_tts/optimizers/adamweightdecay.py:124 apply_gradients
(grads, _) = tf.clip_by_global_norm(grads, clip_norm=clip_norm)
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:201 wrapper
return target(*args, **kwargs)
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/ops/clip_ops.py:352 clip_by_global_norm
constant_op.constant(1.0, dtype=use_norm.dtype) / clip_norm)
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/ops/math_ops.py:1124 binary_op_wrapper
return func(x, y, name=name)
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:201 wrapper
return target(*args, **kwargs)
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/ops/math_ops.py:1296 truediv
return _truediv_python3(x, y, name)
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/ops/math_ops.py:1222 _truediv_python3
y = ops.convert_to_tensor(y, dtype_hint=x.dtype.base_dtype, name="y")
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/framework/ops.py:1499 convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py:338 _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py:263 constant
return _constant_impl(value, dtype, shape, name, verify_shape=False,
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py:280 _constant_impl
tensor_util.make_tensor_proto(
/root/anaconda3/lib/python3.8/site-packages/tensorflow/python/framework/tensor_util.py:444 make_tensor_proto
raise ValueError("None values not supported.")
ValueError: None values not supported.
[train]: 0%| | 0/200000 [02:40<?, ?it/s]
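For readers hitting the same traceback: the call chain ends in tf.clip_by_global_norm dividing a constant by clip_norm, so "None values not supported" means clip_norm arrived as None, i.e. apply_gradients was called without it. A minimal standalone illustration (not the project's code):

import tensorflow as tf

grads = [tf.constant([0.1, 0.2]), tf.constant([0.3])]

# With a real clip_norm the call works fine:
clipped, global_norm = tf.clip_by_global_norm(grads, clip_norm=1.0)

# With clip_norm=None, the internal division 1.0 / clip_norm fails while
# converting None to a tensor, producing the ValueError shown above:
# tf.clip_by_global_norm(grads, clip_norm=None)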
@Zegalryang pull newest code :)).
@dathudeptrai I already pulled the newest code, commit ea72bab6cef40dff68b0b619ecf0d7c9cce3e3f0.
@Zegalryang Pull and then run pip install -e . again.
The newest code fixes your problem.
@dathudeptrai
I pulled the newest code from master, and now it works well.
@dathudeptrai it works!! thanks!!
@dathudeptrai Training with the gradient accumulator at an effective batch_size of 128 is slow, about 2.7 s/it, on a GPU that would normally get 2.9 it/s. Is this normal?
@ZDisket Normally we train with batch_size 16, where you get about 3 it/s; now each step accumulates 8 batches of 16 (effective batch size 128), so about 2.7 s/it is normal (8 micro-batches at roughly 0.33 s each is about 2.7 s per step).
I tried training FastSpeech2 on LJSpeech resampled to 24 kHz with
gradient_accumulation_steps: 1
and batch size 128 with mixed precision on a Tesla T4 (14 GB of VRAM), and got this:

Any ideas?