Closed luis-vera closed 3 years ago
@luis-vera can you share your training command line? Which GPU are you using?
Hi, I used the following training command:
python examples/tacotron2/train_tacotron2.py --train-dir dump_ljspeech/train/ --dev-dir dump_ljspeech/valid/ --outdir examples/tacotron2/exp/train.tacotron2.v1 --config examples/tacotron2/conf/tacotron2.v1.yaml --use-norm 1 --mixed_precision 0 --resume ""
My PC has an NVIDIA GeForce RTX 2080 Ti Rev. A.
It has two GPUs of the same type, joined by a bridge.
Did you change the batch size since you're using 2 GPUs?
If you want to use multi-GPU training, you can replace CUDA_VISIBLE_DEVICES=0 with CUDA_VISIBLE_DEVICES=0,1,2,3, for example. You also need to tune the batch_size for each GPU (in the config file) yourself to maximize performance. Note that multi-GPU is currently supported for training but not yet for decoding.
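As a rough illustration of the note above, the sketch below parses CUDA_VISIBLE_DEVICES and estimates how many samples one training step consumes when every visible GPU receives its own batch. The helper names `visible_gpu_ids` and `effective_batch_size` are hypothetical and not part of TensorFlowTTS, and the arithmetic assumes data-parallel training with batch_size applied per GPU, as the config comment states.

```python
import os


def visible_gpu_ids(env=None):
    """Return the GPU ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # variable unset: no restriction is applied
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def effective_batch_size(per_gpu_batch_size, gpu_ids):
    """Samples consumed per step if each visible GPU gets its own batch (assumption)."""
    return per_gpu_batch_size * max(len(gpu_ids), 1)


# Example: two GPUs visible, batch_size: 16 in the config file.
ids = visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "0,1"})
print(ids, effective_batch_size(16, ids))
```

The variable has to be set before TensorFlow initializes the GPUs, which is why it is usually given on the command line in front of the training command.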
I'm not sure if it's the same for Tacotron2, but for some other algorithms, you need to tune the batch_size relative to how many GPUs you have.
I vaguely remember a ballpark figure mentioned, that if you have 1 GPU, leave it at the default. If you have 2 GPUs, halve the batch_size. If you have 4 GPUs, quarter the batch_size.
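The ballpark rule above amounts to dividing the single-GPU batch size by the number of GPUs. A minimal sketch, assuming the default of 32 mentioned later in this thread; `per_gpu_batch_size` is a hypothetical helper, and the rule itself is only the rule of thumb quoted above, not a documented TensorFlowTTS recommendation.

```python
def per_gpu_batch_size(single_gpu_batch_size, num_gpus):
    """Divide the single-GPU batch size across GPUs, never going below 1."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return max(1, single_gpu_batch_size // num_gpus)


# Rule of thumb from the thread: 1 GPU keeps the default, 2 GPUs halve it, 4 quarter it.
for n in (1, 2, 4):
    print(n, per_gpu_batch_size(32, n))
```

This keeps the total number of samples per step roughly constant; it does not by itself guarantee that a batch fits in the memory of one GPU.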
Hi. I changed batch_size from 32 to 16, and then to 8. However, in both cases the error message was the same and training didn't start.
Model: "tacotron2"
encoder (TFTacotronEncoder) multiple 8218624
decoder_cell (TFTacotronDeco multiple 18246402
post_net (TFTacotronPostnet) multiple 5460480
Total params: 31,966,546 Trainable params: 31,956,306 Non-trainable params: 10,240
[train]: 0%| | 0/200000 [00:00<?, ?it/s]2020-12-11 09:16:50.392068: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:906] Skipping loop optimization for Merge node with control input: cond/branch_executed/_8
Traceback (most recent call last):
File "examples/tacotron2/train_tacotron2.py", line 503, in
Errors may have originated from an input operation. Input Source operations connected to node mul: tacotron2/transpose (defined at C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow_tts\models\tacotron2.py:882) batch (defined at examples/tacotron2/train_tacotron2.py:108)
Input Source operations connected to node mul: tacotron2/transpose (defined at C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow_tts\models\tacotron2.py:882) batch (defined at examples/tacotron2/train_tacotron2.py:108)
Function call stack: _one_step_forward -> _one_step_forward
[train]: 0%| | 0/200000 [00:27<?, ?it/s]
Did you pull the newest code? It should fix the OOM error.
I modified examples/tacotron2/conf/tacotron2.v1.yaml only here:
###########################################################
###########################################################
batch_size: 8 # Batch size for each GPU with assuming that gradient_accumulation_steps == 1.
remove_short_samples: true # Whether to remove samples the length of which are less than batch_max_steps.
allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory.
mel_length_threshold: 32 # remove all targets has mel_length <= 32
is_shuffle: true # shuffle dataset after each epoch.
use_fixed_shapes: true # use_fixed_shapes for training (2x speed-up)
@luis-vera can you try to add this code below https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/tacotron2/tacotron_dataset.py#L195-L197:
datasets = datasets.filter(
lambda x: x["mel_lengths"] <= 800
)
Do not forget to pull the newest code.
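To show what the suggested filter does, here is a plain-Python sketch on toy samples, without TensorFlow. The sample dictionaries and the helper `filter_by_mel_length` are hypothetical; only the key "mel_lengths" and the threshold 800 come from the snippet above. It assumes batches are padded to their longest sample, so dropping very long utterances lowers the peak size of a batch.

```python
def filter_by_mel_length(samples, max_mel_length=800):
    """Keep only samples whose mel-spectrogram length is at most max_mel_length."""
    return [s for s in samples if s["mel_lengths"] <= max_mel_length]


# Toy data: three utterances with different numbers of mel frames.
samples = [
    {"utt_id": "a", "mel_lengths": 420},
    {"utt_id": "b", "mel_lengths": 795},
    {"utt_id": "c", "mel_lengths": 1310},
]

kept = filter_by_mel_length(samples)
longest_before = max(s["mel_lengths"] for s in samples)
longest_after = max(s["mel_lengths"] for s in kept)

# With padding to the longest sample, the worst-case batch shrinks from
# longest_before frames per utterance to longest_after frames.
print([s["utt_id"] for s in kept], longest_before, longest_after)
```

The cost of the filter is that utterances longer than the threshold are never seen during training.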
Hi, I tried using batch_size=16 in tacotron2.v1 and added the code above to tacotron_dataset. The result was:
Model: "tacotron2"
encoder (TFTacotronEncoder) multiple 8218624
decoder_cell (TFTacotronDeco multiple 18246402
post_net (TFTacotronPostnet) multiple 5460480
Total params: 31,966,546 Trainable params: 31,956,306 Non-trainable params: 10,240
[train]: 0%| | 0/200000 [00:00<?, ?it/s]2020-12-11 10:48:41.086001: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:906] Skipping loop optimization for Merge node with control input: cond/branch_executed/_8
Traceback (most recent call last):
File "examples/tacotron2/train_tacotron2.py", line 503, in
Errors may have originated from an input operation. Input Source operations connected to node mul: tacotron2/transpose (defined at C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow_tts\models\tacotron2.py:882) batch (defined at examples/tacotron2/train_tacotron2.py:108)
Function call stack: _one_step_forward
[train]: 0%| | 0/200000 [00:26<?, ?it/s]
So far, I am training the same model on another machine, running Windows 10 without a GPU, and I haven't had any problems. At the moment it is at 5616/200000 steps. But I would prefer to run it on the GPU machine because training on the CPU is slow. Thanks
I have a question. I am new in this area so I don't understand when you refer about "pull the code". Thanks a lot
Have you tried it on an Ubuntu 18.04 machine? Technically, this repo is not tested on Windows. I had issues trying to get FastSpeech2 training to run on Windows 10, and it then worked fine on Ubuntu.
Alternatively, if you can't get access to an Ubuntu machine and you're on the Windows 10 Insider builds (which are required for WSL/Docker GPU passthrough), you could try the TensorFlowTTS Docker build.
The Docker files can be found here; the same Docker image will work for the other scripts in the repo (even though this folder is in the fastspeech2_libritts example).
That means pulling the newest code from GitHub.
Go into your Windows CMD.exe, move into the TensorFlowTTS folder, then run git pull. This will grab the latest code for the project.
Hi everybody. I wanted to let you know that the OOM error was solved, but now I haven't been able to solve this new error message. I'll try Docker today as a second option.
encoder (TFTacotronEncoder) multiple 8218624
decoder_cell (TFTacotronDeco multiple 18246402
post_net (TFTacotronPostnet) multiple 5460480
Total params: 31,966,546 Trainable params: 31,956,306 Non-trainable params: 10,240
2020-12-14 23:44:50.005917: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:656] In AUTO-mode, and switching to DATA-based sharding, instead of FILE-based sharding as we cannot find appropriate reader dataset op(s) to shard. Error: Did not find a shardable source, walked to a node which is not a dataset: name: "FlatMapDataset/_3"
op: "FlatMapDataset"
input: "TensorDataset/_1"
input: "Const/_2"
attr {
key: "Targuments"
value {
list {
type: DT_STRING
}
}
}
attr {
key: "f"
value {
func {
name: "inference_Dataset_flat_map_flat_map_fn_29"
}
}
}
attr {
key: "output_shapes"
value {
list {
shape {
unknown_rank: true
}
shape {
unknown_rank: true
}
shape {
unknown_rank: true
}
}
}
}
attr {
key: "output_types"
value {
list {
type: DT_STRING
type: DT_STRING
type: DT_STRING
}
}
}
. Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new tf.data.Options() object then setting options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA before applying the options object to the dataset via dataset.with_options(options).
2020-12-14 23:44:50.066821: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:656] In AUTO-mode, and switching to DATA-based sharding, instead of FILE-based sharding as we cannot find appropriate reader dataset op(s) to shard. Error: Did not find a shardable source, walked to a node which is not a dataset: name: "FlatMapDataset/_3"
op: "FlatMapDataset"
input: "TensorDataset/_1"
input: "Const/_2"
attr {
key: "Targuments"
value {
list {
type: DT_STRING
}
}
}
attr {
key: "f"
value {
func {
name: "inference_Dataset_flat_map_flat_map_fn_293"
}
}
}
attr {
key: "output_shapes"
value {
list {
shape {
unknown_rank: true
}
shape {
unknown_rank: true
}
shape {
unknown_rank: true
}
}
}
}
attr {
key: "output_types"
value {
list {
type: DT_STRING
type: DT_STRING
type: DT_STRING
}
}
}
. Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new tf.data.Options() object then setting options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA before applying the options object to the dataset via dataset.with_options(options).
[train]: 0%| | 0/200000 [00:00<?, ?it/s]2020-12-14 23:44:50.181980: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2020-12-14 23:44:50.260349: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:906] Skipping loop optimization for Merge node with control input: cond/branch_executed/_8
Traceback (most recent call last):
File "examples/tacotron2/train_tacotron2.py", line 503, in
Function call stack: _one_step_forward -> _one_step_forward -> _one_step_forward
[train]: 0%| | 0/200000 [00:23<?, ?it/s]
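The auto-shard warnings in the log above describe their own remedy: build a tf.data.Options object, set the shard policy to DATA, and apply it to the dataset. A minimal sketch of that step follows, assuming TensorFlow 2.x is installed. The helper `with_data_sharding` and the toy dataset are hypothetical, and where exactly the call would go in the TensorFlowTTS training script is not shown in this thread. Note that this is only a warning; silencing it is not known to address the traceback that follows it.

```python
import tensorflow as tf


def with_data_sharding(dataset):
    """Return the dataset with auto_shard_policy set to DATA, as the warning suggests."""
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = (
        tf.data.experimental.AutoShardPolicy.DATA
    )
    return dataset.with_options(options)


# Toy dataset standing in for the real training dataset.
toy = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4])
sharded = with_data_sharding(toy)
print(list(sharded.as_numpy_iterator()))
```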
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Hi. I am training on a dataset using Tacotron2, but training doesn't start because of an OOM error. I updated tacotron_dataset.py as you suggested to another user, but the error is the same:
2020-12-08 15:56:49.608579: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] Stats: Limit: 8999199884 InUse: 8989584640 MaxInUse: 8998987264 NumAllocs: 82683 MaxAllocSize: 627179520 Reserved: 0 PeakReserved: 0 LargestFreeBlock: 0
2020-12-08 15:56:49.619116: W tensorflow/core/common_runtime/bfc_allocator.cc:439] ****
2020-12-08 15:56:49.623677: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at cwise_ops_common.cc:82 : Resource exhausted: OOM when allocating tensor with shape[32,193,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "examples/tacotron2/train_tacotron2.py", line 503, in
main()
File "examples/tacotron2/train_tacotron2.py", line 491, in main
trainer.fit(
File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow_tts\trainers\base_trainer.py", line 999, in fit
self.run()
File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow_tts\trainers\base_trainer.py", line 103, in run
self._train_epoch()
File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow_tts\trainers\base_trainer.py", line 125, in _train_epoch
self._train_step(batch)
File "examples/tacotron2/train_tacotron2.py", line 108, in _train_step
self.one_step_forward(batch)
File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\def_function.py", line 780, in __call__
result = self._call(*args, **kwds)
File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\def_function.py", line 840, in _call
return self._stateless_fn(*args, **kwds)
File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\function.py", line 2829, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\function.py", line 1843, in _filtered_call
return self._call_flat(
File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\function.py", line 1923, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\function.py", line 545, in call
outputs = execute.execute(
File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[32,193,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node tacotron2/decoder/while/body/_1/tacotron2/decoder/while/decoder_cell/mul_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[32,193,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node tacotron2/decoder/while/body/_1/tacotron2/decoder/while/decoder_cell/mul_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations. 0 derived errors ignored. [Op:__inference__one_step_forward_21637]
Function call stack: _one_step_forward -> _one_step_forward
[train]: 0%| | 0/200000 [02:47<?, ?it/s]
Thanks a lot for your help. Luis Vera
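The numbers in the log above can be checked by hand. The sketch below computes the size of the float32 tensor of shape [32,193,512] that failed to allocate and compares it with the headroom implied by the allocator stats (Limit minus InUse). The helper `tensor_bytes` is hypothetical, and 4 bytes per element is assumed because the log reports type float.

```python
from math import prod


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; 4 bytes per element assumes float32."""
    return prod(shape) * bytes_per_element


limit = 8999199884   # "Limit" from the bfc_allocator stats line
in_use = 8989584640  # "InUse" from the same line

needed = tensor_bytes((32, 193, 512))
free = limit - in_use

# needed is 12,648,448 bytes and free is 9,615,244 bytes, so the request cannot fit.
print(needed, free, needed > free)
```

The failing tensor is only about 12 MB, so the problem is not this allocation in particular: the roughly 9 GB budget was already almost entirely used by earlier allocations, which is why lowering the memory used per step (smaller batches or shorter sequences) was suggested in the replies.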