TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for Tensorflow 2 (supported including English, French, Korean, Chinese, German and Easy to adapt for other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

OOM error while training tacotron2 #420

Closed luis-vera closed 3 years ago

luis-vera commented 3 years ago

Hi. I am training on a dataset using Tacotron2, but training doesn't start because of an OOM error. I updated tacotron_dataset.py as you advised another user, but the error is the same:

2020-12-08 15:56:49.608579: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] Stats: Limit: 8999199884 InUse: 8989584640 MaxInUse: 8998987264 NumAllocs: 82683 MaxAllocSize: 627179520 Reserved: 0 PeakReserved: 0 LargestFreeBlock: 0

2020-12-08 15:56:49.619116: W tensorflow/core/common_runtime/bfc_allocator.cc:439] ****
2020-12-08 15:56:49.623677: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at cwise_ops_common.cc:82 : Resource exhausted: OOM when allocating tensor with shape[32,193,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "examples/tacotron2/train_tacotron2.py", line 503, in <module>
    main()
  File "examples/tacotron2/train_tacotron2.py", line 491, in main
    trainer.fit(
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow_tts\trainers\base_trainer.py", line 999, in fit
    self.run()
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow_tts\trainers\base_trainer.py", line 103, in run
    self._train_epoch()
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow_tts\trainers\base_trainer.py", line 125, in _train_epoch
    self._train_step(batch)
  File "examples/tacotron2/train_tacotron2.py", line 108, in _train_step
    self.one_step_forward(batch)
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\def_function.py", line 840, in _call
    return self._stateless_fn(*args, **kwds)
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\function.py", line 1843, in _filtered_call
    return self._call_flat(
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\function.py", line 1923, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\function.py", line 545, in call
    outputs = execute.execute(
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[32,193,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node tacotron2/decoder/while/body/_1/tacotron2/decoder/while/decoder_cell/mul_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[tacotron2/decoder/while/LoopCond/_179/_40]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[32,193,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[{{node tacotron2/decoder/while/body/_1/tacotron2/decoder/while/decoder_cell/mul_1}}]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations. 0 derived errors ignored. [Op:inferenceone_step_forward_21637]

Function call stack: _one_step_forward -> _one_step_forward

[train]: 0%| | 0/200000 [02:47<?, ?it/s]

Thanks a lot for your help. Luis Vera
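
For readers following the log above, the allocator numbers can be checked with a little arithmetic. The sketch below is only an illustration and not part of TensorFlowTTS: it assumes that "type float" means 4-byte float32 and that Limit and InUse are byte counts, and the helper name `tensor_bytes` is made up for this example.

```python
from math import prod


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor (assumes float32: 4 bytes per element)."""
    return prod(shape) * bytes_per_element


# Values copied from the bfc_allocator "Stats" line in the log above.
limit_bytes = 8_999_199_884
in_use_bytes = 8_989_584_640

# Shape of the tensor the decoder cell failed to allocate.
requested = tensor_bytes((32, 193, 512))
free_bytes = limit_bytes - in_use_bytes

print(f"requested: {requested:,} bytes ({requested / 2**20:.2f} MiB)")
print(f"free:      {free_bytes:,} bytes ({free_bytes / 2**20:.2f} MiB)")
print("fits" if requested <= free_bytes else "does not fit -> OOM")
```

Under these assumptions the failed request is 12,648,448 bytes while only 9,615,244 bytes remain below the limit, which is consistent with the reported LargestFreeBlock of 0. The first dimension of the shape is the batch size, so this particular tensor shrinks in proportion when batch_size is reduced.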

dathudeptrai commented 3 years ago

@luis-vera can you provide your training command line? What is your GPU?

luis-vera commented 3 years ago

Hi, I used the following training command:

python examples/tacotron2/train_tacotron2.py --train-dir dump_ljspeech/train/ --dev-dir dump_ljspeech/valid/ --outdir examples/tacotron2/exp/train.tacotron2.v1 --config examples/tacotron2/conf/tacotron2.v1.yaml --use-norm 1 --mixed_precision 0 --resume ""

My PC has an NVIDIA GeForce RTX 2080 Ti Rev. A.

luis-vera commented 3 years ago

Two GPUs of the same type, joined by a bridge.

OscarVanL commented 3 years ago

Did you change the batch size since you're using 2 GPUs?

If you want to use multi-GPU training, you can replace CUDA_VISIBLE_DEVICES=0 with CUDA_VISIBLE_DEVICES=0,1,2,3, for example. You also need to tune the batch_size for each GPU (in the config file) yourself to maximize performance. Note that multi-GPU is currently supported for training but not yet for decoding.

I'm not sure if it's the same for Tacotron2, but for some other algorithms, you need to tune the batch_size relative to how many GPUs you have.

I vaguely remember a ballpark figure mentioned, that if you have 1 GPU, leave it at the default. If you have 2 GPUs, halve the batch_size. If you have 4 GPUs, quarter the batch_size.
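
The rule of thumb above can be written down as a small calculation. This is only a sketch under stated assumptions: the config's batch_size is per GPU (as the comment in the config file says), and the effective global batch is batch_size times the number of visible GPUs. `visible_gpu_count` and `per_gpu_batch_size` are hypothetical helpers for illustration, not TensorFlowTTS functions.

```python
def visible_gpu_count(cuda_visible_devices: str) -> int:
    """Count the device ids in a CUDA_VISIBLE_DEVICES-style string such as "0,1"."""
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def per_gpu_batch_size(target_global_batch: int, num_gpus: int) -> int:
    """Per-GPU batch size that keeps the global batch roughly constant."""
    if num_gpus < 1:
        raise ValueError("need at least one GPU")
    return max(1, target_global_batch // num_gpus)


default_batch = 32  # single-GPU batch_size mentioned in this thread

for devices in ("0", "0,1", "0,1,2,3"):
    n = visible_gpu_count(devices)
    print(devices, "->", n, "GPU(s), batch_size per GPU:",
          per_gpu_batch_size(default_batch, n))
```

With two GPUs this gives 16 per GPU and with four GPUs it gives 8, matching the halve/quarter rule. It only keeps the global batch constant; it does not by itself guarantee that a batch fits in GPU memory.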

luis-vera commented 3 years ago

Hi. I changed batch_size from 32 to 16, and after that to 8. However, in both cases the error message was the same and training didn't start.

Model: "tacotron2"

Layer (type)                  Output Shape   Param #
encoder (TFTacotronEncoder)   multiple       8218624
decoder_cell (TFTacotronDeco  multiple       18246402
post_net (TFTacotronPostnet)  multiple       5460480
residual_projection (Dense)   multiple       41040

Total params: 31,966,546
Trainable params: 31,956,306
Non-trainable params: 10,240


[train]: 0%| | 0/200000 [00:00<?, ?it/s]
2020-12-11 09:16:50.392068: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:906] Skipping loop optimization for Merge node with control input: cond/branch_executed/_8
Traceback (most recent call last):
  File "examples/tacotron2/train_tacotron2.py", line 503, in <module>
    main()
  File "examples/tacotron2/train_tacotron2.py", line 491, in main
    trainer.fit(
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow_tts\trainers\base_trainer.py", line 999, in fit
    self.run()
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow_tts\trainers\base_trainer.py", line 103, in run
    self._train_epoch()
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow_tts\trainers\base_trainer.py", line 125, in _train_epoch
    self._train_step(batch)
  File "examples/tacotron2/train_tacotron2.py", line 108, in _train_step
    self.one_step_forward(batch)
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\def_function.py", line 840, in _call
    return self._stateless_fn(*args, **kwds)
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\function.py", line 1843, in _filtered_call
    return self._call_flat(
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\function.py", line 1923, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\function.py", line 545, in call
    outputs = execute.execute(
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Incompatible shapes: [16,193,1161] vs. [16,193,2000]
     [[node mul (defined at examples/tacotron2/train_tacotron2.py:179) ]]
     [[tacotron2/decoder/while/body/_1/tacotron2/decoder/while/decoder_cell/assert_positive/assert_less/Assert/Assert/_79]]
  (1) Invalid argument: Incompatible shapes: [16,193,1161] vs. [16,193,2000]
     [[node mul (defined at examples/tacotron2/train_tacotron2.py:179) ]]
0 successful operations.
0 derived errors ignored. [Op:inferenceone_step_forward_23239]

Errors may have originated from an input operation. Input Source operations connected to node mul: tacotron2/transpose (defined at C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow_tts\models\tacotron2.py:882) batch (defined at examples/tacotron2/train_tacotron2.py:108)

Input Source operations connected to node mul: tacotron2/transpose (defined at C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow_tts\models\tacotron2.py:882) batch (defined at examples/tacotron2/train_tacotron2.py:108)

Function call stack: _one_step_forward -> _one_step_forward

[train]: 0%| | 0/200000 [00:27<?, ?it/s]

dathudeptrai commented 3 years ago

Did you pull the newest code? It should fix the OOM error.

luis-vera commented 3 years ago

I modified examples/tacotron2/conf/tacotron2.v1.yaml only here:

###########################################################
#                DATA LOADER SETTING                      #
###########################################################
batch_size: 8                # Batch size for each GPU with assuming that gradient_accumulation_steps == 1.
remove_short_samples: true   # Whether to remove samples the length of which are less than batch_max_steps.
allow_cache: true            # Whether to allow cache in dataset. If true, it requires cpu memory.
mel_length_threshold: 32     # remove all targets has mel_length <= 32
is_shuffle: true             # shuffle dataset after each epoch.
use_fixed_shapes: true       # use_fixed_shapes for training (2x speed-up)
                             # refer (https://github.com/dathudeptrai/TensorflowTTS/issues/34#issuecomment-642309118)

dathudeptrai commented 3 years ago

@luis-vera can you try adding the following code just below https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/tacotron2/tacotron_dataset.py#L195-L197:

        datasets = datasets.filter(
            lambda x: x["mel_lengths"] <= 800
        )

Do not forget pull newest code.
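
To see what the suggested filter does in isolation, here is a minimal sketch on a toy tf.data pipeline. The toy lengths are invented for illustration, and the real dataset built in tacotron_dataset.py carries more fields than mel_lengths; only the filter call mirrors the snippet above.

```python
import tensorflow as tf

# Toy stand-in for the real dataset: each element is a dict with a "mel_lengths" entry.
# These lengths are made up for the example.
mel_lengths = [120, 640, 800, 801, 1161]
datasets = tf.data.Dataset.from_tensor_slices({"mel_lengths": mel_lengths})

# Same predicate as suggested above: keep utterances with at most 800 mel frames.
datasets = datasets.filter(lambda x: x["mel_lengths"] <= 800)

kept = [int(item["mel_lengths"]) for item in datasets.as_numpy_iterator()]
print(kept)
```

Only the elements with 800 frames or fewer survive, so the longest utterances never reach the model. That caps the sequence length per batch, which is presumably why it was suggested against the OOM error.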

luis-vera commented 3 years ago

Hi, I tried batch_size=16 in tacotron2.v1 and added the suggested code to tacotron_dataset. The result was:

Model: "tacotron2"

Layer (type)                  Output Shape   Param #
encoder (TFTacotronEncoder)   multiple       8218624
decoder_cell (TFTacotronDeco  multiple       18246402
post_net (TFTacotronPostnet)  multiple       5460480
residual_projection (Dense)   multiple       41040

Total params: 31,966,546
Trainable params: 31,956,306
Non-trainable params: 10,240


[train]: 0%| | 0/200000 [00:00<?, ?it/s]
2020-12-11 10:48:41.086001: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:906] Skipping loop optimization for Merge node with control input: cond/branch_executed/_8
Traceback (most recent call last):
  File "examples/tacotron2/train_tacotron2.py", line 503, in <module>
    main()
  File "examples/tacotron2/train_tacotron2.py", line 491, in main
    trainer.fit(
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow_tts\trainers\base_trainer.py", line 999, in fit
    self.run()
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow_tts\trainers\base_trainer.py", line 103, in run
    self._train_epoch()
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow_tts\trainers\base_trainer.py", line 125, in _train_epoch
    self._train_step(batch)
  File "examples/tacotron2/train_tacotron2.py", line 108, in _train_step
    self.one_step_forward(batch)
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\def_function.py", line 840, in _call
    return self._stateless_fn(*args, **kwds)
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\function.py", line 1843, in _filtered_call
    return self._call_flat(
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\function.py", line 1923, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\function.py", line 545, in call
    outputs = execute.execute(
  File "C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [16,193,1161] vs. [16,193,2000]
     [[node mul (defined at examples/tacotron2/train_tacotron2.py:179) ]] [Op:inferenceone_step_forward_23239]

Errors may have originated from an input operation. Input Source operations connected to node mul: tacotron2/transpose (defined at C:\Users\Voice-trainner\anaconda3\envs\Ambiente1\lib\site-packages\tensorflow_tts\models\tacotron2.py:882) batch (defined at examples/tacotron2/train_tacotron2.py:108)

Function call stack: _one_step_forward

[train]: 0%| | 0/200000 [00:26<?, ?it/s]

luis-vera commented 3 years ago

So far, I have been training the same model on another machine, using Windows 10 without a GPU, and I haven't had any problems. At the moment it is at 5616/200000 iterations. But I would prefer to run on the GPU machine because training on the CPU is slow. Thanks

luis-vera commented 3 years ago

I have a question. I am new to this area, so I don't understand what you mean by "pull the code". Thanks a lot

OscarVanL commented 3 years ago

Have you tried it on an Ubuntu 18.04 machine? Technically, this repo is not tested on Windows. I had issues trying to get FastSpeech2 training to run on Windows 10, and the same training then worked fine on Ubuntu.

Alternatively, if you can't get access to an Ubuntu machine and you're on the Windows 10 Insider builds (which are required for WSL/Docker GPU passthrough), you could try the TensorFlowTTS Docker build.

The Docker files can be found here. The same Docker image will work for other scripts in the repo (even though this folder is in the fastspeech2_libritts example).

dathudeptrai commented 3 years ago

I have a question. I am new to this area, so I don't understand what you mean by "pull the code". Thanks a lot

That means pulling the newest code from GitHub.

OscarVanL commented 3 years ago

I have a question. I am new to this area, so I don't understand what you mean by "pull the code". Thanks a lot

Go into your Windows CMD.exe, move into the TensorFlowTTS folder, then run git pull. This will grab the latest code for the project.

luis-vera commented 3 years ago

Hi everybody. I wanted to let you know that the OOM error is solved, but I haven't been able to solve this new error message. I'll try Docker today as a second option.


Layer (type)                  Output Shape   Param #
encoder (TFTacotronEncoder)   multiple       8218624
decoder_cell (TFTacotronDeco  multiple       18246402
post_net (TFTacotronPostnet)  multiple       5460480
residual_projection (Dense)   multiple       41040

Total params: 31,966,546
Trainable params: 31,956,306
Non-trainable params: 10,240


2020-12-14 23:44:50.005917: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:656] In AUTO-mode, and switching to DATA-based sharding, instead of FILE-based sharding as we cannot find appropriate reader dataset op(s) to shard. Error: Did not find a shardable source, walked to a node which is not a dataset: name: "FlatMapDataset/_3" op: "FlatMapDataset" input: "TensorDataset/_1" input: "Const/_2" attr { key: "Targuments" value { list { type: DT_STRING } } } attr { key: "f" value { func { name: "inference_Dataset_flat_map_flat_map_fn_29" } } } attr { key: "output_shapes" value { list { shape { unknown_rank: true } shape { unknown_rank: true } shape { unknown_rank: true } } } } attr { key: "output_types" value { list { type: DT_STRING type: DT_STRING type: DT_STRING } } } . Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new tf.data.Options() object then setting options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA before applying the options object to the dataset via dataset.with_options(options).

2020-12-14 23:44:50.066821: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:656] In AUTO-mode, and switching to DATA-based sharding, instead of FILE-based sharding as we cannot find appropriate reader dataset op(s) to shard. Error: Did not find a shardable source, walked to a node which is not a dataset: name: "FlatMapDataset/_3" op: "FlatMapDataset" input: "TensorDataset/_1" input: "Const/_2" attr { key: "Targuments" value { list { type: DT_STRING } } } attr { key: "f" value { func { name: "inference_Dataset_flat_map_flat_map_fn_293" } } } attr { key: "output_shapes" value { list { shape { unknown_rank: true } shape { unknown_rank: true } shape { unknown_rank: true } } } } attr { key: "output_types" value { list { type: DT_STRING type: DT_STRING type: DT_STRING } } } . Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new tf.data.Options() object then setting options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA before applying the options object to the dataset via dataset.with_options(options).

[train]: 0%| | 0/200000 [00:00<?, ?it/s]
2020-12-14 23:44:50.181980: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2020-12-14 23:44:50.260349: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:906] Skipping loop optimization for Merge node with control input: cond/branch_executed/_8
Traceback (most recent call last):
  File "examples/tacotron2/train_tacotron2.py", line 503, in <module>
    main()
  File "examples/tacotron2/train_tacotron2.py", line 491, in main
    trainer.fit(
  File "C:\Users\Voice-trainner\anaconda3\lib\site-packages\tensorflow_tts\trainers\base_trainer.py", line 999, in fit
    self.run()
  File "C:\Users\Voice-trainner\anaconda3\lib\site-packages\tensorflow_tts\trainers\base_trainer.py", line 103, in run
    self._train_epoch()
  File "C:\Users\Voice-trainner\anaconda3\lib\site-packages\tensorflow_tts\trainers\base_trainer.py", line 125, in _train_epoch
    self._train_step(batch)
  File "examples/tacotron2/train_tacotron2.py", line 108, in _train_step
    self.one_step_forward(batch)
  File "C:\Users\Voice-trainner\anaconda3\lib\site-packages\tensorflow\python\eager\def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "C:\Users\Voice-trainner\anaconda3\lib\site-packages\tensorflow\python\eager\def_function.py", line 888, in _call
    return self._stateless_fn(*args, **kwds)
  File "C:\Users\Voice-trainner\anaconda3\lib\site-packages\tensorflow\python\eager\function.py", line 2942, in __call__
    return graph_function._call_flat(
  File "C:\Users\Voice-trainner\anaconda3\lib\site-packages\tensorflow\python\eager\function.py", line 1918, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "C:\Users\Voice-trainner\anaconda3\lib\site-packages\tensorflow\python\eager\function.py", line 555, in call
    outputs = execute.execute(
  File "C:\Users\Voice-trainner\anaconda3\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Trying to access element 138 in a list with 138 elements.
     [[{{node while_19/body/_1/while/TensorArrayV2Read_1/TensorListGetItem}}]]
     [[tacotron2/encoder/bilstm/forward_lstm/PartitionedCall]] [Op:inferenceone_step_forward_23120]

Function call stack: _one_step_forward -> _one_step_forward -> _one_step_forward

[train]: 0%| | 0/200000 [00:23<?, ?it/s]
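
The auto-shard warnings in the log above describe their own workaround. A minimal sketch of that suggestion on a toy dataset follows. The toy dataset is an assumption for illustration, and where the real dataset object would be configured inside TensorFlowTTS is not shown. Since the warning says TensorFlow already falls back to DATA-based sharding, this most likely only silences the message and is probably unrelated to the InvalidArgumentError that stops the run.

```python
import tensorflow as tf

# Toy dataset standing in for the real training dataset.
dataset = tf.data.Dataset.range(8).batch(2)

# As the warning text suggests: create an Options object, set the
# auto-shard policy to DATA, and apply it to the dataset.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = (
    tf.data.experimental.AutoShardPolicy.DATA
)
dataset = dataset.with_options(options)

print(dataset.options().experimental_distribute.auto_shard_policy)
```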

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.