TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

failed to initialize batched cufft plan with customized allocator #711

Closed · abaddon-moriarty closed 2 years ago

abaddon-moriarty commented 2 years ago

Hello everyone, I am currently training a phoneme-based HiFi-GAN model and I recently ran into the following issue. It started when I tried using multiple GPUs, but now I can't even train on a single GPU.

The usual advice is to reduce the batch size, but these are the settings in my hifigan.v1.yaml file:

[screenshot: batch-size settings from hifigan.v1.yaml]

I saw this issue with the same `failed to initialize batched cufft plan with customized allocator` error, but in that case the GPU ran out of memory, which is not the case for me. I also saw in this issue that the problem was the value of `batch_max_steps_valid`, but I've used the same file to train other vocoders and this is the first time this error has arisen. What should the correct value be?

```
INFO:tensorflow:batch_all_reduce: 156 all-reduces with algorithm = nccl, num_packs = 1
2021-11-29 16:45:53,870 (cross_device_ops:702) INFO: batch_all_reduce: 156 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 102 all-reduces with algorithm = nccl, num_packs = 1
2021-11-29 16:46:15,996 (cross_device_ops:702) INFO: batch_all_reduce: 102 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 156 all-reduces with algorithm = nccl, num_packs = 1
2021-11-29 16:46:53,329 (cross_device_ops:702) INFO: batch_all_reduce: 156 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 102 all-reduces with algorithm = nccl, num_packs = 1
2021-11-29 16:47:14,118 (cross_device_ops:702) INFO: batch_all_reduce: 102 all-reduces with algorithm = nccl, num_packs = 1
2021-11-29 16:48:33.400178: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-11-29 16:48:38.176008: E tensorflow/stream_executor/cuda/cuda_fft.cc:223] failed to make cuFFT batched plan:5
2021-11-29 16:48:38.176052: E tensorflow/stream_executor/cuda/cuda_fft.cc:426] Initialize Params: rank: 1 elem_count: 2048 input_embed: 2048 input_stride: 1 input_distance: 2048 output_embed: 1025 output_stride: 1 output_distance: 1025 batch_count: 480
2021-11-29 16:48:38.176062: F tensorflow/stream_executor/cuda/cuda_fft.cc:435] failed to initialize batched cufft plan with customized allocator: Failed to make cuFFT batched plan.
```
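For context, the plan parameters in the log are consistent with a real-to-complex FFT of length 2048, which is the kind of transform the STFT in HiFi-GAN's spectral losses would request: a 2048-point rFFT produces 2048/2 + 1 = 1025 output bins, matching `output_embed`, and `batch_count: 480` is presumably the number of STFT frames batched into one plan. A quick sanity check (plain Python, no TensorFlow needed):

```python
# Sanity-check the cuFFT plan parameters from the log: a real-to-complex
# (rFFT) transform of length N produces N // 2 + 1 complex output bins.
fft_length = 2048                     # elem_count / input_embed in the log
output_bins = fft_length // 2 + 1
print(output_bins)                    # 1025, matching output_embed in the log
```

So the failure is not a malformed plan; the parameters are exactly what an STFT with `fft_length=2048` would produce.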

Any ideas on how to correct this? Thank you.
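One workaround often suggested for this family of cuFFT plan failures is to stop TensorFlow from pre-allocating nearly all GPU memory at startup, so cuFFT has headroom for its own workspace allocations. A minimal sketch, assuming a standard TensorFlow 2 setup (this is a general workaround, not one confirmed in this thread):

```python
import os

# Make TensorFlow allocate GPU memory on demand instead of grabbing
# almost all of it at startup, leaving headroom for cuFFT workspaces.
# Must be set before TensorFlow initializes the GPUs.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# Equivalent in-process API (must also run before the first GPU use):
# import tensorflow as tf
# for gpu in tf.config.list_physical_devices("GPU"):
#     tf.config.experimental.set_memory_growth(gpu, True)
```

Either form has the same effect; the environment variable is convenient because it works without touching the training script.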

dathudeptrai commented 2 years ago

@ZDisket do you know what the problem is here?

abaddon-moriarty commented 2 years ago

I have re-initialised everything and started from scratch; I no longer have this issue.