coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0
34.7k stars 4.21k forks

[Bug] cuFFT error #2365

Closed by mesut92 1 year ago

mesut92 commented 1 year ago

Describe the bug

I am trying to train VITS on LJSpeech with an RTX 4090. I get the error below and have not been able to fix it. I have already updated PyTorch and the NVIDIA drivers.

To Reproduce

Run this command: python recipes/turk/vits_tts/train_vits.py

It fails with the following error:

/usr/local/lib/python3.8/dist-packages/torch/functional.py:632: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:801.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
 ! Run is removed from /media/mesut/Depo1/works/TTS/recipes/turk/vits_tts/vits_ljspeech-February-26-2023_08+55AM-0000000
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/trainer/trainer.py", line 1591, in fit
    self._fit()
  File "/usr/local/lib/python3.8/dist-packages/trainer/trainer.py", line 1544, in _fit
    self.train_epoch()
  File "/usr/local/lib/python3.8/dist-packages/trainer/trainer.py", line 1309, in train_epoch
    _, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "/usr/local/lib/python3.8/dist-packages/trainer/trainer.py", line 1126, in train_step
    batch = self.format_batch(batch)
  File "/usr/local/lib/python3.8/dist-packages/trainer/trainer.py", line 926, in format_batch
    batch = self.model.format_batch_on_device(batch)
  File "/media/mesut/Depo1/works/TTS/TTS/tts/models/vits.py", line 1503, in format_batch_on_device
    batch["spec"] = wav_to_spec(wav, ac.fft_size, ac.hop_length, ac.win_length, center=False)
  File "/media/mesut/Depo1/works/TTS/TTS/tts/models/vits.py", line 123, in wav_to_spec
    spec = torch.stft(
  File "/usr/local/lib/python3.8/dist-packages/torch/functional.py", line 632, in stft
    return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR
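The traceback ends in torch.stft, called from wav_to_spec, so the failure can be checked outside the training recipe. The snippet below is a minimal sketch and not part of the TTS code base: the function name stft_smoke_test is hypothetical, and the STFT sizes (1024/256/1024) and the one-second random input are assumed values, not read from the recipe config.

```python
import importlib.util


def stft_smoke_test(n_fft=1024, hop_length=256, win_length=1024):
    """Run a single STFT on the GPU (CPU as fallback) and report what happened.

    The default sizes are assumptions for illustration only.
    """
    if importlib.util.find_spec("torch") is None:
        return "torch-not-installed"

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    wav = torch.randn(1, 22050, device=device)  # random stand-in for a waveform
    window = torch.hann_window(win_length, device=device)
    try:
        spec = torch.stft(
            wav,
            n_fft,
            hop_length=hop_length,
            win_length=win_length,
            window=window,
            center=False,
            return_complex=True,
        )
    except RuntimeError as err:
        # A cuFFT failure would surface here as a RuntimeError.
        return f"failed on {device}: {err}"
    return f"ok on {device}: {tuple(spec.shape)}"


if __name__ == "__main__":
    print(stft_smoke_test())
```

If this reports a failure on cuda while the same call succeeds on the CPU, the problem likely sits in the PyTorch/CUDA installation rather than in the recipe or dataset.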

Expected behavior

Training starts without errors.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 4090"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.13.1+cu117",
        "TTS": "0.11.1",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.8.10",
        "version": "#66~20.04.1-Ubuntu SMP Wed Jan 25 09:41:30 UTC 2023"
    }
}
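One thing worth checking in the environment report above, although nothing in this thread confirms it as the cause: the GPU is an Ada-generation card, and CUDA 11.8 is the first toolkit release that lists Ada Lovelace support, while the installed PyTorch build is tagged cu117. The sketch below only parses the version string; the helper names and the 11.8 threshold constant are introduced here for illustration.

```python
import re

# Assumption: first CUDA toolkit release listing Ada Lovelace (RTX 40-series) support.
ADA_MIN_CUDA = (11, 8)


def cuda_build(torch_version):
    """Return (major, minor) from a version like '1.13.1+cu117', or None if absent."""
    match = re.search(r"\+cu(\d+)$", torch_version)
    if match is None:
        return None  # CPU-only or unrecognised build tag
    digits = match.group(1)
    # The tag packs major and minor together: cu117 -> 11.7, cu102 -> 10.2
    return int(digits[:-1]), int(digits[-1])


def build_supports_ada(torch_version):
    """True if the build's CUDA version is at least ADA_MIN_CUDA."""
    build = cuda_build(torch_version)
    return build is not None and build >= ADA_MIN_CUDA


if __name__ == "__main__":
    print(cuda_build("1.13.1+cu117"), build_supports_ada("1.13.1+cu117"))
```

On a live system, torch.__version__ provides the string to pass in.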

Additional context

No response

erogol commented 1 year ago

I can't reproduce this. In general, this error is an OOM (out-of-memory) issue.
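To judge whether memory pressure is plausible, free and total GPU memory can be read just before training starts. This is a rough sketch, assuming the installed PyTorch provides torch.cuda.mem_get_info; the helper names describe_memory and gpu_memory_report are hypothetical.

```python
import importlib.util


def describe_memory(free_bytes, total_bytes):
    """Format a (free, total) byte pair as a short human-readable string."""
    gib = 1024 ** 3
    return f"{free_bytes / gib:.1f} GiB free of {total_bytes / gib:.1f} GiB"


def gpu_memory_report():
    """Return a memory summary for the current CUDA device, or None if unavailable."""
    if importlib.util.find_spec("torch") is None:
        return None

    import torch

    if not torch.cuda.is_available():
        return None
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    return describe_memory(free_bytes, total_bytes)


if __name__ == "__main__":
    print(gpu_memory_report())
```

If plenty of memory is free and the error still appears on the first batch, a cause other than OOM becomes more likely.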

pathnirvana commented 1 year ago

I am getting the same error on an RTX 4090 with the LJSpeech dataset when running !CUDA_VISIBLE_DEVICES=0 python3 recipes/ljspeech/vits_tts/train_vits.py

edit: a solution is mentioned here