Closed kin0303 closed 2 years ago
Hey, could you please try with the latest CUDA 11 version?
Okay, I'll try CUDA 11.3.
I tried it, but I still get the error:
--> STEP: 267/3243 -- GLOBAL_STEP: 3510
| > decoder_loss: 33.00844 (33.08039)
| > postnet_loss: 35.04267 (35.08060)
| > stopnet_loss: 0.81244 (0.85546)
| > decoder_coarse_loss: 32.94590 (33.04457)
| > decoder_ddc_loss: 0.00334 (0.00528)
| > ga_loss: 0.00708 (0.01021)
| > decoder_diff_spec_loss: 0.41570 (0.43186)
| > postnet_diff_spec_loss: 4.45244 (4.44248)
| > decoder_ssim_loss: 0.99999 (0.99990)
| > postnet_ssim_loss: 0.99990 (0.99950)
| > loss: 27.81493 (27.92766)
| > align_error: 0.97789 (0.96763)
| > grad_norm: 4.54514 (5.27340)
| > current_lr: 0.00000
| > step_time: 0.30480 (0.25091)
| > loader_time: 0.00150 (0.03358)
! Run is kept in /media/DATA-2/TTS/coqui/TTS/run-April-22-2022_01+20PM-0cf3265a
Traceback (most recent call last):
  File "/media/DATA-2/TTS/coqui/tts_coqui/lib/python3.8/site-packages/trainer/trainer.py", line 1485, in fit
    self._fit()
  File "/media/DATA-2/TTS/coqui/tts_coqui/lib/python3.8/site-packages/trainer/trainer.py", line 1469, in _fit
    self.train_epoch()
  File "/media/DATA-2/TTS/coqui/tts_coqui/lib/python3.8/site-packages/trainer/trainer.py", line 1248, in train_epoch
    _, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "/media/DATA-2/TTS/coqui/tts_coqui/lib/python3.8/site-packages/trainer/trainer.py", line 1081, in train_step
    outputs, loss_dict_new, step_time = self._optimize(
  File "/media/DATA-2/TTS/coqui/tts_coqui/lib/python3.8/site-packages/trainer/trainer.py", line 1018, in _optimize
    loss_dict["loss"].backward()
  File "/media/DATA-2/TTS/coqui/tts_coqui/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/media/DATA-2/TTS/coqui/tts_coqui/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
{ "CUDA": { "GPU": [ "NVIDIA GeForce GTX 1660 Ti" ], "available": true, "version": "11.3" }, "Packages": { "PyTorch_debug": false, "PyTorch_version": "1.11.0+cu113", "TTS": "0.6.1", "numpy": "1.19.5" }, "System": { "OS": "Linux", "architecture": [ "64bit", "ELF" ], "processor": "x86_64", "python": "3.8.0", "version": "#123~18.04.1-Ubuntu SMP Fri Apr 8 09:48:52 UTC 2022" } }
In general, this is a sign of insufficient VRAM. You can try a GPU with more VRAM, reduce the batch size, or limit the maximum audio length allowed in training.
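To illustrate why both knobs matter, here is a rough sketch in plain Python. It is not the Coqui TTS API: `filter_by_max_len` and `padded_batch_elements` are hypothetical helpers, and the clip lengths are made-up values assuming 22050 Hz audio. The only point is that a zero-padded batch grows with the batch size times the longest clip in it, so dropping long clips and shrinking the batch both reduce the worst case.

```python
from typing import List


def filter_by_max_len(lengths: List[int], max_len: int) -> List[int]:
    """Keep only clips whose length (in samples) does not exceed max_len."""
    return [n for n in lengths if n <= max_len]


def padded_batch_elements(lengths: List[int], batch_size: int) -> int:
    """Worst-case element count of one zero-padded batch.

    Every clip in a batch is padded to the longest clip, so the upper
    bound is batch_size * longest clip length.
    """
    if not lengths:
        return 0
    return batch_size * max(lengths)


# Hypothetical clips of 2 s, 5 s and 10 s at 22050 Hz.
lengths = [2 * 22050, 5 * 22050, 10 * 22050]

# Worst case with a batch size of 32 and no length limit.
print(padded_batch_elements(lengths, batch_size=32))

# Worst case after dropping clips longer than 6 s and using a batch size of 4.
kept = filter_by_max_len(lengths, max_len=6 * 22050)
print(padded_batch_elements(kept, batch_size=4))
```

This only counts raw input samples; actual VRAM use also depends on the model, the optimizer state, and the activations kept for the backward pass.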
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.
This issue was resolved by adding memory
Hi @blackmamba1122,
I am facing the same issue.
Can you please give more detail on what you mean by "adding memory"? Did you use a different GPU with larger VRAM?
Thanks.
Hi, sorry, I only just read your comment. Yes: I previously used 16 GiB of RAM (dual channel), and now I use 32 GiB (dual channel). But it only works with batch_size=4.
I experienced this issue on Linux and I solved it by running
$ unset LD_LIBRARY_PATH
I got this error too. It's funny that re-executing the code solved the problem for me.
unset LD_LIBRARY_PATH
Saved my day! Thanks!
I still get this error after trying 'unset LD_LIBRARY_PATH'. In my case it runs on CPU, but not on CUDA.
I'm still having this issue after running 'unset LD_LIBRARY_PATH'
unset LD_LIBRARY_PATH
Hoping to understand this issue deeper. For those who found success in this command, why does this command work for you? I'm reading this link but I'm not following how it is connected to the CUDA error.
I LOVE YOU
I figured this out!!! THX! I don't understand it, but I was just in awe.
I still get this error after trying 'unset LD_LIBRARY_PATH'. In my case it runs on CPU, but not on CUDA.
Same. Can someone help?
Also fixed for me after running unset LD_LIBRARY_PATH, and I would love to know why.
Thank you very much, this worked for me, but I specifically want to know why?
I used unset LD_LIBRARY_PATH and it solved my problem. I also wonder why this works.
I need to switch CUDA versions frequently, so I wrote a script to modify CUDA_PATH and LD_LIBRARY_PATH, which is also how I triggered this bug (refer to this link). My guess is that the error occurs because the program links against the libraries on the system's LD_LIBRARY_PATH, when what it really needs are the dynamic libraries in the conda environment.
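One way to check whether that guess could apply to your setup is to see which LD_LIBRARY_PATH entries contain a cuBLAS library at all. The sketch below uses only the Python standard library; `find_shadowing_dirs` is a hypothetical helper name, and the `libcublas*.so*` pattern is an assumption about how the library files are named on Linux.

```python
import glob
import os
from typing import List


def find_shadowing_dirs(ld_library_path: str,
                        pattern: str = "libcublas*.so*") -> List[str]:
    """Return the LD_LIBRARY_PATH entries that contain a file matching pattern."""
    hits = []
    for entry in ld_library_path.split(os.pathsep):
        # Skip empty entries produced by leading, trailing, or doubled separators.
        if not entry:
            continue
        # Escape the directory so special characters in it are not treated as wildcards.
        if glob.glob(os.path.join(glob.escape(entry), pattern)):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    for directory in find_shadowing_dirs(os.environ.get("LD_LIBRARY_PATH", "")):
        print("cuBLAS found on LD_LIBRARY_PATH:", directory)
```

If this prints a system CUDA directory whose version differs from the one your PyTorch build expects, that would be consistent with the explanation above, though it does not prove it. If it prints nothing, unsetting LD_LIBRARY_PATH is unlikely to change which cuBLAS gets loaded.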
Thanks for your advice, my program works.
Description
I get a UserWarning when I try to train Tacotron2 on LJSpeech. The warning is shown below:
After epoch 8/1000, at step 1886/3243, I got an error like this:
Is the warning the cause of the error? How should I handle it?
I will display the error in its entirety:
Environment
{ "CUDA": { "GPU": [ "NVIDIA GeForce GTX 1660 Ti" ], "available": true, "version": "10.2" }, "Packages": { "PyTorch_debug": false, "PyTorch_version": "1.11.0+cu102", "TTS": "0.6.1", "numpy": "1.19.5" }, "System": { "OS": "Linux", "architecture": [ "64bit", "ELF" ], "processor": "x86_64", "python": "3.8.0", "version": "#118~18.04.1-Ubuntu SMP Thu Mar 3 13:53:15 UTC 2022" } }
Additional context
I've reduced the batch size; I'm currently using batch size = 4. Could the error also be caused by reducing the batch size? If I don't reduce it, I run out of memory (OOM).