mr-segfault opened 1 year ago
After doing a little digging, I found that other people online have encountered this bug with dual-GPU setups. I disabled dual-GPU training (training on only one card), and that seemed to work up until it hit the error associated with #167. The single-GPU workaround is sketched below; the log after it shows the run reaching epoch 20 before the crash.
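For anyone hitting the same thing, a minimal sketch of the workaround, assuming the trainer respects CUDA_VISIBLE_DEVICES (standard PyTorch behavior); the device index 0 is just an example, not anything the project mandates:

```python
import os

# Hide all but one GPU before torch initializes CUDA; the index is an
# example -- pick whichever card you want to train on.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

# With a single visible device, the multi-GPU spawn/DDP code paths that
# trigger this crash are never exercised.
print(torch.cuda.device_count())      # 1
print(torch.cuda.get_device_name(0))
```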
INFO:user-test-3:[14000, 9.976276699833672e-05]
INFO:user-test-3:loss_disc=2.972, loss_gen=3.147, loss_fm=8.040,loss_mel=17.520, loss_kl=0.971
INFO:user-test-3:Saving model and optimizer state at epoch 20 to ./logs/user-test-3/G_14060.pth
INFO:user-test-3:Saving model and optimizer state at epoch 20 to ./logs/user-test-3/D_14060.pth
INFO:user-test-3:====> Epoch: 20
INFO:user-test-3:Training is done. The program is closed.
INFO:user-test-3:saving final ckpt:Success.
Traceback (most recent call last):
File "/home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 534, in
What version of PyTorch are you using? This bug was found in 1.5.0.
I am using a venv, so it should be using whatever was installed via the initial git instructions plus `pip install -r requirements.txt`.
I used this to check: `python -c "import torch; print(torch.__version__)"` and received: 2.0.0+cu117
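For a slightly fuller picture (all standard PyTorch attributes), this also reports the CUDA and cuDNN builds, which are relevant to the cuDNN error below:

```python
import torch

print(torch.__version__)               # e.g. 2.0.0+cu117
print(torch.version.cuda)              # CUDA toolkit the wheel was built against
print(torch.backends.cudnn.version())  # cuDNN build (the failing call below is cuDNN)
```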
Which of these GPUs did this occur on?
It looks like the 1660 Ti skips several steps due to lack of resources, which is fine, but it crashes when it reaches the file-handling step. The same thing happens on the 2x3060 system, which can actually complete all of the processing/training steps (aside from the dual-GPU training issue reported here in #215), but then crashes during file writing.
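To put "lack of resources" in concrete terms, a quick check of per-device memory using standard PyTorch calls (a sketch; run it on each affected machine):

```python
import torch

# Print total memory per visible GPU to confirm which card is short on VRAM.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"{i}: {props.name}, {props.total_memory / 2**30:.1f} GiB")
```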
Dual-GPU training is still broken on the current build.
INFO:test6:Train Epoch: 6 [56%]
INFO:test6:[1000, 9.993751562304699e-05]
INFO:test6:loss_disc=2.822, loss_gen=3.128, loss_fm=8.556,loss_mel=19.719, loss_kl=1.550
INFO:test6:====> Epoch: 6
INFO:test6:Train Epoch: 7 [67%]
INFO:test6:[1200, 9.99250234335941e-05]
INFO:test6:loss_disc=3.213, loss_gen=2.712, loss_fm=6.322,loss_mel=16.615, loss_kl=1.119
INFO:test6:====> Epoch: 7
[W CUDAGuardImpl.h:124] Warning: CUDA warning: unspecified launch failure (function destroyEvent)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f711e4634d7 in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f711e42d36b in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f711e507fa8 in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0xdf9d4e (0x7f70a83f9d4e in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x4ccea6 (0x7f70e6cccea6 in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x3ee77 (0x7f711e448e77 in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f711e44169e in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f711e4417b9 in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x53b5163 (0x7f70d2bb5163 in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #9: c10d::ProcessGroupGloo::runLoop(int) + 0x2fe (0x7f70d2bbeb0e in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0xdc3a3 (0x7f716e6dc3a3 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #11: <unknown function> + 0x90402 (0x7f71c1490402 in /lib/x86_64-linux-gnu/libc.so.6)
frame #12: <unknown function> + 0x11f590 (0x7f71c151f590 in /lib/x86_64-linux-gnu/libc.so.6)
Traceback (most recent call last):
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 534, in <module>
main()
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 50, in main
mp.spawn(
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 202, in run
train_and_evaluate(
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 415, in train_and_evaluate
scaler.scale(loss_gen_all).backward()
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
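Because kernel failures are reported asynchronously (as the message above says), the frame blamed here, `scaler.scale(loss_gen_all).backward()`, may not be where the kernel actually failed. A sketch of forcing synchronous launches to get an accurate trace; setting the variable in the shell before launching works equally well:

```python
import os

# Must be set before CUDA is initialized, i.e. before importing torch
# in the training entry point. Shell equivalent:
#   CUDA_LAUNCH_BLOCKING=1 python train_nsf_sim_cache_sid_load_pretrain.py ...
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # every kernel launch is now synchronous (slow; debugging only)
```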
I'm encountering the same issue running a single RTX 4090.
Using an Ubuntu system with 2x3060 (12 GB each) and the latest version of RVC, commit c4a1810.
During training, after a few epochs complete, a CUDA error is thrown: