mr-segfault opened 1 year ago
After doing a little digging, I found that other people online have encountered this bug with dual-GPU setups. I disabled dual-GPU training (training on only one card), and that seemed to work up until it hit the error associated with #167. The single-GPU workaround is sketched below; the log after it shows the run reaching epoch 20 before the crash.
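For anyone hitting the same thing, a minimal sketch of the workaround, assuming the trainer respects CUDA_VISIBLE_DEVICES (standard PyTorch behavior); the device index 0 is just an example, not anything the project mandates:

```python
import os

# Hide all but one GPU before torch initializes CUDA; the index is an
# example -- pick whichever card you want to train on.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

# With a single visible device, the multi-GPU spawn/DDP code paths that
# trigger this crash are never exercised.
print(torch.cuda.device_count())      # 1
print(torch.cuda.get_device_name(0))
```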
INFO:user-test-3:[14000, 9.976276699833672e-05]
INFO:user-test-3:loss_disc=2.972, loss_gen=3.147, loss_fm=8.040,loss_mel=17.520, loss_kl=0.971
INFO:user-test-3:Saving model and optimizer state at epoch 20 to ./logs/user-test-3/G_14060.pth
INFO:user-test-3:Saving model and optimizer state at epoch 20 to ./logs/user-test-3/D_14060.pth
INFO:user-test-3:====> Epoch: 20
INFO:user-test-3:Training is done. The program is closed.
INFO:user-test-3:saving final ckpt:Success.
Traceback (most recent call last):
File "/home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 534, in
What version of PyTorch are you using? This bug was found in 1.5.0.
I am using a venv, so it should be using whatever was installed via the initial git instructions plus `pip install -r requirements.txt`.
I used this to check: `python -c "import torch; print(torch.__version__)"` and received: 2.0.0+cu117
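For a slightly fuller picture (all standard PyTorch attributes), this also reports the CUDA and cuDNN builds, which are relevant to the cuDNN error below:

```python
import torch

print(torch.__version__)               # e.g. 2.0.0+cu117
print(torch.version.cuda)              # CUDA toolkit the wheel was built against
print(torch.backends.cudnn.version())  # cuDNN build (the failing call below is cuDNN)
```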
Which of these GPUs did this occur on?
It looks like the 1660 Ti skips several steps due to lack of resources, which is fine, but it crashes when it reaches the file-handling step. The same thing happens on the 2x3060 system, which can actually complete all of the processing/training steps (aside from the dual-GPU training issue reported here in #215), but then crashes during file writing.
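To put "lack of resources" in concrete terms, a quick check of per-device memory using standard PyTorch calls (a sketch; run it on each affected machine):

```python
import torch

# Print total memory per visible GPU to confirm which card is short on VRAM.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"{i}: {props.name}, {props.total_memory / 2**30:.1f} GiB")
```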
Dual-GPU training is still broken on the current build.
INFO:test6:Train Epoch: 6 [56%]
INFO:test6:[1000, 9.993751562304699e-05]
INFO:test6:loss_disc=2.822, loss_gen=3.128, loss_fm=8.556,loss_mel=19.719, loss_kl=1.550
INFO:test6:====> Epoch: 6
INFO:test6:Train Epoch: 7 [67%]
INFO:test6:[1200, 9.99250234335941e-05]
INFO:test6:loss_disc=3.213, loss_gen=2.712, loss_fm=6.322,loss_mel=16.615, loss_kl=1.119
INFO:test6:====> Epoch: 7
[W CUDAGuardImpl.h:124] Warning: CUDA warning: unspecified launch failure (function destroyEvent)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f711e4634d7 in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f711e42d36b in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f711e507fa8 in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0xdf9d4e (0x7f70a83f9d4e in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x4ccea6 (0x7f70e6cccea6 in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x3ee77 (0x7f711e448e77 in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f711e44169e in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f711e4417b9 in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x53b5163 (0x7f70d2bb5163 in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #9: c10d::ProcessGroupGloo::runLoop(int) + 0x2fe (0x7f70d2bbeb0e in /home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0xdc3a3 (0x7f716e6dc3a3 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #11: <unknown function> + 0x90402 (0x7f71c1490402 in /lib/x86_64-linux-gnu/libc.so.6)
frame #12: <unknown function> + 0x11f590 (0x7f71c151f590 in /lib/x86_64-linux-gnu/libc.so.6)
Traceback (most recent call last):
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 534, in <module>
main()
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 50, in main
mp.spawn(
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 202, in run
train_and_evaluate(
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 415, in train_and_evaluate
scaler.scale(loss_gen_all).backward()
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/user/Retrieval-based-Voice-Conversion-WebUI/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
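Because kernel failures are reported asynchronously (as the message above says), the frame blamed here, `scaler.scale(loss_gen_all).backward()`, may not be where the kernel actually failed. A sketch of forcing synchronous launches to get an accurate trace; setting the variable in the shell before launching works equally well:

```python
import os

# Must be set before CUDA is initialized, i.e. before importing torch
# in the training entry point. Shell equivalent:
#   CUDA_LAUNCH_BLOCKING=1 python train_nsf_sim_cache_sid_load_pretrain.py ...
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # every kernel launch is now synchronous (slow; debugging only)
```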
I'm encountering the same issue running a single RTX 4090.
Using an Ubuntu system with 2x3060 (12 GB each) and the latest version of RVC, commit c4a1810.
During training, after a few epochs complete, a CUDA error is thrown: