SayaSS / vits-finetuning

Fine-Tuning your VITS model using a pre-trained model
MIT License
546 stars 86 forks

CUDA error #28

Closed sent00 closed 1 year ago

sent00 commented 1 year ago

Hello. I am training with this notebook, but after switching to a premium GPU (NVIDIA A100-SXM4-40GB) I get the error below. Is there any way to deal with it? Training works on a Tesla T4, but I was able to use the premium GPU until last week, so I would like to train on it if possible.


NVIDIA A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70. If you want to use the NVIDIA A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Traceback (most recent call last):
  File "/content/vits-finetuning/train_ms.py", line 306, in <module>
    main()
  File "/content/vits-finetuning/train_ms.py", line 56, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/content/vits-finetuning/train_ms.py", line 106, in run
    net_g = DDP(net_g, device_ids=[rank])
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/parallel/distributed.py", line 646, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: CUDA error: invalid device function
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
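
The warning above says the installed PyTorch build was compiled for sm_37, sm_50, sm_60 and sm_70, while the A100 reports sm_80. A minimal sketch of that kind of compatibility check follows. The helper name build_supports_device is hypothetical, and the rule it applies (a build is usable when it contains some sm_ architecture with the same major version as the GPU) is an assumption about how such a check could work, not a copy of PyTorch's internals. It ignores PTX (compute_XX) entries.

```python
from typing import Iterable, Tuple


def build_supports_device(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Hypothetical helper: True if some compiled sm_XY arch shares the GPU's major version."""
    major, _minor = capability
    # "sm_70" -> 70; entries that are not sm_* (e.g. PTX targets) are skipped.
    supported_sm = [int(arch.split("_")[1]) for arch in arch_list if arch.startswith("sm_")]
    return any(sm // 10 == major for sm in supported_sm)


if __name__ == "__main__":
    # Arch list copied from the warning in this thread.
    build_archs = ["sm_37", "sm_50", "sm_60", "sm_70"]
    print("A100 (8, 0):", build_supports_device((8, 0), build_archs))
    print("T4   (7, 5):", build_supports_device((7, 5), build_archs))

    # Optional: query the real environment when PyTorch and a GPU are present.
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        print(torch.cuda.get_device_capability(), torch.cuda.get_arch_list())
```

Under this assumption the A100 is rejected and the T4 is accepted, which is consistent with training working on the T4 but not on the A100.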

KagaDoyan commented 1 year ago

Not sure if this is going to work for you, but it is worth trying: https://discuss.pytorch.org/t/nvidia-a100-gpu-runtimeerror-cudnn-error-cudnn-status-mapping-error/121648 Or try: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

sent00 commented 1 year ago

I've tried both suggestions, but I get an error in each case. When I run the following command, as described at https://discuss.pytorch.org/t/nvidia-a100-gpu-runtimeerror-cudnn-error-cudnn-status-mapping-error/121648, I get this error:


!pip3 install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

RuntimeError: nvrtc: error: failed to open libnvrtc-builtins.so.11.1. Make sure that libnvrtc-builtins.so.11.1 is installed correctly. nvrtc compilation failed:


I also get this error when I run this code


pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118


-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/content/vits-finetuning/train_ms.py", line 124, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/content/vits-finetuning/train_ms.py", line 162, in train_and_evaluate
    y_hat_mel = mel_spectrogram_torch(
  File "/content/vits-finetuning/mel_processing.py", line 104, in mel_spectrogram_torch
    spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device],
  File "/usr/local/lib/python3.9/dist-packages/torch/functional.py", line 641, in stft
    return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
RuntimeError: stft requires the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release.

SayaSS commented 1 year ago

For the stft return_complex error above, you could try editing line 104 in mel_processing.py:

-  spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device], center=center, pad_mode='reflect', normalized=False, onesided=True)
+  spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device], center=center, pad_mode='reflect', normalized=False, onesided=True, return_complex=False)

sent00 commented 1 year ago

After editing mel_processing.py, I re-ran the code below, but I still get this error. I have tried various versions of PyTorch to test it out, but it reverts back to the CUDA error.

!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/content/vits-finetuning/data_utils.py", line 236, in __getitem__
    return self.get_audio_text_speaker_pair(self.audiopaths_sid_text[index])
  File "/content/vits-finetuning/data_utils.py", line 199, in get_audio_text_speaker_pair
    spec, wav = self.get_audio(audiopath)
  File "/content/vits-finetuning/data_utils.py", line 214, in get_audio
    spec = spectrogram_torch(audio_norm, self.filter_length,
  File "/content/vits-finetuning/mel_processing.py", line 66, in spectrogram_torch
    spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device],
  File "/usr/local/lib/python3.9/dist-packages/torch/functional.py", line 641, in stft
    return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
RuntimeError: stft requires the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 76, in _wrap
    sys.exit(1)
SystemExit: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/process.py", line 318, in _bootstrap
    util._exit_function()
  File "/usr/lib/python3.9/multiprocessing/util.py", line 357, in _exit_function
    p.join()
  File "/usr/lib/python3.9/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 43, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3930) is killed by signal: Terminated. 
Traceback (most recent call last):
  File "/content/vits-finetuning/train_ms.py", line 306, in <module>
    main()
  File "/content/vits-finetuning/train_ms.py", line 56, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 239, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/content/vits-finetuning/train_ms.py", line 124, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/content/vits-finetuning/train_ms.py", line 144, in train_and_evaluate
    for batch_idx, (x, x_lengths, spec, spec_lengths, y, y_lengths, speakers) in enumerate(tqdm(train_loader)):
  File "/usr/local/lib/python3.9/dist-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.9/dist-packages/torch/_utils.py", line 644, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/content/vits-finetuning/data_utils.py", line 236, in __getitem__
    return self.get_audio_text_speaker_pair(self.audiopaths_sid_text[index])
  File "/content/vits-finetuning/data_utils.py", line 199, in get_audio_text_speaker_pair
    spec, wav = self.get_audio(audiopath)
  File "/content/vits-finetuning/data_utils.py", line 214, in get_audio
    spec = spectrogram_torch(audio_norm, self.filter_length,
  File "/content/vits-finetuning/mel_processing.py", line 66, in spectrogram_torch
    spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device],
  File "/usr/local/lib/python3.9/dist-packages/torch/functional.py", line 641, in stft
    return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
RuntimeError: stft requires the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release.
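
The traceback above now points at line 66 of mel_processing.py (spectrogram_torch) rather than line 104, which suggests the file contains more than one torch.stft call that needs the return_complex argument. A minimal sketch for locating such calls with Python's ast module is shown below. The function name and the SAMPLE source are hypothetical stand-ins for illustration, not the repository's actual code.

```python
import ast
from typing import List


def find_stft_calls_missing_return_complex(source: str) -> List[int]:
    """Return line numbers of `<something>.stft(...)` calls lacking a return_complex keyword."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_stft = isinstance(func, ast.Attribute) and func.attr == "stft"
        has_keyword = any(kw.arg == "return_complex" for kw in node.keywords)
        if is_stft and not has_keyword:
            missing.append(node.lineno)
    return sorted(missing)


# Hypothetical stand-in for a file with one patched and one unpatched call.
SAMPLE = '''
import torch

def spectrogram_torch(y, n_fft, window):
    return torch.stft(y, n_fft, window=window)

def mel_spectrogram_torch(y, n_fft, window):
    return torch.stft(y, n_fft, window=window, return_complex=False)
'''

if __name__ == "__main__":
    # Reports only the call in spectrogram_torch, which has no return_complex.
    print(find_stft_calls_missing_return_complex(SAMPLE))
```

To check a real file, the same function could be given the file's contents, for example the text read from mel_processing.py. Every reported line would then need the same return_complex=False edit suggested earlier in the thread.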

sent00 commented 1 year ago

The error has been resolved in today's update. Thank you very much.