Closed · sent00 closed this issue 1 year ago
Not sure if this is going to work for you, but it is worth trying: https://discuss.pytorch.org/t/nvidia-a100-gpu-runtimeerror-cudnn-error-cudnn-status-mapping-error/121648. Alternatively, try: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
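As a quick sanity check after reinstalling, you can print the CUDA version and compute architectures the installed wheel was built for (a minimal sketch; the exact output depends on your environment):

```python
import torch

# PyTorch version and the CUDA toolkit it was built against
# (torch.version.cuda is None on a CPU-only build)
print(torch.__version__)
print(torch.version.cuda)

# Compute capabilities compiled into this build; an A100 needs sm_80
print(torch.cuda.get_arch_list())
```

If `sm_80` is missing from the list, the wheel cannot run on an A100 regardless of driver version.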
I have tried both suggestions, but I get an error in each case. Following https://discuss.pytorch.org/t/nvidia-a100-gpu-runtimeerror-cudnn-error-cudnn-status-mapping-error/121648, when I run the code as described there, I get this error:
RuntimeError: nvrtc: error: failed to open libnvrtc-builtins.so.11.1. Make sure that libnvrtc-builtins.so.11.1 is installed correctly. nvrtc compilation failed:
I also get this error when I run the training code:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/content/vits-finetuning/train_ms.py", line 124, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/content/vits-finetuning/train_ms.py", line 162, in train_and_evaluate
    y_hat_mel = mel_spectrogram_torch(
  File "/content/vits-finetuning/mel_processing.py", line 104, in mel_spectrogram_torch
    spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device],
  File "/usr/local/lib/python3.9/dist-packages/torch/functional.py", line 641, in stft
    return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
RuntimeError: stft requires the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release.
And after running pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/content/vits-finetuning/train_ms.py", line 124, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/content/vits-finetuning/train_ms.py", line 162, in train_and_evaluate
    y_hat_mel = mel_spectrogram_torch(
  File "/content/vits-finetuning/mel_processing.py", line 104, in mel_spectrogram_torch
    spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device],
  File "/usr/local/lib/python3.9/dist-packages/torch/functional.py", line 641, in stft
    return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
RuntimeError: stft requires the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release.
For this error, you could try editing line 104 in mel_processing.py:
- spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device], center=center, pad_mode='reflect', normalized=False, onesided=True)
+ spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device], center=center, pad_mode='reflect', normalized=False, onesided=True, return_complex=False)
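Since newer PyTorch releases will require return_complex=True, an alternative to passing return_complex=False is to request complex output and convert it back to the real layout the rest of the code expects. A hedged sketch (stft_real is a hypothetical wrapper, not part of this repo):

```python
import torch

def stft_real(y, n_fft, hop_size, win_size, window, center=False):
    # Forward-compatible variant of the old torch.stft call: ask for
    # complex output, then convert back to the (..., freq, frames, 2)
    # real-valued layout that return_complex=False used to produce.
    spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size,
                      window=window, center=center, pad_mode='reflect',
                      normalized=False, onesided=True, return_complex=True)
    return torch.view_as_real(spec)
```

This keeps downstream magnitude computations (which index the last dimension) working unchanged while staying compatible with the upcoming default.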
After editing mel_processing.py and re-running the command below, I still get an error. I have tried various PyTorch versions to test it, but it keeps coming back to the CUDA error.
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/content/vits-finetuning/data_utils.py", line 236, in __getitem__
return self.get_audio_text_speaker_pair(self.audiopaths_sid_text[index])
File "/content/vits-finetuning/data_utils.py", line 199, in get_audio_text_speaker_pair
spec, wav = self.get_audio(audiopath)
File "/content/vits-finetuning/data_utils.py", line 214, in get_audio
spec = spectrogram_torch(audio_norm, self.filter_length,
File "/content/vits-finetuning/mel_processing.py", line 66, in spectrogram_torch
spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device],
File "/usr/local/lib/python3.9/dist-packages/torch/functional.py", line 641, in stft
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
RuntimeError: stft requires the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 76, in _wrap
sys.exit(1)
SystemExit: 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.9/multiprocessing/process.py", line 318, in _bootstrap
util._exit_function()
File "/usr/lib/python3.9/multiprocessing/util.py", line 357, in _exit_function
p.join()
File "/usr/lib/python3.9/multiprocessing/process.py", line 149, in join
res = self._popen.wait(timeout)
File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 43, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 27, in poll
pid, sts = os.waitpid(self.pid, flag)
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3930) is killed by signal: Terminated.
Traceback (most recent call last):
File "/content/vits-finetuning/train_ms.py", line 306, in <module>
main()
File "/content/vits-finetuning/train_ms.py", line 56, in main
mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 239, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/content/vits-finetuning/train_ms.py", line 124, in run
train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
File "/content/vits-finetuning/train_ms.py", line 144, in train_and_evaluate
for batch_idx, (x, x_lengths, spec, spec_lengths, y, y_lengths, speakers) in enumerate(tqdm(train_loader)):
File "/usr/local/lib/python3.9/dist-packages/tqdm/std.py", line 1178, in __iter__
for obj in iterable:
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py", line 634, in __next__
data = self._next_data()
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
data.reraise()
File "/usr/local/lib/python3.9/dist-packages/torch/_utils.py", line 644, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/content/vits-finetuning/data_utils.py", line 236, in __getitem__
return self.get_audio_text_speaker_pair(self.audiopaths_sid_text[index])
File "/content/vits-finetuning/data_utils.py", line 199, in get_audio_text_speaker_pair
spec, wav = self.get_audio(audiopath)
File "/content/vits-finetuning/data_utils.py", line 214, in get_audio
spec = spectrogram_torch(audio_norm, self.filter_length,
File "/content/vits-finetuning/mel_processing.py", line 66, in spectrogram_torch
spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device],
File "/usr/local/lib/python3.9/dist-packages/torch/functional.py", line 641, in stft
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
RuntimeError: stft requires the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release.
The error has been resolved in today's update. Thank you very much.
Hello. I am training with this notebook, but after switching to a premium GPU (NVIDIA A100-SXM4-40GB) I get the error below. Is there any way to deal with it? Training works on a Tesla T4, but I was able to use the premium GPU until last week, so I would like to use it if possible.
NVIDIA A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70. If you want to use the NVIDIA A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
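This warning means the installed wheel was not compiled with sm_80 kernels. A minimal sketch to confirm the mismatch from Python (the capability check assumes a CUDA device is visible):

```python
import torch

# Architectures baked into this PyTorch build, e.g. ['sm_37', 'sm_50', ...]
arch_list = torch.cuda.get_arch_list()
print(arch_list)

if torch.cuda.is_available():
    # Compute capability of GPU 0; an A100 reports (8, 0), i.e. sm_80
    major, minor = torch.cuda.get_device_capability(0)
    sm = f"sm_{major}{minor}"
    print(sm, "supported" if sm in arch_list else "NOT supported by this build")
```

If the device's sm tag is not in the arch list, any kernel launch can fail with "invalid device function", which matches the traceback below.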
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Traceback (most recent call last):
  File "/content/vits-finetuning/train_ms.py", line 306, in <module>
main()
File "/content/vits-finetuning/train_ms.py", line 56, in main
mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/content/vits-finetuning/train_ms.py", line 106, in run
    net_g = DDP(net_g, device_ids=[rank])
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/parallel/distributed.py", line 646, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: CUDA error: invalid device function
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.