NVIDIA / flowtron

Flowtron is an auto-regressive flow-based generative network for text to speech synthesis with control over speech variation and style transfer
https://nv-adlr.github.io/Flowtron
Apache License 2.0

Training with 2080 Ti #83

Closed zefyrr closed 3 years ago

zefyrr commented 4 years ago

Wanted to check whether anyone in the community has experience training on the LibriTTS dataset with a 2080 Ti (11 GB VRAM). I was able to downsample LibriTTS to 22050 Hz, but then hit a wall with a CUDA device-side assert error.

Any recommendations would be greatly appreciated! Thanks.
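
(Side note: because CUDA kernels launch asynchronously, a device-side assert is usually reported with a misleading Python stack trace. Below is a rough sketch of re-running training with synchronous launches so the trace points at the operation that actually failed; the train.py arguments are assumptions, adjust to your normal invocation.)

# Sketch: re-run training with CUDA_LAUNCH_BLOCKING=1 so a device-side assert
# is raised at the kernel that triggered it, not at a later synchronization point.
# The command-line arguments below are assumptions; use your usual invocation.
import os
import subprocess

env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")  # force synchronous kernel launches
subprocess.run(["python", "train.py", "-c", "config.json"], env=env, check=True)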

CookiePPP commented 4 years ago

Could you post the exact error? (e.g. a screenshot or copy & paste)

zefyrr commented 4 years ago

Please find attached: config.json.txt out.log

deepglugs commented 4 years ago

I have trained on an RTX 2060 Super and an RTX Titan. Depending on your utterance lengths, 11 GB may not always be enough. For LJSpeech with FP16, 8 GB seems to be enough. I'm on PyTorch 1.6 with CUDA 10.2 and 11.1 installed.

Also, are you training against LibriTTS and validating against LJSpeech?
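
For reference, here is a minimal sketch of how those two memory knobs could be set, assuming the fp16_run and batch_size keys under train_config in the repo's config.json (check the names against your copy):

# Sketch: write a reduced-memory variant of the config. The key names
# (train_config, fp16_run, batch_size) are taken from the repo's config.json
# and should be verified; the output filename is just an example.
import json

with open("config.json") as f:
    config = json.load(f)

config["train_config"]["fp16_run"] = True   # mixed precision roughly halves activation memory
config["train_config"]["batch_size"] = 1    # long LibriTTS utterances are the main VRAM driver

with open("config_libritts_lowmem.json", "w") as f:
    json.dump(config, f, indent=4)

If FP16 with a batch size of 1 is still not enough, trimming very long utterances from the training filelist is the other common lever.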

CookiePPP commented 4 years ago

the error

RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

doesn't match anything I've seen from a VRAM OOM.

My guess is that the CUDA / cuDNN / PyTorch install is broken.

(I've trained with an RTX 2080 Ti and a GTX 1080 Ti on this repo and had no problems; I train on Linux and do inference on both Windows and Linux.)
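
A quick way to rule that in or out, as a rough sketch independent of Flowtron: print the versions PyTorch actually sees and run a tiny cuDNN convolution. If this minimal script fails with the same CUDNN_STATUS_MAPPING_ERROR, the environment rather than the repo is the problem.

# Sketch: sanity-check the CUDA / cuDNN / PyTorch stack outside of Flowtron.
import torch

print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
print("cuda version:", torch.version.cuda)
print("cudnn version:", torch.backends.cudnn.version())
print("device:", torch.cuda.get_device_name(0))

# A small conv forward/backward exercises the same cuDNN paths training uses.
x = torch.randn(4, 3, 64, 64, device="cuda", requires_grad=True)
conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).cuda()
conv(x).sum().backward()
print("cuDNN conv forward/backward OK")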

rafaelvalle commented 4 years ago

@zefyrr what PyTorch version are you on, and are you using AMP?

zefyrr commented 3 years ago

Thanks for all the comments. I updated to the Docker image nvcr.io/nvidia/pytorch:20.07-py3, which has PyTorch 1.6 and CUDA 11.

Here are the latest files: config_libritts.json.txt train.log.txt

Am now seeing:

Traceback (most recent call last):
  File "train.py", line 300, in <module>
    train(n_gpus, rank, **train_config)
  File "train.py", line 217, in train
    for batch in train_loader:
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 347, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 973, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 998, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/workspace/flowtron/data.py", line 102, in __getitem__
    mel = self.get_mel(audio)
  File "/workspace/flowtron/data.py", line 78, in get_mel
    melspec = self.stft.mel_spectrogram(audio_norm)
  File "/workspace/flowtron/audio_processing.py", line 127, in mel_spectrogram
    assert(torch.min(y.data) >= -1)
RuntimeError: operation does not have an identity.

I resampled LibriTTS using this command:

sox input.wav -r 22050 output.wav
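
For anyone hitting the same trace: torch.min raises "operation does not have an identity" when it is handed an empty tensor, so the assert in audio_processing.py is most likely being reached with a zero-length wav rather than with out-of-range samples. Here is a rough sketch of a scan over the training filelist (hypothetical script name check_wavs.py; it assumes the repo's pipe-separated "path|text|speaker" filelist format and its MAX_WAV_VALUE of 32768.0):

# Sketch: flag wavs that would break the mel_spectrogram assert, i.e. files
# that load as empty arrays, have samples outside [-1, 1], or were not
# resampled to 22050 Hz. The filelist format and MAX_WAV_VALUE are
# assumptions based on the repo's data loading code.
import sys
from scipy.io.wavfile import read

MAX_WAV_VALUE = 32768.0

with open(sys.argv[1], encoding="utf-8") as f:
    paths = [line.split("|")[0] for line in f if line.strip()]

for path in paths:
    sampling_rate, data = read(path)
    audio = data / MAX_WAV_VALUE
    if len(audio) == 0:
        print("EMPTY:", path)
    elif audio.min() < -1 or audio.max() > 1:
        print("OUT OF RANGE:", path)
    if sampling_rate != 22050:
        print("WRONG SAMPLE RATE:", sampling_rate, path)

Run it as python check_wavs.py <your training filelist> and remove or re-export anything it flags.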

zefyrr commented 3 years ago

@rafaelvalle: I am not using AMP, and I am only using one GPU. That should be OK, right?

zefyrr commented 3 years ago

Was able to resolve the issue 👍. This was helpful: https://github.com/NVIDIA/flowtron/issues/9#issuecomment-629628804