NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

CUDA out of memory after 44 epochs #5230

Closed: Arminkhayati closed this issue 1 year ago

Arminkhayati commented 1 year ago

Describe the bug

Hi, I am trying to run the QuartzNet 10x5 model on my dataset of almost 200 hours of audio. But the strange thing is that I got this error:

RuntimeError: CUDA out of memory. Tried to allocate 130.00 MiB (GPU 0; 11.91 GiB total capacity; 10.99 GiB already allocated; 50.38 MiB free; 11.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I can't understand why I got an out-of-memory error after 44 epochs!
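
As a side note, the allocator hint in the error message can be tried by setting the environment variable before anything touches CUDA; a minimal sketch, where the 128 MB split size is only an assumed starting value, not something taken from this thread:

import os

# Must be set before the first CUDA allocation, i.e. before torch initializes CUDA.
# The 128 MB value is an assumed starting point, not a recommendation from this issue.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after the variable is set so the allocator picks it up

This only mitigates fragmentation; it does not help if allocated memory genuinely keeps growing.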

Steps/Code to reproduce bug

TRAIN_DS config:

  batch_size: 4
  trim_silence: true
  max_duration: 62.0
  normalize_transcripts: false
  shuffle: true
  num_workers: 8
  pin_memory: true
  is_tarred: false
  tarred_audio_filepaths: null
  shuffle_n: 2048
  bucketing_strategy: synced_randomized
  bucketing_batch_size: null
  augmentor:
    white_noise:
      prob: 0.4
      min_level: -90
      max_level: -46
    gain:
      prob: 0.4
      min_gain_dbfs: 0
      max_gain_dbfs: 50
    speed:
      prob: 0.3
      sr: 16000
      resample_type: kaiser_best
      min_speed_rate: 0.5
      max_speed_rate: 2.0
      num_rates: -1
    time_stretch:
      prob: 0.3
      min_speed_rate: 0.5
      max_speed_rate: 2.0
      num_rates: -1
    noise:
      prob: 0.45
      manifest_path: ./noise_manifest.json
      min_snr_db: -4
      max_snr_db: 10
      max_gain_db: 300.0

VALIDATION_DS config:

  batch_size: 4
  normalize_transcripts: false
  shuffle: false
  num_workers: 8
  pin_memory: true
  trim_silence: true

TRAINER config:

trainer = ptl.Trainer(devices=-1,
                      accelerator='gpu',
                      max_epochs=EPOCHS,
                      accumulate_grad_batches=1,
                      enable_checkpointing=False,
                      logger=False,
                      log_every_n_steps=5,
                      check_val_every_n_epoch=1,
                      strategy="ddp")
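
For context, a minimal sketch of how a train_ds/validation_ds config like the one above is typically wired into a NeMo CTC model before trainer.fit; the YAML path and the use of EncDecCTCModel are assumptions, since the issue does not include the full script:

import pytorch_lightning as ptl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

# Hypothetical config path; the issue only shows fragments of the YAML.
cfg = OmegaConf.load("configs/quartznet_10x5.yaml")

trainer = ptl.Trainer(devices=-1, accelerator="gpu", strategy="ddp",
                      max_epochs=cfg.trainer.max_epochs,
                      enable_checkpointing=False, logger=False)

model = nemo_asr.models.EncDecCTCModel(cfg=cfg.model, trainer=trainer)
model.setup_training_data(cfg.model.train_ds)
model.setup_validation_data(cfg.model.validation_ds)

trainer.fit(model)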

Additional context

I ran my model on 2 GPUs: a TITAN X (12288 MiB) and a GTX 1080 Ti (11264 MiB).

titu1994 commented 1 year ago

With such long sequence lengths, you will inevitably sample a batch made up entirely of long audio files, and therefore OOM any time you sample multiple long-duration audio files during training.

Arminkhayati commented 1 year ago

The problem is that my GPU memory keeps increasing until CUDA out of memory pops up. At the start it is 6 GB, but it increases slowly until it takes all of it. I still couldn't find any solution. Please help me if you have any.

Here are boxplots of the durations of my audio files (screenshots attached for the train set and for the test/validation set).
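
As a hedged aside, duration statistics like these can be computed directly from a NeMo-style manifest (one JSON object per line with a "duration" field); the manifest path below is a placeholder:

import json
import numpy as np

# Placeholder path; point this at the actual train manifest.
durations = []
with open("train_manifest.json") as f:
    for line in f:
        durations.append(json.loads(line)["duration"])

durations = np.array(durations)
print(f"count={len(durations)}, mean={durations.mean():.1f}s, "
      f"p50={np.percentile(durations, 50):.1f}s, "
      f"p95={np.percentile(durations, 95):.1f}s, "
      f"max={durations.max():.1f}s")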

titu1994 commented 1 year ago

Look at the outliers there: your mean is around 5-8 seconds, with plenty of samples in the 15-30 second range. You're hitting bad batches with too-long samples. First, I'd suggest dropping all samples longer than 20 seconds, or doing manual segmentation with MFA or CTC segmentation to get them below 20 seconds.

Then use a reasonable batch size for your GPU. Start small with 4 or 8 and use grad accumulation.
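
A minimal sketch of the suggested filtering, assuming a NeMo-style manifest with one JSON object per line and a "duration" field; the file names are placeholders:

import json

MAX_DURATION = 20.0  # seconds, the cap suggested above

# Placeholder paths for the original and the filtered manifest.
kept = dropped = 0
with open("train_manifest.json") as src, open("train_manifest_max20s.json", "w") as dst:
    for line in src:
        entry = json.loads(line)
        if entry["duration"] <= MAX_DURATION:
            dst.write(json.dumps(entry) + "\n")
            kept += 1
        else:
            dropped += 1

print(f"kept {kept} samples, dropped {dropped} longer than {MAX_DURATION}s")

Alternatively, the train_ds config shown earlier already exposes max_duration (currently 62.0); lowering it to 20.0 filters at load time without rewriting the manifest.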

Arminkhayati commented 1 year ago

> Look at the outliers there: your mean is around 5-8 seconds, with plenty of samples in the 15-30 second range. You're hitting bad batches with too-long samples. First, I'd suggest dropping all samples longer than 20 seconds, or doing manual segmentation with MFA or CTC segmentation to get them below 20 seconds.
>
> Then use a reasonable batch size for your GPU. Start small with 4 or 8 and use grad accumulation.

As I mentioned above, I am using grad accumulation and my batch size is 4. I don't think it's about the duration of my samples, because the model runs for many epochs (like 20 to 40) and then suddenly stops with a CUDA out of memory error. My GPU memory usage increases over time and I don't know what is causing it. At the start only 6 GB of my memory is taken, but over time it goes up to 12 GB. That's not normal behavior, and it's not related to the size or duration of my audio files.

titu1994 commented 1 year ago

For reference, especially if you use RNNT models, we don't use batch sizes above 8 for a 20-second max duration, even with 32 GB of GPU RAM.

For the sake of an experiment, why not just drop the longer audio from the manifest and try one training run with nothing else changed? Cap max duration at 20 seconds and keep batch size 4.

Arminkhayati commented 1 year ago

> For reference, especially if you use RNNT models, we don't use batch sizes above 8 for a 20-second max duration, even with 32 GB of GPU RAM.
>
> For the sake of an experiment, why not just drop the longer audio from the manifest and try one training run with nothing else changed? Cap max duration at 20 seconds and keep batch size 4.

Here I am using a CTC model, not RNNT. If the audio samples were causing it, then it wouldn't run even for a single epoch, because all of the data is seen in each epoch. At the start of training it takes half of my GPU memory, but as training goes on the memory usage keeps increasing. Something that NeMo or Lightning is logging or tracking stays in memory; that is the problem, not the data. In any case, I can't remove any samples, because I need all of them.
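
One way to check whether allocated memory really grows independently of batch contents is a small Lightning callback that prints CUDA memory at the end of each epoch; this is only a diagnostic sketch, and the class name and usage are not from this thread:

import torch
import pytorch_lightning as ptl

class CudaMemoryMonitor(ptl.Callback):
    """Prints allocated/reserved CUDA memory so growth across epochs is visible."""

    def on_train_epoch_end(self, trainer, pl_module):
        alloc = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        print(f"epoch {trainer.current_epoch}: "
              f"allocated={alloc:.2f} GiB, reserved={reserved:.2f} GiB")

# Hypothetical usage: pass it to the existing Trainer call.
# trainer = ptl.Trainer(..., callbacks=[CudaMemoryMonitor()])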

titu1994 commented 1 year ago

You can try and check with reduced data, at least to see whether it trains properly or not. If you're not willing to check that, then I'll have to close the thread, since the most likely cause is the data.

Arminkhayati commented 1 year ago

> You can try and check with reduced data, at least to see whether it trains properly or not. If you're not willing to check that, then I'll have to close the thread, since the most likely cause is the data.

OK, I will, but it will take time to report the results. Thank you.