NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

OOM with RAM with Lhotse #11303

Open riqiang-dp opened 2 hours ago

riqiang-dp commented 2 hours ago

Describe the bug

When training a model that consumes more memory, I noticed that training would stop after a consistent number of epochs. Upon further investigation, I found that during training/validation, CPU memory (RAM) usage keeps rising and is never released, which results in an OOM after a number of epochs. This happens while using the Lhotse dataloader: with the same version of NeMo I was able to train a small fast-conformer CTC model for hundreds of epochs, but an XL fast-conformer CTC model only runs for ~20 epochs (~110 epochs if I use 1/4 of the workers for the dataloader). So somehow the dataloader is not releasing the memory for the data it loads.
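To quantify the growth, I attach a small memory probe to the trainer. This is a minimal sketch and not part of NeMo; it assumes psutil and pytorch_lightning are installed, and it sums the RSS of the main process and its child processes (the dataloader workers) at the end of each train/validation epoch:

import psutil
from pytorch_lightning import Callback


class RssLogger(Callback):
    """Logs the combined resident set size of the trainer process and its children."""

    def _total_rss_gib(self) -> float:
        proc = psutil.Process()
        total = proc.memory_info().rss
        # Dataloader workers are child processes; include them in the total.
        for child in proc.children(recursive=True):
            try:
                total += child.memory_info().rss
            except psutil.NoSuchProcess:
                pass
        return total / 1024 ** 3

    def on_train_epoch_end(self, trainer, pl_module):
        print(f"[epoch {trainer.current_epoch}] total RSS: {self._total_rss_gib():.2f} GiB")

    def on_validation_epoch_end(self, trainer, pl_module):
        print(f"[val @ step {trainer.global_step}] total RSS: {self._total_rss_gib():.2f} GiB")

Adding an instance of this callback to the trainer's callbacks list makes the per-epoch growth visible in the logs.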

Steps/Code to reproduce bug

trainer:
  devices: -1
  num_nodes: 1
  max_epochs: 150
  max_steps: 150000
  val_check_interval: 1000
  accelerator: auto
  strategy: ddp
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0
  precision: bf16-mixed
  log_every_n_steps: 200
  enable_progress_bar: true
  num_sanity_val_steps: 1
  check_val_every_n_epoch: 1
  sync_batchnorm: true
  enable_checkpointing: false
  logger: false
  benchmark: false
  use_distributed_sampler: false
  limit_train_batches: 1000
train_ds:
  manifest_filepath: null
  sample_rate: 16000
  batch_size: null
  shuffle: true
  num_workers: 8
  pin_memory: true
  max_duration: 45
  min_duration: 1
  is_tarred: false
  tarred_audio_filepaths: null
  shuffle_n: 2048
  bucketing_strategy: synced_randomized
  bucketing_batch_size: null
  shar_path:
  - xxxxx
  use_lhotse: true
  bucket_duration_bins:
  - xxx
  batch_duration: 600
  quadratic_duration: 30
  num_buckets: 30
  bucket_buffer_size: 10000
  shuffle_buffer_size: 10000
  num_cuts_for_bins_estimate: 10000
  use_bucketing: true
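For what it's worth, here is a rough back-of-envelope (an assumption on my side, not a confirmed root cause) for why reducing the worker count pushes the OOM out: if each dataloader worker holds its own shuffle and bucketing buffers, the number of cuts resident in RAM scales with num_workers.

# Assumed per-worker buffering; numbers taken from the config above.
num_workers = 8
shuffle_buffer_size = 10_000   # cuts buffered per worker for shuffling
bucket_buffer_size = 10_000    # cuts buffered per worker for bucketing

resident_cuts = num_workers * (shuffle_buffer_size + bucket_buffer_size)
print(f"~{resident_cuts:,} cuts resident across workers")  # ~160,000 at 8 workers, ~40,000 at 2

That would only explain a higher baseline footprint, though; the steady growth across epochs still suggests something is not being released between epochs.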

Expected behavior

Training should continue until the specified stopping point (max_epochs / max_steps).

Environment overview (please complete the following information)

Environment details

Additional context

GPU: A100 40G.

@nithinraok suggested trying limit_val_batches, using shorter-duration audio, and using fully_randomized, which I haven't fully tested yet. I'll report back once these are tested, but the issue persists so far.
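For reference, a sketch of how those suggestions map onto the config above (key names mirror this config; applied here with OmegaConf purely as an illustration, not a verified fix):

from omegaconf import OmegaConf

cfg = OmegaConf.load("config.yaml")    # placeholder path to the config shown above
cfg.trainer.limit_val_batches = 100    # cap how many validation batches run per check
cfg.train_ds.max_duration = 30         # drop the longest utterances
cfg.train_ds.num_workers = 2           # fewer workers -> fewer per-worker buffers
OmegaConf.save(cfg, "config_mitigations.yaml")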

nithinraok commented 2 hours ago

@pzelasko fyi