Describe the bug
When training a larger model that consumes more memory, I noticed that training would stop after a constant number of epochs. Upon further investigation, I found that during training/validation, CPU memory (RAM) usage keeps rising and is never released, which results in an OOM after a number of epochs. This happens with the Lhotse dataloader: with the same version of NeMo, I can train a small fast-conformer CTC model for hundreds of epochs, but an XL fast-conformer CTC model only trains for ~20 epochs (~110 epochs if I use 1/4 of the workers for the dataloader). So somehow the dataloader is not releasing memory for the data it has loaded.
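A minimal way to watch the growth (a sketch, not part of NeMo; assumes psutil is installed) is a PyTorch Lightning callback that logs the combined RSS of the main process and its child processes, since dataloader workers run as separate processes:

```python
import os

import psutil
import pytorch_lightning as pl


class RssMonitor(pl.Callback):
    """Log resident set size (RSS) of the training process and its children
    (dataloader workers are separate processes) at every epoch boundary."""

    @staticmethod
    def _total_rss_gb() -> float:
        proc = psutil.Process(os.getpid())
        total = proc.memory_info().rss
        for child in proc.children(recursive=True):
            try:
                total += child.memory_info().rss
            except psutil.NoSuchProcess:
                pass  # a worker may exit between listing and reading
        return total / 1e9

    def on_train_epoch_end(self, trainer, pl_module):
        print(f"epoch {trainer.current_epoch} train end: {self._total_rss_gb():.2f} GB RSS")

    def on_validation_epoch_end(self, trainer, pl_module):
        print(f"epoch {trainer.current_epoch} val end: {self._total_rss_gb():.2f} GB RSS")
```

Appending this to the trainer's callbacks (e.g. `trainer.callbacks.append(RssMonitor())` before `trainer.fit`) should make the per-epoch growth visible instead of only seeing the final OOM.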
Steps/Code to reproduce bug
The leak shows up with the XL fast-conformer CTC model and with the medium fast-conformer hybrid CTC/RNNT model, using the standard configs for these models in terms of hyperparameters.
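For reference, the runs use the stock NeMo example script along these lines (a sketch: the script and config paths follow the standard NeMo examples layout, manifest and tokenizer paths are placeholders, and the XL/medium sizing comes from scaling the standard config's hyperparameters):

```bash
python examples/asr/asr_ctc/speech_to_text_ctc_bpe.py \
    --config-path=../conf/fastconformer \
    --config-name=fast-conformer_ctc_bpe \
    model.train_ds.manifest_filepath=/path/to/train_manifest.json \
    model.validation_ds.manifest_filepath=/path/to/val_manifest.json \
    model.tokenizer.dir=/path/to/tokenizer \
    model.tokenizer.type=bpe \
    ++model.train_ds.use_lhotse=true \
    ++model.validation_ds.use_lhotse=true \
    trainer.max_epochs=300
```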
Expected behavior
Training should continue until the specified stopping point.
Environment overview (please complete the following information)
Environment details
Additional context
GPU: A100 40G.
@nithinraok suggested trying limit_validation_batches, using shorter-duration audio, and using fully_randomized, which I haven't fully tested yet. I'll report back when these are tested, but the issue persists so far.
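For the record, this is roughly how I understand the suggestions as Hydra overrides on the repro command above (a sketch: `limit_val_batches` is the PyTorch Lightning spelling of the first suggestion, `max_duration` caps audio length, and I haven't confirmed the exact config key for the fully_randomized option):

```bash
# Same repro command as above, with the suggested mitigations added:
python examples/asr/asr_ctc/speech_to_text_ctc_bpe.py \
    --config-path=../conf/fastconformer \
    --config-name=fast-conformer_ctc_bpe \
    model.train_ds.manifest_filepath=/path/to/train_manifest.json \
    model.validation_ds.manifest_filepath=/path/to/val_manifest.json \
    ++model.train_ds.use_lhotse=true \
    trainer.limit_val_batches=0.25 \
    ++model.train_ds.max_duration=20.0
# The fully_randomized suggestion concerns Lhotse's shuffling/shard-seed
# behavior; the exact config key depends on the NeMo version, so it is
# not shown here.
```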