Arminkhayati closed this issue 1 year ago.
With such long seq lengths, you will inevitably sample a batch of all long audio files, and therefore oom anytime you sample multiple long duration audio files in training.
The problem is that my GPU memory keeps increasing until a CUDA out of memory error pops up. At the start it uses 6 GB, but it grows slowly until it takes all of it. I still couldn't find any solution. Please help me if you have any ideas.
Here are boxplots of the durations of my audio files:
[Train boxplot]
[Test/Validation boxplot]
Look at the outliers there: your mean is around 5-8 seconds, but there are plenty of samples in the 15-30 second range. You're hitting bad batches with samples that are too long. First, I'd suggest dropping all samples longer than 20 seconds, or doing manual segmentation with MFA or CTC segmentation to get everything below 20 seconds.
Then use a reasonable batch size for your GPU. Start small with 4 or 8 and use gradient accumulation.
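To illustrate the gradient accumulation part: in plain PyTorch the idea is to run several small micro-batches per optimizer step, scaling each loss so the result matches one larger batch. This is a generic sketch, not NeMo's internal implementation; in a Lightning/NeMo trainer config the equivalent knob is `accumulate_grad_batches`.

```python
import torch
import torch.nn as nn

# Illustrative values -- pick what fits your GPU.
MICRO_BATCH = 4        # batch size that fits in memory
ACCUM_STEPS = 8        # effective batch = 4 * 8 = 32

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

def train_step(batches):
    """Accumulate gradients over ACCUM_STEPS micro-batches,
    then take a single optimizer step."""
    optimizer.zero_grad()
    for i, (x, y) in enumerate(batches):
        # Scale the loss so the accumulated sum equals the
        # mean loss over the full effective batch.
        loss = loss_fn(model(x), y) / ACCUM_STEPS
        loss.backward()
        if (i + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
```

The memory cost is that of one micro-batch, while the gradient step behaves like the larger effective batch.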
As I mentioned above, I am using gradient accumulation and my batch size is 4. I don't think it's about the duration of my samples, because the model runs for many epochs (20 to 40) but then suddenly stops with a CUDA out of memory error. My GPU memory usage increases over time and I don't know what is causing it. At the start only 6 GB of my memory is used, but over time it grows to 12 GB. That's not normal behavior, and it isn't related to my audio files' size or duration.
For reference, especially with RNNT models, we don't use batch sizes above 8 for a 20-second max duration even with 32 GB of GPU RAM.
For the sake of an experiment, why not just drop the longer audio from the manifest and try one training run with nothing else changed? Cap the max duration to 20 seconds and keep batch size 4.
Here I am using a CTC model, not RNNT. If the audio samples were causing it, it wouldn't run even for a single epoch, because all the data is seen in every epoch. At the start of training it takes half of my GPU's memory, but as training goes on the memory usage keeps increasing. Something that NeMo or Lightning is logging or tracking stays in memory; that is the problem, not the data. In any case, I can't remove any samples because I need all of them.
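One way to test the "something stays in memory" theory is to count the tensors still reachable by the garbage collector at the end of each epoch; if that count grows steadily, some code is holding references (a classic culprit in Lightning-based training is logging or storing a loss tensor that still carries its autograd graph, rather than `loss.detach()` or `loss.item()`). This is a generic PyTorch diagnostic, not a NeMo API:

```python
import gc
import torch

def count_live_tensors():
    """Count tensors still reachable by the garbage collector,
    grouped by device string (e.g. 'cpu', 'cuda:0'). A count
    that keeps growing between epochs suggests leaked references."""
    counts = {}
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                dev = str(obj.device)
                counts[dev] = counts.get(dev, 0) + 1
        except Exception:
            # Some tracked objects can raise on attribute access.
            continue
    return counts
```

Calling this (together with `torch.cuda.memory_allocated()`) at the end of each epoch and comparing the numbers would show whether tensors are accumulating, and on which device.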
You can try checking with reduced data, at least to see whether it trains properly or not. If you're not willing to check that, then I'll have to close the thread, since the most likely reason is the data.
OK, I will, but it will take time to report the result. Thank you.
Describe the bug
Hi, I am trying to train the QuartzNet 10x5 model on my dataset of almost 200 hours of audio. The strange thing is that I got this error:
I can't understand why I get an out of memory error after 44 epochs!
Steps/Code to reproduce bug
TRAIN_DS config:
VALIDATION_DS config:
TRAINER config:
Environment overview (please complete the following information)
Environment details
Additional context
I ran my model on 2 GPUs: a TITAN X (12288 MiB) and a GTX 1080 Ti (11264 MiB).