NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

[FastPitch1.1/pytorch] training process broken after 100+ epochs because of out of memory #1141

Closed JohnHerry closed 2 years ago

JohnHerry commented 2 years ago

Related to FastPitch1.1/pytorch

Describe the bug

I am training FastPitch on 4x GeForce RTX 3090 (24 GB memory each). My batch_size is set to 10. The training process broke at the 156th epoch, not the first, because of CUDA out of memory. If memory were not enough, why does it not fail in the first epoch? Is there any unused memory that stays unreleased during the training process?
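Not part of the original report, but a minimal sketch of how per-epoch GPU memory could be logged from the training loop to check whether reserved memory really keeps growing across epochs; the helper name and CSV path are hypothetical:

import torch

def log_gpu_memory(epoch, device=0, log_path="gpu_mem.csv"):
    # Current tensor allocations, allocator cache, and per-epoch peak, in MiB.
    alloc = torch.cuda.memory_allocated(device) / 2**20
    reserved = torch.cuda.memory_reserved(device) / 2**20
    peak = torch.cuda.max_memory_allocated(device) / 2**20
    with open(log_path, "a") as f:
        f.write(f"{epoch},{alloc:.1f},{reserved:.1f},{peak:.1f}\n")
    # Reset the peak counter so the next epoch reports its own maximum.
    torch.cuda.reset_peak_memory_stats(device)

# Hypothetical placement at the end of each epoch in train.py:
# for epoch in range(start_epoch, args.epochs + 1):
#     ... training and validation ...
#     log_gpu_memory(epoch)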


rygopu commented 2 years ago

It sounds like you are not releasing memory at the appropriate place. Could you give more information? Which dataset do you use and what is your preprocessing method? Could you plot a graph of memory consumption for every epoch? Did you by any chance change the loss update steps / loader steps?
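For reference (not part of this thread), the CSV written by the hypothetical logger sketched above could be turned into the requested per-epoch plot like this:

import csv
import matplotlib.pyplot as plt

epochs, allocated, reserved, peak = [], [], [], []
with open("gpu_mem.csv") as f:
    for row in csv.reader(f):
        epochs.append(int(row[0]))
        allocated.append(float(row[1]))
        reserved.append(float(row[2]))
        peak.append(float(row[3]))

# One curve per counter; a steady climb in "reserved" or "peak" across epochs
# would point at memory that is not being released.
plt.plot(epochs, allocated, label="allocated (MiB)")
plt.plot(epochs, reserved, label="reserved (MiB)")
plt.plot(epochs, peak, label="peak allocated (MiB)")
plt.xlabel("epoch")
plt.ylabel("GPU memory (MiB)")
plt.legend()
plt.savefig("gpu_mem_per_epoch.png")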

JohnHerry commented 2 years ago

> It sounds like you are not releasing memory at the appropriate place. Could you give more information? Which dataset do you use and what is your preprocessing method? Could you plot a graph of memory consumption for every epoch? Did you by any chance change the loss update steps / loader steps?

Thank you for the help. My dataset contains two speakers, one man and one woman, about 22 hours in total at a 16 kHz sample rate. The longest sample is about 22 s of speech audio and the average sample length is 4 s.
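Not part of the original comment: a quick sketch for finding the longest training samples in mel frames, assuming the filelist format used in the command below (MelPath|F0Path|Text|SpeakerID) and mels saved with torch.save as (n_mels, frames) tensors; hop_length and sampling_rate match the command below:

import torch

hop_length, sampling_rate = 200, 16000  # values from the training command below

lengths = []
with open("filelists/mixlang_train.txt") as f:
    for line in f:
        mel_path = line.strip().split("|")[0]      # first field: path to the mel file
        frames = torch.load(mel_path).shape[-1]    # assumes a (n_mels, frames) tensor
        lengths.append((frames, mel_path))

# Print the ten longest utterances; these dominate the padded batch size.
for frames, path in sorted(lengths, reverse=True)[:10]:
    print(f"{path}: {frames} frames (~{frames * hop_length / sampling_rate:.1f} s)")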

I did not change anything in the main training process except the text_cleaner and symbol_set, because I am training on a dataset in a language other than English. My input is a pure phoneme sequence, so I set p-arpabet to 0.0 to ignore the cmudict. I also set different hop_size, win_size, and filter_length values to fit my audio.
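Not from the thread: a minimal sketch of what a pass-through cleaner for already-phonemized input might look like, assuming that a cleaner is a plain string-to-string function selected by name (as in the repo's common/text module); mixlang_cleaner is just the name passed on the command line below:

import re

_whitespace_re = re.compile(r"\s+")

def mixlang_cleaner(text):
    # Input is already a phoneme sequence, so only collapse repeated
    # whitespace and strip the ends; the phoneme symbols are left untouched.
    return _whitespace_re.sub(" ", text).strip()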

Here is the training command (each line of the training filelist has the format MelPath|F0Path|Text|SpeakerID):

CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python -m torch.distributed.launch --nproc_per_node 4 train.py \
--output=./checkpoints \
--dataset-path=/home/Data/FastPitchDataRoot \
--log-file=./train_logs/dllogger.log \
--epochs=2000 \
--amp --cuda --cudnn-benchmark \
--load-pitch-from-disk  --load-mel-from-disk  \
--learning-rate=0.1 \
--training-files=filelists/mixlang_train.txt \
--validation-files=filelists/mixlang_val.txt \
--text-cleaners=mixlang_cleaner \
--symbol-set=phoneme_symbols \
--p-arpabet=0.0 \
--n-speakers=2 \
--pitch-mean=189.07 --pitch-std=65.35 \
--sampling-rate=16000 \
--filter-length=1024 \
--batch-size=10 \
--hop-length=200 --win-length=800 --mel-fmin=60 --mel-fmax=7600 > training.log 2>&1 &

As for the batch size: the default of 16 breaks the training process in the first epoch, so I tried 12 and then 10, but it still broke after 100+ epochs, as I said.
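A rough back-of-the-envelope check (my own estimate, not from the thread) of why the failure can appear only after many epochs: every sample in a batch is padded to the longest sample in that batch, so with random shuffling, a batch that happens to group several of the ~22 s clips can need several times the memory of a typical batch, and such a batch may not occur until a later epoch:

sampling_rate = 16000
hop_length = 200
batch_size = 10

longest_s = 22.0   # longest utterance reported above
average_s = 4.0    # average utterance reported above

frames_longest = int(longest_s * sampling_rate / hop_length)   # 1760 mel frames
frames_average = int(average_s * sampling_rate / hop_length)   # 320 mel frames

# A batch padded to a 22 s clip holds ~5.5x the frames of an all-average batch,
# which roughly scales the activation memory needed for that step.
print("padded frames, worst-case batch:", frames_longest * batch_size)   # 17600
print("padded frames, average batch:   ", frames_average * batch_size)   # 3200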