NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License

multiple gpu training imbalance #497

Closed tranmanhdat closed 2 years ago

tranmanhdat commented 3 years ago

I run training with 4x RTX 2080 (8 GB) and batch_size=16; memory usage is shown in the screenshot below. I don't know why GPU 1 takes more memory than the others, and when I change batch_size to 32 it leads to OOM.

Training command: CUDA_VISIBLE_DEVICES=0,1,2,3 python -m multiproc train.py --output_directory=outdir_voice --log_directory=logdir_voice --n_gpus=4 --hparams=distributed_run=True,fp16_run=True -c outdir_voice/checkpoint_0 --warm_start

Memory usage when batch_size=16: [Screenshot from 2021-07-02 13-31-53]

v-nhandt21 commented 2 years ago

Same problem. Have you solved it? I remember I ran into this before but forget the reason; it may be related to cuDNN and distributed training.

CookiePPP commented 2 years ago

@v-nhandt21 I would also check the lengths of your audio files, since everything is zero-padded: a single audio file that's double the length of the others will cause VRAM usage to double on the card it lands on, but not on the others.
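For example, a minimal sketch for spotting outlier clip lengths (it assumes standalone .wav files under a wavs/ directory and the soundfile package, which are not specific to this repo; adjust to your dataset layout):

```python
# Rough sketch: list the longest clips in the dataset.
# The wavs/ path and the `soundfile` dependency are assumptions; adapt as needed.
import glob
import soundfile as sf

durations = []
for path in glob.glob("wavs/*.wav"):
    info = sf.info(path)                      # reads the header only, no decoding
    durations.append((info.frames / info.samplerate, path))

durations.sort(reverse=True)
for seconds, path in durations[:10]:
    print(f"{seconds:7.2f}s  {path}")
```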

(And the reason it only affects one card is that, with distributed training, shuffling only happens once before epoch 0; see the DistributedSampler warning from the PyTorch docs, quoted below, for how to add shuffling between epochs.)

In distributed mode, calling the set_epoch() method at the beginning of each epoch before creating the DataLoader iterator is necessary to make shuffling work properly across multiple epochs. Otherwise, the same ordering will be always used.
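Concretely, that means calling set_epoch() on the DistributedSampler at the top of the epoch loop. A minimal generic sketch (the toy dataset and hyperparameters are placeholders, not this repo's exact train.py):

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Toy dataset just to illustrate; replace with the real (text, mel) dataset.
train_set = TensorDataset(torch.randn(256, 10))

# DistributedSampler needs an initialized process group (or explicit num_replicas/rank).
sampler = DistributedSampler(train_set, shuffle=True)
loader = DataLoader(train_set, batch_size=16, sampler=sampler,
                    num_workers=2, pin_memory=True)

for epoch in range(100):
    # Without this call, every epoch reuses the epoch-0 shuffle order,
    # so the same (possibly very long) files keep landing on the same GPU.
    sampler.set_epoch(epoch)
    for (batch,) in loader:
        pass  # forward / backward / optimizer step goes here
```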

v-nhandt21 commented 2 years ago


Great, thank you! One more thing I'll note here for anyone in the future: the world_size in distributed training should be set equal to the number of GPUs. In my case, I use CUDA 11.1 and PyTorch 1.9 to make everything work. Should we close this issue?
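For illustration, a sketch of that setup with one process per GPU and world_size equal to the GPU count (the TCP address and the worker function are placeholders, not this repo's multiproc/train.py code):

```python
# Sketch: spawn one process per GPU, world_size == number of GPUs.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:54321",  # placeholder address
        world_size=world_size,                # == number of GPUs
        rank=rank,
    )
    torch.cuda.set_device(rank)
    # ... build model and dataloader, run the training loop here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)
```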

tranmanhdat commented 2 years ago


I will try, thank you.