clovaai / voxceleb_trainer

In defence of metric learning for speaker recognition
MIT License

Bug with Multi gpu training #93

Closed. 009deep closed this issue 3 years ago.

009deep commented 3 years ago

Thanks for incorporating multi-GPU support with DistributedDataParallel. It works as described, but I have observed a bug at the end of the process. Below are the details:

For multi-GPU training, the job does not stop gracefully. For example, if I run a job for 10 epochs with a test interval of 1, I see the following issues:

  1. The job stops after 10 epochs, but the processes do not end and all the GPUs stay occupied. I need to force-kill the processes to release the resources.
  2. The model at the end of the last epoch (epoch 10 in this case) is not saved; the models for epochs 1-9 are saved.

Any suggestions for fixing the above bug?
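
For reference, this is roughly the clean shutdown I would expect at the end of a run (a minimal sketch with placeholder names and addresses, not this repo's actual code):

```python
import torch.distributed as dist
import torch.multiprocessing as mp


def main_worker(rank, world_size):
    # One process per GPU: join the process group first.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:23456",  # placeholder address/port
        rank=rank,
        world_size=world_size,
    )

    # ... training loop: train for the configured number of epochs,
    # evaluate at each test interval, and save a checkpoint from rank 0
    # after every epoch, including the last one ...

    # Wait for every rank to finish, then tear down the process group
    # so all spawned processes can exit and release their GPUs.
    dist.barrier()
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 4  # placeholder: number of GPUs
    mp.spawn(main_worker, args=(world_size,), nprocs=world_size)
```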

CaptainPrice12 commented 3 years ago

Hi @009deep, have you run into the issue of low GPU utilization when using multi-GPU training (DistributedDataParallel)?

It is weird: when I run the code on multiple GPUs (4 or 6), GPU utilization stays very low (such as 0%) most of the time, and training is very slow. Can anyone help with this issue? Thanks!

009deep commented 3 years ago

GPU utilization is subject to your batch size and network architecture, but if you are using it correctly, it should never be close to 0.

CaptainPrice12 commented 3 years ago

Thanks for the reply @009deep. The problem might be my hard disk. By the way, I am curious about how the voxceleb_sampler works, because it seems that every GPU uses the whole dataset instead of an individual part in every epoch, as mentioned in the documents for distributed mode here. If so, I don't understand the point of using DDP for this repo.

009deep commented 3 years ago

That's how DDP is supposed to work. Maybe what you are looking for is DistributedParallel. Here is the difference. I raised a PR with just DistributedParallel, and its implementation with the old code can still be found in the repo here. As mentioned in the PyTorch docs, and in my experience, DDP is more performant than DP.
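
Roughly, the difference in how the model is wrapped looks like this (a sketch with a placeholder model and device ids, not this repo's code):

```python
import torch.nn as nn

model = nn.Linear(512, 256).cuda()  # placeholder model

# DataParallel (DP): a single process replicates the model to every GPU on
# each forward pass and scatters the input batch across them.
dp_model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])

# DistributedDataParallel (DDP): one process per GPU, each holding its own
# replica; gradients are averaged with all-reduce during backward.
# Requires torch.distributed.init_process_group() in every process first,
# so it is shown commented out here.
# ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])
```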

CaptainPrice12 commented 3 years ago

> That's how DDP is supposed to work. Maybe what you are looking for is DistributedParallel. Here is the difference. I raised a PR with just DistributedParallel, and its implementation with the old code can still be found in the repo here. As mentioned in the PyTorch docs, and in my experience, DDP is more performant than DP.

Thanks for the reply! I guess the 'DistributedParallel' you mentioned should be DataParallel (DP). I know the differences between DP and DDP. But for DistributedDataParallel (DDP), a sampler (such as DistributedSampler) is generally used to partition the dataset and make sure every GPU gets a unique part of the data instead of the whole dataset. Although DP and DDP are different, every GPU generally sees only part of the dataset, not the whole dataset. For the distributed mode here, it seems one epoch traverses the whole dataset multiple times (once per GPU), so it is more like every GPU trains an individual model on the whole dataset, which is not much different from training the model on a single GPU except that one epoch now contains four duplicated passes.

Generally, most experiments using DDP use a sampler (such as DistributedSampler, https://pytorch.org/docs/stable/_modules/torch/utils/data/distributed.html#DistributedSampler) to ensure each GPU sees only a unique subset of the whole dataset. In this way, the effective batch size is larger (batch size * number of GPUs) and the training process can be accelerated. I am not sure if I misunderstood your reply; please let me know if anything is incorrect. Thanks!
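
For concreteness, the usual DDP sampler pattern looks something like this (a sketch with a placeholder dataset; the repo's voxceleb_sampler may of course differ):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder dataset standing in for the VoxCeleb training list.
train_dataset = TensorDataset(torch.randn(1000, 40, 200),
                              torch.randint(0, 10, (1000,)))

# Assumes torch.distributed.init_process_group() has already been called
# in this process; the sampler reads the rank and world size from it.
sampler = DistributedSampler(train_dataset, shuffle=True)
loader = DataLoader(train_dataset, batch_size=100, sampler=sampler,
                    num_workers=4, pin_memory=True)

for epoch in range(10):
    sampler.set_epoch(epoch)  # new shuffle each epoch, consistent across ranks
    for features, labels in loader:
        pass  # each rank iterates over a disjoint 1/world_size slice per epoch
```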

009deep commented 3 years ago

I see your point there. There does not seem to be any implementation of non-overlapping data for different GPUs. I still think the way DDP works, with gradient all-reduce across GPUs happening alongside gradient computation after the forward pass, gives better results than DataParallel. Or I'd say that's what the results seem to indicate.
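
To be clear, by gradient all-reduce I mean roughly the following manual averaging, which DDP performs automatically through hooks during backward (illustrative only):

```python
import torch.distributed as dist


def average_gradients(model):
    """Manually average gradients across ranks -- roughly what DDP's
    gradient hooks do automatically during loss.backward()."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size
```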

009deep commented 3 years ago

Also, keep in mind that the exact audio segment selection and the augmentation are random for each audio file, so even though overlap is possible, it is less likely.
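
To illustrate what I mean (this is not the repo's actual loader, just a sketch): even if two GPUs happen to draw the same utterance, the crop and augmentation they apply will usually differ.

```python
import random

import numpy as np


def load_random_segment(audio, num_frames):
    """Illustrative sketch only: take a random fixed-length crop and,
    sometimes, apply a random augmentation."""
    # Assumes the waveform is at least num_frames samples long.
    start = random.randint(0, len(audio) - num_frames)
    segment = np.array(audio[start:start + num_frames], dtype=float)
    # A random augmentation choice (noise here as a stand-in) makes it even
    # less likely that two GPUs ever process identical inputs for one file.
    if random.random() < 0.5:
        segment = segment + 0.01 * np.random.randn(num_frames)
    return segment
```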

joonson commented 3 years ago

This has been implemented in the recent update. Please check it out! See #98 for more discussions.