clovaai / voxceleb_trainer

In defence of metric learning for speaker recognition
MIT License
1.02k stars 272 forks

Q: distributed training doesn't seem to split the dataset between GPUs #98

Closed asimov-aiz closed 3 years ago

asimov-aiz commented 3 years ago

DistributedDataParallel is usually used together with DistributedSampler to split the dataset between GPUs. The current implementation of the voxceleb sampler doesn't seem to be aware of the number of GPUs, and the documentation says the whole dataset will pass through each GPU. Could you help me understand how this works? Why is this identical to single-GPU training even if we divide test_interval and max_epoch?

For example, with a single GPU we train epoch by epoch: starting from checkpoint A, we train one epoch to reach B, then C, then D after 4 epochs.

The current implementation seems to suggest we start from A and train on dataset*4 to reach the next state of A. That doesn't seem the same as A -> train dataset -> B -> train dataset -> C -> train dataset -> D.
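For reference, the conventional behaviour the question alludes to can be sketched in pure Python. This is a simplified mimic of what `torch.utils.data.DistributedSampler` does (the real sampler also shuffles and pads the index list, which is omitted here): each rank draws a disjoint, interleaved subset, so one epoch across all GPUs covers the dataset exactly once in total.

```python
def shard_indices(dataset_size, num_replicas, rank):
    """Round-robin shard: indices that the given rank would draw.

    Simplified mimic of DistributedSampler's partitioning; the real
    sampler also shuffles per epoch and pads so all ranks get equal
    counts.
    """
    return list(range(rank, dataset_size, num_replicas))

# With 8 samples and 4 GPUs, each rank gets a disjoint quarter:
# rank 0 -> [0, 4], rank 1 -> [1, 5], rank 2 -> [2, 6], rank 3 -> [3, 7]
shards = [shard_indices(8, 4, r) for r in range(4)]
```

Under this scheme each epoch processes the dataset once in total, not once per GPU, which is what makes DDP faster than single-GPU training.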

CaptainPrice12 commented 3 years ago

I am also a little confused about it. Generally, DistributedSampler is used to partition the dataset so that each GPU is fed its corresponding part. But the README mentions that each GPU will use the whole dataset in each epoch. So in each epoch, dataset * 4 (if using 4 GPUs) will go through the GPUs, right?

If so, why do we use DDP? For the code in this repo, the training time per epoch with multiple GPUs should be close to single-GPU training, because every GPU processes the whole dataset instead of a part of it (such as dataset / # of GPUs). It seems that DDP here doesn't accelerate training. Could you @joonson please help us with this problem? Can anyone provide some ideas about this issue? Thanks a lot!

joonson commented 3 years ago

We couldn't use PyTorch's DistributedSampler off-the-shelf since we need to first batch samples into pairs, triplets, etc. The recent update contains the capability to divide the pairs, triplets, etc. to each GPU. Please let us know if this works for you.
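The idea described here can be sketched as follows. This is a hedged illustration, not the repository's actual code: the key point is that the split happens *after* samples are grouped into pairs/triplets, so each group stays intact on a single GPU while the groups themselves are divided across ranks.

```python
def split_groups(groups, num_replicas, rank):
    """Divide pre-formed sample groups (pairs, triplets, ...) across ranks.

    Hypothetical sketch: a strided slice over the list of groups, so
    no pair/triplet is ever split between two GPUs.
    """
    return groups[rank::num_replicas]

# Four pre-formed pairs divided across 2 GPUs:
pairs = [(0, 1), (2, 3), (4, 5), (6, 7)]
# rank 0 -> [(0, 1), (4, 5)], rank 1 -> [(2, 3), (6, 7)]
```

Splitting at the group level rather than the raw-index level is why the stock DistributedSampler could not be used off-the-shelf here.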

CaptainPrice12 commented 3 years ago

Thank you so much for the update!

joonson commented 3 years ago

I will assume this is solved and close the issue.