LAION-AI / CLAP

Contrastive Language-Audio Pretraining
https://arxiv.org/abs/2211.06687
Creative Commons Zero v1.0 Universal

Question about distributed gradient calculation #119

Open Yogurt928 opened 1 year ago

Yogurt928 commented 1 year ago

Dear author,

I'm wondering what the difference is with and without gather_with_grad when collecting tensors from different GPUs: https://github.com/LAION-AI/CLAP/blob/6b1b4b5b4b87f4e19d3836d2ae7d7272e1c69410/src/laion_clap/clap_module/loss.py#L59C11-L59C11 It looks like the final returned tensors (all_audio_features & all_text_features) are the same in both cases?
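To check that I'm reading the branch correctly, here is a rough sketch of the two paths as I understand them. The function and variable names below are my own, not copied from loss.py:

```python
import torch
import torch.distributed as dist
import torch.distributed.nn  # autograd-aware collectives


def gather_features_sketch(features, gather_with_grad, rank, world_size):
    """Rough sketch of my reading of the two branches (illustrative names)."""
    if gather_with_grad:
        # torch.distributed.nn.all_gather keeps the autograd graph for every
        # rank's slice, so gradients can flow back through all of them.
        all_features = torch.cat(torch.distributed.nn.all_gather(features), dim=0)
    else:
        # Plain dist.all_gather copies values only; the gathered tensors carry
        # no gradient history.
        gathered = [torch.zeros_like(features) for _ in range(world_size)]
        dist.all_gather(gathered, features)
        # Put the local tensor back so at least this rank's slice keeps its
        # autograd graph; the other ranks' slices stay detached.
        gathered[rank] = features
        all_features = torch.cat(gathered, dim=0)
    return all_features
```

If that reading is right, the forward values are identical in both branches, and the difference would only be which slices of the gathered tensors can pass gradients back to the encoders. Is that the intended distinction?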

Another related question: during distributed training, how is the gradient calculated? From my reading of the code, each GPU gathers audio_features and text_features from all GPUs and then does the loss calculation and backward pass separately. Doesn't that mean duplicated computation, since the gathered features should be identical on every GPU? Please correct my understanding if it is wrong, thanks!
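To be concrete about the computation I mean, this is roughly its shape (a minimal CLIP-style sketch with my own names, not the repo's exact loss code):

```python
import torch
import torch.nn.functional as F


def contrastive_loss_sketch(all_audio_features, all_text_features, logit_scale):
    """Every rank evaluates the same global (N x N) similarity matrix,
    where N = per_gpu_batch_size * world_size (illustrative names)."""
    logits_per_audio = logit_scale * all_audio_features @ all_text_features.T
    labels = torch.arange(all_audio_features.shape[0],
                          device=all_audio_features.device)
    return (F.cross_entropy(logits_per_audio, labels)
            + F.cross_entropy(logits_per_audio.T, labels)) / 2
```

If every rank runs this full N x N forward and backward, the similarity-matrix computation appears to be repeated world_size times, which is what I mean by duplicate calculation.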

Yogurt928 commented 1 year ago

Moreover, the model has already been wrapped in DistributedDataParallel: https://github.com/LAION-AI/CLAP/blob/6b1b4b5b4b87f4e19d3836d2ae7d7272e1c69410/src/laion_clap/training/main.py#L274

which means gradient synchronization across GPUs should already be handled automatically. In that case, is the feature gathering operation still necessary?
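To make the question concrete: my current guess is that DDP only all-reduces parameter gradients after backward(), whereas the gather changes what the loss itself can see (negatives from the other ranks), so the two would serve different purposes. Below is the step order as I picture it, with made-up names; please correct me if this picture is wrong:

```python
import torch


def train_step_sketch(ddp_model, batch, optimizer, loss_fn, gather_fn):
    """Illustrative order of operations for one distributed step (my reading,
    with invented names), not the repo's actual training loop."""
    audio_feats, text_feats = ddp_model(batch)   # forward on the local shard only

    # The gather exchanges *features* so every rank can score its samples
    # against the other ranks' samples as negatives; DDP never shares activations.
    all_audio, all_text = gather_fn(audio_feats), gather_fn(text_feats)

    loss = loss_fn(all_audio, all_text)          # same value on every rank
    optimizer.zero_grad()
    loss.backward()                              # DDP hooks all-reduce the
    optimizer.step()                             # *parameter gradients* here
    return loss
```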