LAION-AI / CLAP

Contrastive Language-Audio Pretraining
https://arxiv.org/abs/2211.06687
Creative Commons Zero v1.0 Universal

Question about distributed gradient calculation #119

Open Yogurt928 opened 1 year ago

Yogurt928 commented 1 year ago

Dear author,

I'm wondering what the difference is with and without gather_with_grad when collecting tensors from different GPUs: https://github.com/LAION-AI/CLAP/blob/6b1b4b5b4b87f4e19d3836d2ae7d7272e1c69410/src/laion_clap/clap_module/loss.py#L59C11-L59C11 It looks like the final returned tensors (all_audio_features & all_text_features) are the same in both cases?
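To check that I'm reading the branch correctly, here is a rough sketch of the two paths as I understand them. The function and variable names below are my own, not copied from loss.py:

```python
import torch
import torch.distributed as dist
import torch.distributed.nn  # autograd-aware collectives


def gather_features_sketch(features, gather_with_grad, rank, world_size):
    """Rough sketch of my reading of the two branches (illustrative names)."""
    if gather_with_grad:
        # torch.distributed.nn.all_gather keeps the autograd graph for every
        # rank's slice, so gradients can flow back through all of them.
        all_features = torch.cat(torch.distributed.nn.all_gather(features), dim=0)
    else:
        # Plain dist.all_gather copies values only; the gathered tensors carry
        # no gradient history.
        gathered = [torch.zeros_like(features) for _ in range(world_size)]
        dist.all_gather(gathered, features)
        # Put the local tensor back so at least this rank's slice keeps its
        # autograd graph; the other ranks' slices stay detached.
        gathered[rank] = features
        all_features = torch.cat(gathered, dim=0)
    return all_features
```

If that reading is right, the forward values are identical in both branches, and the difference would only be which slices of the gathered tensors can pass gradients back to the encoders. Is that the intended distinction?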

Another related question: during distributed training, how is the gradient calculated? From my reading of the code, each GPU gathers audio_features and text_features from all GPUs and then does the loss calculation and backward pass separately. Doesn't that mean duplicated computation, since the gathered features should be identical on every GPU? Please correct my understanding if it is wrong, thanks!
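To be concrete about the computation I mean, this is roughly its shape (a minimal CLIP-style sketch with my own names, not the repo's exact loss code):

```python
import torch
import torch.nn.functional as F


def contrastive_loss_sketch(all_audio_features, all_text_features, logit_scale):
    """Every rank evaluates the same global (N x N) similarity matrix,
    where N = per_gpu_batch_size * world_size (illustrative names)."""
    logits_per_audio = logit_scale * all_audio_features @ all_text_features.T
    labels = torch.arange(all_audio_features.shape[0],
                          device=all_audio_features.device)
    return (F.cross_entropy(logits_per_audio, labels)
            + F.cross_entropy(logits_per_audio.T, labels)) / 2
```

If every rank runs this full N x N forward and backward, the similarity-matrix computation appears to be repeated world_size times, which is what I mean by duplicate calculation.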

Yogurt928 commented 1 year ago

Moreover, the model has already been wrapped in DistributedDataParallel: https://github.com/LAION-AI/CLAP/blob/6b1b4b5b4b87f4e19d3836d2ae7d7272e1c69410/src/laion_clap/training/main.py#L274

which means gradient synchronization across GPUs should already be handled automatically. In that case, is the feature gathering operation still necessary?
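To make the question concrete: my current guess is that DDP only all-reduces parameter gradients after backward(), whereas the gather changes what the loss itself can see (negatives from the other ranks), so the two would serve different purposes. Below is the step order as I picture it, with made-up names; please correct me if this picture is wrong:

```python
import torch


def train_step_sketch(ddp_model, batch, optimizer, loss_fn, gather_fn):
    """Illustrative order of operations for one distributed step (my reading,
    with invented names), not the repo's actual training loop."""
    audio_feats, text_feats = ddp_model(batch)   # forward on the local shard only

    # The gather exchanges *features* so every rank can score its samples
    # against the other ranks' samples as negatives; DDP never shares activations.
    all_audio, all_text = gather_fn(audio_feats), gather_fn(text_feats)

    loss = loss_fn(all_audio, all_text)          # same value on every rank
    optimizer.zero_grad()
    loss.backward()                              # DDP hooks all-reduce the
    optimizer.step()                             # *parameter gradients* here
    return loss
```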