Open Yogurt928 opened 1 year ago
Dear author,

I'm wondering about the difference between gathering with and without `gather_with_grad` when collecting tensors from the different GPUs: https://github.com/LAION-AI/CLAP/blob/6b1b4b5b4b87f4e19d3836d2ae7d7272e1c69410/src/laion_clap/clap_module/loss.py#L59C11-L59C11 It looks like the returned `all_audio_features` and `all_text_features` are the same in both cases?
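To make the question concrete, here is how I read the two paths. This is only a simplified sketch of my understanding, not the exact code in `loss.py`; I'm assuming `gather_with_grad` corresponds to the differentiable `torch.distributed.nn.all_gather`, while the other branch uses the plain `torch.distributed.all_gather` and puts the local tensors back into the gathered list.

```python
# Simplified sketch of the two gather paths as I understand them
# (assumes torch.distributed is already initialized; not the exact CLAP code).
import torch
import torch.distributed as dist
import torch.distributed.nn  # differentiable collectives


def gather_features(audio_features, text_features, rank, world_size, gather_with_grad=False):
    if gather_with_grad:
        # Differentiable all_gather: gradients flow back into every rank's features.
        all_audio_features = torch.cat(torch.distributed.nn.all_gather(audio_features), dim=0)
        all_text_features = torch.cat(torch.distributed.nn.all_gather(text_features), dim=0)
    else:
        # Plain all_gather returns copies with no autograd history from the other ranks...
        gathered_audio = [torch.zeros_like(audio_features) for _ in range(world_size)]
        gathered_text = [torch.zeros_like(text_features) for _ in range(world_size)]
        dist.all_gather(gathered_audio, audio_features)
        dist.all_gather(gathered_text, text_features)
        # ...so the local slice is put back to keep a gradient path
        # through this rank's own features.
        gathered_audio[rank] = audio_features
        gathered_text[rank] = text_features
        all_audio_features = torch.cat(gathered_audio, dim=0)
        all_text_features = torch.cat(gathered_text, dim=0)
    return all_audio_features, all_text_features
```

So my guess is that the forward result is the same either way and only the backward path differs, but I'm not sure.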
A related question: during distributed training, how is the gradient actually calculated? From my reading of the code, each GPU gathers the audio and text features from all GPUs and then computes the loss and runs backward separately. Doesn't that mean duplicated computation, since the gathered features should be identical on every GPU? Please correct me if my understanding is wrong, thanks!

Moreover, the model is already wrapped in DistributedDataParallel: https://github.com/LAION-AI/CLAP/blob/6b1b4b5b4b87f4e19d3836d2ae7d7272e1c69410/src/laion_clap/training/main.py#L274 which means gradient synchronization across the different GPUs should be handled automatically. So is the feature gathering operation still necessary?
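For reference, this is what I understand each GPU to be computing after the gather, which is why it looks like duplicated work to me. Again just a simplified sketch of my reading (assuming the non-local-loss path), not the actual implementation:

```python
# Sketch of the per-rank loss I think every GPU computes on the gathered
# features (simplified; hypothetical helper, not the actual CLAP ClipLoss code).
import torch
import torch.nn.functional as F


def contrastive_loss(all_audio_features, all_text_features, logit_scale):
    # Each rank builds the same full (world_size * batch) x (world_size * batch)
    # similarity matrix from the gathered features.
    logits_per_audio = logit_scale * all_audio_features @ all_text_features.T
    logits_per_text = logits_per_audio.T
    labels = torch.arange(all_audio_features.shape[0], device=all_audio_features.device)
    return (F.cross_entropy(logits_per_audio, labels) +
            F.cross_entropy(logits_per_text, labels)) / 2
```

If every rank really computes this same matrix and DDP then averages the gradients, that is the duplication I'm asking about.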