YinAoXiong opened this issue 1 year ago
My idea is the same as yours. After debugging, I found that during the training epoch, every GPU computes the same global loss from the same sim_matrix, instead of each computing a local loss that is then gathered and averaged. There is clearly redundant computation here. I also noticed that in the function "train_epoch" there is a useless `loss.mean()` after `model.forward()` that does nothing. We only need to compute the local loss, as in https://github.com/openai/CLIP/issues/132, and call `loss.backward()`; gradient synchronization is handled automatically by DDP.
https://github.com/ArrowLuo/CLIP4Clip/blob/508ffa3de39ba0563a03199c440ab602a72e9b6f/modules/modeling.py#L400
The current code appears to compute the loss over the global similarity matrix on every GPU. Computing the loss only between local and global features, as described in https://github.com/openai/CLIP/issues/132, seems more computation- and memory-efficient. Apologies for the noise if I have misunderstood the code.
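To illustrate why the two formulations agree, here is a minimal NumPy sketch (not the repo's actual code; shapes, names, and the simulated world size are all hypothetical, and only one direction of the symmetric contrastive loss is shown). Each simulated rank scores only its local rows against the gathered columns; with equal per-rank batch sizes, the mean of the local losses equals the global-matrix loss, so the local variant gives the same gradient signal under DDP while doing a fraction of the softmax work per GPU.

```python
import numpy as np

def softmax_ce(logits, labels):
    # Row-wise cross-entropy with a numerically stable log-softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
world_size, local_bs, dim = 2, 4, 8  # hypothetical 2-GPU setup

# Per-"GPU" local text/video features (stand-ins for the model outputs).
text = [rng.normal(size=(local_bs, dim)) for _ in range(world_size)]
video = [rng.normal(size=(local_bs, dim)) for _ in range(world_size)]

# Current behavior: every replica builds the full (W*B) x (W*B) sim_matrix
# from the all-gathered features and takes cross-entropy over all rows.
all_text, all_video = np.concatenate(text), np.concatenate(video)
global_loss = softmax_ce(all_text @ all_video.T,
                         np.arange(world_size * local_bs))

# Local-loss variant (CLIP issue #132): rank r scores only its own
# local_bs rows against the gathered columns; the positive pairs sit
# on a diagonal offset by r * local_bs.
local_losses = []
for r in range(world_size):
    logits = text[r] @ all_video.T                   # (local_bs, W*B)
    labels = np.arange(local_bs) + r * local_bs      # offset diagonal
    local_losses.append(softmax_ce(logits, labels))

# With equal per-rank batches, averaging the local losses (which DDP's
# gradient all-reduce does implicitly) matches the global-matrix loss.
print(np.isclose(global_loss, np.mean(local_losses)))
```

In a real DDP run the `text[r] @ all_video.T` step would use the rank's own features against `all_gather`-ed ones (with gradients flowing only through the local chunk), and the averaging is performed implicitly by DDP's gradient all-reduce rather than explicitly.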