Sense-GVT / DeCLIP

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

What is AllGather for? Why use AllGather? #16

Closed lyccol closed 2 years ago

lyccol commented 2 years ago

https://github.com/Sense-GVT/DeCLIP/blob/9d9e25da10e2299cf0c84b6e0be1c49085565d22/prototype/model/clip.py#L136-L146

zlccccc commented 2 years ago

Since CLIP (Contrastive Language-Image Pre-training) requires a large batch size, we use all-gather during DDP (Distributed Data Parallel) training to scale up the effective batch size by synchronising features across cards. For example, when we use 128 cards with a batch size of 256 on each card, the shape of image_features (resp., text_features) on each process is [256, feature_dim], while the shape of gathered_image_features (resp., gathered_text_features) is [128*256, feature_dim]. After gradient synchronisation in the loss function, this is equivalent to training directly with a batch size of 32768. The same reasoning holds for SLIP, FILIP, DeCLIP, and DeFILIP.
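To make the shape arithmetic above concrete, here is a minimal sketch that simulates the gather step with plain Python lists. The real code linked above gathers torch tensors across DDP processes (e.g. via torch.distributed.all_gather); the numbers `world_size = 128` and `local_batch = 256` come from the example, while `feature_dim = 4` is an arbitrary small value for illustration.

```python
# Toy simulation of the all-gather step: every rank ends up with the
# features of all ranks, concatenated along the batch dimension.

def all_gather_sim(per_rank_features):
    """Mimic dist.all_gather followed by a concat along dim 0:
    flatten a list of per-rank [local_batch, feature_dim] blocks
    into one [world_size * local_batch, feature_dim] block."""
    return [row for rank_feats in per_rank_features for row in rank_feats]

world_size = 128   # number of cards, from the example above
local_batch = 256  # per-card batch size
feature_dim = 4    # kept small here; illustrative only

# Each "card" holds a [local_batch, feature_dim] block of image features.
per_rank = [[[0.0] * feature_dim for _ in range(local_batch)]
            for _ in range(world_size)]

gathered = all_gather_sim(per_rank)
# gathered has shape [world_size * local_batch, feature_dim] = [32768, 4],
# so the contrastive loss sees a global batch of 32768 negatives.
print(len(gathered))  # 32768
```

Note that in the actual DDP setting each process runs this gather itself and receives identical concatenated features, so the similarity matrix for the contrastive loss is computed against the full global batch on every card.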