Sense-GVT / DeCLIP

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

What is AllGather for? Why use AllGather? #16

Closed lyccol closed 2 years ago

lyccol commented 2 years ago

https://github.com/Sense-GVT/DeCLIP/blob/9d9e25da10e2299cf0c84b6e0be1c49085565d22/prototype/model/clip.py#L136-L146

zlccccc commented 2 years ago

Since CLIP (Contrastive Language-Image Pre-training) requires a large batch size, we use all-gather during DDP (Distributed Data Parallel) training to scale up the effective batch size by synchronising features across cards. For example, when we use 128 cards with a batch size of 256 on each card, the shape of image_features (resp., text_features) on each process is [256, feature_dim], while the shape of gathered_image_features (resp., gathered_text_features) is [128*256, feature_dim]. After gradient synchronisation in the loss function, this is equivalent to training directly with a batch size of 32768. The same reasoning holds for SLIP, FILIP, DeCLIP, and DeFILIP.
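To make the shape arithmetic above concrete, here is a minimal sketch that simulates the gather step with plain Python lists. The real code linked above gathers torch tensors across DDP processes (e.g. via torch.distributed.all_gather); the numbers `world_size = 128` and `local_batch = 256` come from the example, while `feature_dim = 4` is an arbitrary small value for illustration.

```python
# Toy simulation of the all-gather step: every rank ends up with the
# features of all ranks, concatenated along the batch dimension.

def all_gather_sim(per_rank_features):
    """Mimic dist.all_gather followed by a concat along dim 0:
    flatten a list of per-rank [local_batch, feature_dim] blocks
    into one [world_size * local_batch, feature_dim] block."""
    return [row for rank_feats in per_rank_features for row in rank_feats]

world_size = 128   # number of cards, from the example above
local_batch = 256  # per-card batch size
feature_dim = 4    # kept small here; illustrative only

# Each "card" holds a [local_batch, feature_dim] block of image features.
per_rank = [[[0.0] * feature_dim for _ in range(local_batch)]
            for _ in range(world_size)]

gathered = all_gather_sim(per_rank)
# gathered has shape [world_size * local_batch, feature_dim] = [32768, 4],
# so the contrastive loss sees a global batch of 32768 negatives.
print(len(gathered))  # 32768
```

Note that in the actual DDP setting each process runs this gather itself and receives identical concatenated features, so the similarity matrix for the contrastive loss is computed against the full global batch on every card.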