The paper states a batch size of 64 on each of the 8 GPUs. How, then, was the contrastive loss calculated?
1. Compute the loss on each GPU from its local batch of 64, backpropagate on each GPU, then aggregate gradients (e.g. ring all-reduce) and update — i.e. plain DDP.
2. Gather embeddings from all GPUs (using additional machinery such as `dist.all_gather()`), compute the loss over the global batch (64 * 8 = 512 pairs), then backpropagate.
Which of these two methods was used for your training? If neither, can you explain how many image-text pairs were used when calculating the contrastive loss?
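For reference, method (2) is usually implemented with an autograd-aware `all_gather`, since plain `dist.all_gather()` returns tensors detached from the graph. The sketch below is illustrative, not taken from the paper; the names `GatherWithGrad` and `clip_loss` are my own, and it falls back to the local batch when no process group is initialized:

```python
import torch
import torch.nn.functional as F
import torch.distributed as dist

class GatherWithGrad(torch.autograd.Function):
    """all_gather that keeps gradients flowing back to the local shard."""
    @staticmethod
    def forward(ctx, x):
        out = [torch.zeros_like(x) for _ in range(dist.get_world_size())]
        dist.all_gather(out, x)
        return torch.cat(out, dim=0)

    @staticmethod
    def backward(ctx, grad):
        # Sum gradient contributions from every rank, then keep the local
        # slice; the extra factor of world_size cancels against DDP's
        # gradient averaging.
        grad = grad.contiguous()
        dist.all_reduce(grad)
        n = grad.shape[0] // dist.get_world_size()
        r = dist.get_rank()
        return grad[r * n:(r + 1) * n]

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over whatever batch it is given."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    if dist.is_available() and dist.is_initialized():
        # Global batch: local 64 * world_size embeddings per modality.
        img_emb = GatherWithGrad.apply(img_emb)
        txt_emb = GatherWithGrad.apply(txt_emb)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.shape[0], device=logits.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

if __name__ == "__main__":
    # Single-process demo: no process group, so this is the local-batch case.
    img = torch.randn(64, 512, requires_grad=True)
    txt = torch.randn(64, 512, requires_grad=True)
    loss = clip_loss(img, txt)
    loss.backward()
    print(loss.item())
```

Method (1) needs no such machinery but gives each GPU only 64 negatives per positive, whereas (2) gives 511; the two are not equivalent for contrastive losses, which is why the question matters.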