SHI-Labs / OneFormer

OneFormer: One Transformer to Rule Universal Image Segmentation, arxiv 2022 / CVPR 2023
https://praeclarumjj3.github.io/oneformer
MIT License

A question during distributed training. #23

Closed wangbo-zhao closed 1 year ago

wangbo-zhao commented 1 year ago

Excellent work! I have a question about `loss_contrastive`. I find that if we use distributed training, the `loss_contrastive` function collects text and image features from the whole batch. I think there may be a situation like this: for a semantic segmentation task, two images A and B have the same classes, which means they have exactly the same $Q_{\text{text}}$. An object query from A should be close to its matching text query from A and far away from all other queries. But there is also a text query from B that is exactly the same as the text query from A, so the loss pushes the object query away from a feature identical to its own positive. I think this is a conflict. Hope for your reply.
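The conflict described above can be made concrete with a small sketch. This is a generic InfoNCE-style contrastive loss in NumPy, not OneFormer's actual implementation (the function name `info_nce` and the 2-D toy features are illustrative assumptions): when a duplicate text query from image B appears among the gathered negatives, the loss for A's object query acquires an irreducible floor of log 2, no matter how well the query aligns with its positive.

```python
import numpy as np

def info_nce(obj_q, txt_q, tau=0.07):
    """Toy query-text contrastive loss (InfoNCE-style, not OneFormer's exact code).
    Each object query i treats text query i as its positive and every
    other gathered text query as a negative."""
    obj = obj_q / np.linalg.norm(obj_q, axis=1, keepdims=True)
    txt = txt_q / np.linalg.norm(txt_q, axis=1, keepdims=True)
    logits = obj @ txt.T / tau                      # cosine similarities / temperature
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    idx = np.arange(len(obj_q))
    return -np.log(probs[idx, idx])                 # per-query cross-entropy

# Gathered batch: image B contributes a text query identical to image A's.
obj = np.array([[0.9, 0.1], [0.8, 0.2]])            # object queries from A and B
txt = np.array([[1.0, 0.0], [1.0, 0.0]])            # t_B == t_A: duplicate positive
losses = info_nce(obj, txt)
# Both rows are pinned at log(2) ~= 0.693: the duplicate text query is an
# equally similar "negative", so the softmax splits 50/50 and the loss
# cannot decrease, however well the object query matches its positive.
```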

praeclarumjj3 commented 1 year ago

Hi @wangbo-zhao, thanks for your interest in our work. Indeed, this conflict could happen. The chance of conflict is the same as during the training of any vision-language model like CLIP, which also samples images belonging to a shared set of classes.

To reiterate, the main motivation for the contrastive loss is to establish the differences between tasks, since the number of binary masks for the same image varies depending on the task.

Nonetheless, we uniformly sample the task to decrease the chance of such conflicts during our joint training process. Moreover, due to the random sampling of images, it is unlikely to happen over many batches. Still, it would be interesting to quantify the chance of such conflicts (which would depend on the dataset).
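Quantifying this is essentially a birthday-problem over class-set "signatures": a conflict occurs when two images in the gathered batch share exactly the same set of classes. A rough Monte Carlo sketch under the (strong) simplifying assumption that class sets are drawn uniformly; real datasets are heavily skewed, so this only illustrates the shape of the dependence on batch size and dataset diversity (the function `conflict_rate` and its parameters are hypothetical):

```python
import random

def conflict_rate(num_class_sets, batch_size, trials=10_000, seed=0):
    """Estimate P(at least two images in a gathered batch have identical
    class sets, i.e. identical text queries). Assumes each image's class
    set is drawn uniformly from `num_class_sets` possibilities -- a
    simplification; real class-set distributions are skewed."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    hits = 0
    for _ in range(trials):
        sigs = [rng.randrange(num_class_sets) for _ in range(batch_size)]
        if len(set(sigs)) < batch_size:  # any duplicate signature => conflict
            hits += 1
    return hits / trials

# With only 2 distinct class sets and 2 images per gathered batch,
# roughly half of all batches contain a conflict; with thousands of
# distinct class sets the rate becomes negligible.
```

The estimate grows quickly with the effective batch size gathered across GPUs, which is consistent with the intuition that distributed training makes such collisions more likely than single-GPU training.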

wangbo-zhao commented 1 year ago

Thanks for your explanation.