Closed wangbo-zhao closed 1 year ago
Hi @wangbo-zhao, thanks for your interest in our work. Indeed, this conflict can happen. The chance of a conflict is the same as in the training of any vision-language model such as CLIP (which samples images belonging to a set of classes).
To reiterate, the main motive of the contrastive loss is to establish the differences between tasks, as the number of binary masks for the same image varies depending on the task.
Nonetheless, we uniformly sample the task during our joint training process, which decreases the chance of such conflicts. Moreover, because images are sampled randomly, it is unlikely to happen in many batches. Still, it would be interesting to quantify the chance of such conflicts (which would depend on the dataset).
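One rough way to quantify this is a Monte Carlo estimate over randomly sampled batches. The snippet below is only a sketch under a strong assumption: that an image's text query is fully determined by the set of classes it contains, so two images with the same class set count as a conflict. The function `conflict_rate` and its parameters are hypothetical names for illustration, not part of the repository.

```python
import random


def conflict_rate(image_class_sets, batch_size, num_batches=10000, seed=0):
    """Estimate the fraction of random batches that contain at least two
    images with identical class sets (assumed to imply identical text queries)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_batches):
        # Sample a batch of images without replacement.
        batch = rng.sample(image_class_sets, batch_size)
        # A duplicate class set inside the batch counts as one conflicting batch.
        if len(set(batch)) < len(batch):
            hits += 1
    return hits / num_batches


# Toy dataset (made up for illustration): two of the four images share classes.
toy_dataset = [
    frozenset({"road", "car"}),
    frozenset({"road", "car"}),
    frozenset({"sky", "tree"}),
    frozenset({"person"}),
]
print(conflict_rate(toy_dataset, batch_size=2))
```

On a real dataset, `image_class_sets` would be built from the ground-truth annotations, and `batch_size` should be the global batch size, since features are gathered across all processes.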
Thanks for your explanation.
Excellent work! I have a question about `loss_contrastive`. I find that with distributed training, the `loss_contrastive` function collects text and image features from the whole batch. I think the following situation may arise: in a semantic segmentation task, two images A and B contain the same classes, which means that they have exactly the same $Q_{\text{text}}$. An object query from A should be close to the text query from A and far away from other queries. But there is also a text query from B, which is exactly the same as the text query from A. I think this is a conflict. Hoping for your reply.
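The situation described above can be reproduced with a toy example. This is a simplified, one-directional InfoNCE-style loss written in plain Python, not the repository's `loss_contrastive`; the helper `contrastive_loss` and the two-dimensional features are assumptions made only for illustration. Row `i` of the object queries treats text query `i` as its only positive and every other text query as a negative.

```python
import math


def contrastive_loss(object_queries, text_queries, temperature=0.1):
    """Mean cross-entropy where object query i must pick text query i."""
    losses = []
    for i, q in enumerate(object_queries):
        # Dot-product similarity between this object query and every text query.
        logits = [sum(a * b for a, b in zip(q, t)) / temperature
                  for t in text_queries]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Images A and B contain the same classes, so their text queries are identical.
text_a = [1.0, 0.0]
text_b = [1.0, 0.0]
# Image C contains different classes.
text_c = [0.0, 1.0]

# Object queries are set equal to their own text queries (perfect alignment).
loss_conflict = contrastive_loss([text_a, text_b], [text_a, text_b])
loss_distinct = contrastive_loss([text_a, text_c], [text_a, text_c])
print(loss_conflict, loss_distinct)
```

In the conflicting batch the loss cannot drop below $\log 2 \approx 0.693$ even with perfectly aligned features, because the identical text query from B is scored as a negative with the same logit as the positive. In the batch with distinct text queries the loss is close to zero.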