Questions on contrastive alignment loss

Hi! Thank you for releasing such wonderful work. I have a few questions about how the contrastive alignment loss is calculated. Specifically:

Say the image and the sentence are encoded into N image features and M text features, which are concatenated and fed into the cross-encoder, and the decoder has K object queries.

Is the contrastive loss calculated between (a) the M embeddings, out of the N + M cross-encoder outputs, that correspond to the M text tokens, and (b) the K embeddings output by the decoder before the head?
Since the image backbone produces dense features, how do you determine which image and text features belong to object o_i? (o_i as mentioned in Section 2.2.2)
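
To make the first question concrete, here is a rough sketch of how I currently picture the object-to-token direction of the loss. It is only my reading, not the repository's code: `object_to_token_loss`, `positive_map` (which text tokens count as positives for each query) and the temperature `tau` are names and values I made up for illustration, and how `positive_map` is actually obtained is exactly what the second question asks.

```python
import math

def object_to_token_loss(queries, tokens, positive_map, tau=0.1):
    """Sketch of an object-to-token contrastive term (my assumption).

    queries:      K decoder embeddings, each a list of d floats
    tokens:       M text-token embeddings, each a list of d floats
    positive_map: hypothetical dict, query index -> set of positive token indices
    tau:          placeholder temperature, not a value taken from the paper
    """
    total = 0.0
    for i, q in enumerate(queries):
        positives = positive_map.get(i, set())
        if not positives:
            continue  # in this sketch, unmatched queries contribute nothing
        # Similarity of query i to every one of the M text tokens.
        logits = [sum(a * b for a, b in zip(q, t)) / tau for t in tokens]
        # Numerically stable log-sum-exp over all M tokens.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Average negative log-softmax over the positive tokens of query i.
        total += sum(log_z - logits[j] for j in positives) / len(positives)
    return total

# Toy example: K = 1 query, M = 2 tokens, token 0 is the positive.
toy_loss = object_to_token_loss([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], {0: {0}}, tau=1.0)
```

If this picture is right, the N image-feature outputs of the cross-encoder never enter the loss directly, and only the K x M similarity matrix between decoder outputs and text-token outputs matters.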

Thank you in advance for your help!

ashkamath / mdetr

Questions on contrastive alignment loss #62