ashkamath / mdetr

Apache License 2.0

Questions on contrastive alignment loss #62

Closed (DianCh closed this issue 2 years ago)

DianCh commented 2 years ago

Hi! Thank you for releasing such wonderful work. I have a few questions about how the contrastive alignment loss is calculated. Specifically:

Say the image and the sentence are encoded into N image features and M text features, which are concatenated and fed into the cross-encoder, and the decoder has K object queries.

  1. Is the contrastive loss calculated between (a) the M embeddings, out of the N + M cross-encoder outputs, that correspond to the M text tokens, and (b) the K embeddings output by the decoder before the head?
  2. Since the image backbone produces dense features, how do you determine which image and text features belong to object o_i (the o_i mentioned in Section 2.2.2)?
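
To make question 1 concrete, here is a minimal sketch of how I currently read the loss. It is only my interpretation of Section 2.2.2, not code from the repository. The function name, the `positive_map` argument (a K x M table of 0/1 flags saying which text tokens each query should align with), and the temperature value are all assumptions for illustration. How that table is built is exactly what question 2 asks.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_alignment_loss(queries, tokens, positive_map, temperature=0.07):
    """Symmetric contrastive loss between K query and M token embeddings.

    queries:      K x d nested list (decoder outputs, assumed L2-normalised)
    tokens:       M x d nested list (text-token outputs of the cross-encoder)
    positive_map: K x M nested list of 0/1 flags (hypothetical, assumed given)
    """
    K, M = len(queries), len(tokens)
    # K x M similarity logits between every query and every text token.
    logits = [
        [sum(q * t for q, t in zip(queries[i], tokens[j])) / temperature
         for j in range(M)]
        for i in range(K)
    ]

    # Object -> token direction: softmax over the M tokens for each query.
    loss_o = 0.0
    for i in range(K):
        pos = [j for j in range(M) if positive_map[i][j]]
        if pos:  # queries with no matched tokens contribute nothing
            lsm = log_softmax(logits[i])
            loss_o += -sum(lsm[j] for j in pos) / len(pos)

    # Token -> object direction: softmax over the K queries for each token.
    loss_t = 0.0
    for j in range(M):
        pos = [i for i in range(K) if positive_map[i][j]]
        if pos:  # tokens with no matched queries contribute nothing
            lsm = log_softmax([logits[i][j] for i in range(K)])
            loss_t += -sum(lsm[i] for i in pos) / len(pos)

    # Average of the two directions.
    return (loss_o + loss_t) / 2
```

Under this reading, the N image-feature outputs of the cross-encoder never enter the loss directly. Only the K decoder outputs and the M text-token outputs do.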

Thank you in advance for your help!