facebookresearch / ImageBind

ImageBind: One Embedding Space to Bind Them All

Question regarding SelectElement(index=0) in the modality heads #106

Closed michaelnny closed 8 months ago

michaelnny commented 10 months ago

Hi,

I'm having trouble understanding why we only use index=0 for some of the modalities, for example vision.

https://github.com/facebookresearch/ImageBind/blob/c6a47d6dc2b53eced51d398c181d57049ca59286/imagebind/models/imagebind_model.py#L378-L382

From what I understand, a vision transformer breaks the image into smaller patches and then processes these patches like 'tokens'. So in theory we should have a sequence of patches, and each patch may contain different information. For example, the current model configuration would have 16x16 patches.

If we only select the first one, as seems to be the case here with SelectElement(index=0), does this mean that we discard all the remaining patches in the process? Will this have a negative impact on performance?

liuhui0401 commented 9 months ago

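I think this is because the first token is the class token. The 16*16 patches you mentioned are concatenated after the class token. Since self-attention lets the class token attend to every patch, selecting index 0 should not discard the information in the patches.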
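
To illustrate, here is a minimal pure-Python sketch of the idea. It is not the ImageBind code: the helpers `attend`, `select_element` and `encode` are hypothetical names, the embedding size of 4 is arbitrary, and the single attention step uses identity projections (Q = K = V) in place of a trained transformer. It only shows that the token at index 0 is the class token, and that its output still depends on the patches placed after it.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(tokens):
    """One toy self-attention step with identity projections (Q = K = V)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product score of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        # Weighted average of all token vectors.
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def select_element(tokens, index=0):
    """Hypothetical stand-in for SelectElement: keep one token of the sequence."""
    return tokens[index]


def encode(patches, cls_token):
    # The class token is prepended, so it sits at index 0.
    tokens = [cls_token] + patches
    tokens = attend(tokens)
    return select_element(tokens, index=0)


random.seed(0)
dim = 4                      # assumed toy embedding size
num_patches = 16 * 16        # patch count taken from the discussion above
cls_token = [0.1] * dim      # assumed stand-in for a learned class token
patches = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(num_patches)]

pooled = encode(patches, cls_token)

# Change only the last patch and encode again.
changed = [p[:] for p in patches]
changed[-1] = [5.0] * dim
pooled_changed = encode(changed, cls_token)

print(len([cls_token] + patches))  # 257 tokens: 1 class token + 256 patches
print(max(abs(a - b) for a, b in zip(pooled, pooled_changed)) > 1e-6)
```

Changing a single patch changes the vector read out at index 0, so in this toy setting the remaining patches are not ignored; they reach the output through attention rather than through the final selection.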