Question regarding SelectElement(index=0) in the modality heads

Hi,

I'm having trouble understanding why we only use index=0 for some of the modalities, for example vision.

https://github.com/facebookresearch/ImageBind/blob/c6a47d6dc2b53eced51d398c181d57049ca59286/imagebind/models/imagebind_model.py#L378-L382

From what I understand, a vision transformer breaks the image into smaller patches and then processes these patches like 'tokens'. So in theory we should have a sequence of patches, and each patch may contain different information. For example, the current model configuration would have 16x16 patches.

If we only select the first one, as seems to be the case here with SelectElement(index=0), does this mean that we discard all the remaining patches in the process? Would this have a negative impact on performance?
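
To make the question concrete, here is a minimal sketch of what I think the selection does. It assumes the input to the head has shape (batch, tokens, dim) and that the selection picks one position along the token axis. `select_element` and the toy sizes are hypothetical stand-ins written in plain Python, not the ImageBind code.

```python
# Hypothetical stand-in for the selection step; NOT the ImageBind implementation.
# Assumption: the head receives a (batch, tokens, dim) sequence.
def select_element(tokens, index=0):
    """Keep a single position along the token axis of a nested list."""
    return [sample[index] for sample in tokens]


batch, num_tokens, dim = 2, 4, 3  # toy sizes, chosen only for illustration

# tokens[b][t] is the dim-sized vector at position t of sample b.
tokens = [
    [[float(b * 100 + t * 10 + d) for d in range(dim)] for t in range(num_tokens)]
    for b in range(batch)
]

selected = select_element(tokens, index=0)

# The token axis is gone: (2, 4, 3) becomes (2, 3).
print(len(selected), len(selected[0]))
# Only the vector at position 0 of each sample survives this step.
print(selected[0])
```

In this toy version, the vectors at positions 1 to 3 are not passed on by the selection itself, which is the behaviour I am asking about.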

