Hi,
I'm having trouble understanding why we only use `index=0` for some of the modalities, for example vision:
https://github.com/facebookresearch/ImageBind/blob/c6a47d6dc2b53eced51d398c181d57049ca59286/imagebind/models/imagebind_model.py#L378-L382
From what I understand, a vision transformer breaks the image into smaller patches and then processes these patches like 'tokens'. So in theory we should have a sequence of patches, and each patch may contain different information. For example, the current model configuration would have `16x16` patches.
If we only select the first one, as seems to be the case here with `SelectElement(index=0)`, does this mean that we discard all the remaining patches in the process? Will this have a negative impact on performance?
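To make the question concrete, here is a minimal sketch of what I think is happening. It is not ImageBind's code: `num_patches` and `select_element` are hypothetical helpers, the 224x224 input size and the reading of `16x16` as a patch size in pixels are assumptions, and so is the class token placed at position 0 (as in a standard ViT).

```python
# Hypothetical illustration only, not ImageBind's actual implementation.

def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def select_element(tokens, index):
    """Keep a single position of the token sequence (stand-in for SelectElement)."""
    return tokens[index]


# Assumed sizes: 224x224 image, 16x16-pixel patches.
n = num_patches(224, 16)
patch_tokens = [f"patch_{i}" for i in range(n)]

# Assumption: a class token is prepended, so it sits at index 0.
tokens = ["cls"] + patch_tokens

selected = select_element(tokens, 0)
print(len(tokens), selected)
```

If that assumption holds, `index=0` would pick the class token rather than the first image patch, and the other positions would only contribute through attention in the earlier layers. If there is no such token, index 0 would be a single patch, which is the case I'm worried about.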