Efficient-Large-Model / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
Apache License 2.0

Hi, have you compared S2 with [384, 768] scales versus interpolating to 768x768? #46

Open OpenJarvisAI opened 1 month ago

OpenJarvisAI commented 1 month ago

The way you are using S2 actually feeds 5 images into the ViT.

How does that compare with interpolating the input to 768x768, which is equivalent to sending 4 images into the ViT, just in a different manner?

bfshi commented 1 month ago

Good point. In the paper we compare S2 against directly extracting features from the larger image without splitting (Table 12), and it turns out the latter is much less efficient and performs worse than S2.
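To make the tradeoff concrete, here is a minimal numpy sketch (illustrative only, not VILA's actual code) of how S2 with scales [384, 768] produces 5 base-resolution inputs: the whole image resized to 384, plus the 768x768 image split into four 384x384 crops. Each forward pass stays at the ViT's native 384 resolution, whereas running the ViT directly at 768 quadruples the token count in a single attention pass.

```python
import numpy as np

def s2_style_inputs(img, base=384):
    """Illustrative sketch (not VILA's code): build the multi-scale
    inputs S2 would feed to a base-resolution ViT for scales [384, 768]."""
    H, W, C = img.shape  # assume a 768x768x3 image, H and W divisible by base
    # Scale 1: downsize the whole image to the base resolution
    # (nearest-neighbor subsampling here, just to keep the sketch dependency-free).
    small = img[::H // base, ::W // base]
    # Scale 2: split the 768x768 image into four 384x384 crops.
    crops = [img[i:i + base, j:j + base]
             for i in range(0, H, base)
             for j in range(0, W, base)]
    # Total: 1 + 4 = 5 base-resolution images through the ViT.
    return [small] + crops

batch = s2_style_inputs(np.zeros((768, 768, 3)))
print(len(batch), batch[0].shape)  # 5 (384, 384, 3)
```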

lucasjinreal commented 1 month ago

Hi, it looks like the comparison is on a segmentation task; was it also compared on LLaVA tasks? Also, what's the most effective way to reduce the final number of tokens when using S2?

bfshi commented 1 month ago

Hi, yeah, the paper only compares on segmentation. S2 uses average pooling to resize the large-scale feature map to the regular size. To further reduce the number of tokens, you can use mlp_downsample here
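As a rough sketch of the pooling step described above (illustrative shapes, not S2's actual implementation): the feature map extracted from the large-scale input has a bigger token grid, and average pooling shrinks it back to the regular-scale grid so the two scales can be fused.

```python
import numpy as np

def avg_pool_tokens(feat, out_hw):
    """Sketch of S2-style average pooling: shrink a large-scale token
    grid (h, w, d) down to the regular grid size (out_h, out_w).
    Assumes h and w are divisible by the target sizes."""
    h, w, d = feat.shape
    oh, ow = out_hw
    fh, fw = h // oh, w // ow  # pooling factors
    # Group the grid into (fh x fw) blocks of tokens and average each block.
    return feat.reshape(oh, fh, ow, fw, d).mean(axis=(1, 3))

# E.g. a 54x54 token grid from the large scale pooled to a 27x27 regular grid.
big = np.random.rand(54, 54, 64)
small = avg_pool_tokens(big, (27, 27))
print(small.shape)  # (27, 27, 64)
```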

lucasjinreal commented 1 month ago

I saw the code only has an mlp_downsample, and the ViT outputs aren't changed. Is the avg pooling you mentioned the mlp_downsample?

Or is it specifically the flat_square mentioned for avg_pool?

bfshi commented 1 month ago

Hi, mlp_downsample will concatenate each adjacent 2x2 group of tokens into a single token. The avg pooling is implemented inside S2: S2 pools the feature map of a large-scale image down to the smaller size that corresponds to a regular-size image. See the code here
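The 2x2 concatenation described above can be sketched as follows (illustrative only; in VILA an MLP then projects the concatenated channels back down, which is omitted here):

```python
import numpy as np

def concat_2x2_tokens(tokens, h, w):
    """Sketch of the mlp_downsample idea: merge every adjacent 2x2 group
    of tokens into one token by concatenating along the channel dim.
    Yields 4x fewer tokens, each 4x wider; an MLP projection would follow."""
    d = tokens.shape[-1]
    grid = tokens.reshape(h, w, d)
    # Regroup the grid into 2x2 neighborhoods...
    merged = grid.reshape(h // 2, 2, w // 2, 2, d)
    # ...and stack each neighborhood's 4 tokens along the channel axis.
    merged = merged.transpose(0, 2, 1, 3, 4).reshape(h // 2, w // 2, 4 * d)
    return merged.reshape(-1, 4 * d)

tokens = np.random.rand(24 * 24, 64)       # 576 tokens from a 24x24 grid
out = concat_2x2_tokens(tokens, 24, 24)
print(out.shape)  # (144, 256): 4x fewer tokens, 4x the channels
```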

lucasjinreal commented 1 month ago

@bfshi Hi, does this mean that in S2, if the input scales are [1x, 2x, 3x], only the 2x and 3x feature maps are interpolated down to the 1x size to get a normal output size? But from what I can see, the 2x and 3x inputs just have a larger batch size; the input resolution is actually the same as the original.