OpenJarvisAI opened 1 month ago
Good point. In the paper we compare s2 against directly extracting features from the larger image without splitting (Table 12), and it turns out the latter is much less efficient and performs worse than s2.
Hi, it looks like that comparison is on a segmentation task; is there a comparison on LLaVA tasks as well? Also, what's the most effective way to reduce the final number of tokens when using s2?
Hi, yeah, the paper only compares on segmentation. S2 uses avg pooling to resize the large-scale feature map to the regular size. To further reduce the number of tokens, you can use mlp_downsample here.
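As a rough illustration of what a 2x2 token downsampler does (a minimal sketch; the function name, tensor layout, and sizes here are my own assumptions, not the repo's actual code): each 2x2 neighborhood of ViT tokens is folded into one token with 4x the channels, which an MLP would then project back down.

```python
import torch

def merge_2x2_tokens(x: torch.Tensor) -> torch.Tensor:
    """Concatenate each 2x2 neighborhood of ViT tokens into one token.

    x: [B, H, W, C] spatial grid of tokens (H and W assumed even).
    Returns: [B, H//2, W//2, 4*C], i.e. 4x fewer tokens, 4x channels.
    An MLP projection would normally map 4*C back to the LLM dimension.
    """
    B, H, W, C = x.shape
    x = x.reshape(B, H // 2, 2, W // 2, 2, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, H // 2, W // 2, 4 * C)
    return x

tokens = torch.randn(1, 24, 24, 1024)   # e.g. a 24x24 ViT token grid
merged = merge_2x2_tokens(tokens)
print(merged.shape)  # torch.Size([1, 12, 12, 4096])
```

This cuts the token count by 4x before the features ever reach the language model.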
I saw the code only has mlp_downsample, and the ViT outputs are unchanged. Is the avg pooling you mentioned the mlp_downsample? Or is it specifically the flat_square one mentioned for avg_pool?
Hi, mlp_downsample will concatenate each group of adjacent 2x2 tokens into a single token. The avg pooling is implemented inside S2: S2 pools the feature map of a large-scale image down to a smaller size that corresponds to a regular-size image. See the code here.
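A minimal sketch of that pooling step (shapes and channel counts are assumptions for illustration, not the actual S2 code): the re-assembled feature map of the large-scale image is average-pooled back to the regular grid, so every scale contributes the same number of output tokens.

```python
import torch
import torch.nn.functional as F

B, C = 1, 1024
regular = torch.randn(B, C, 24, 24)  # features of the regular-size image
large = torch.randn(B, C, 48, 48)    # re-assembled features of the 2x image

# S2-style pooling: shrink the large-scale map back to the regular grid,
# then fuse it (here by channel concatenation) with the regular features.
pooled = F.adaptive_avg_pool2d(large, output_size=regular.shape[-2:])
fused = torch.cat([regular, pooled], dim=1)  # [B, 2*C, 24, 24]
print(pooled.shape, fused.shape)
```

This is why the final token count stays the same as the single-scale baseline: only the channel dimension grows with the number of scales.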
@bfshi Hi, does that mean that in S2, if the input scales are [1x, 2x, 3x], the 2x and 3x feature maps are interpolated down to 1x to get a normal output size? From what I can see, the 2x and 3x inputs just have a larger batch size; the input resolution to the ViT is actually the same as the original.
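That reading matches how multi-scale splitting generally works: the 2x image is cut into regular-size crops that are stacked along the batch dimension, so the ViT itself always sees its native resolution. A sketch under assumed sizes (336x336 crops here are hypothetical, not necessarily the repo's defaults):

```python
import torch

def split_to_crops(img: torch.Tensor, crop: int) -> torch.Tensor:
    """Split [B, C, H, W] into non-overlapping crop x crop tiles,
    stacked along the batch dimension (H, W assumed divisible by crop)."""
    B, C, H, W = img.shape
    x = img.reshape(B, C, H // crop, crop, W // crop, crop)
    x = x.permute(0, 2, 4, 1, 3, 5).reshape(-1, C, crop, crop)
    return x

img_2x = torch.randn(1, 3, 672, 672)  # the 2x-scale image
crops = split_to_crops(img_2x, 336)   # four 336x336 crops, batched
print(crops.shape)  # torch.Size([4, 3, 336, 336])
```

So the larger scales increase the effective batch size of the ViT forward pass, not its input resolution.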
The way you use it actually feeds 5 images into the ViT. How does that compare with interpolating to 768x768, which is equivalent to sending 4 images into the ViT but in a different manner?
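The trade-off in this question can be made concrete with rough token arithmetic (a sketch under assumed numbers: a hypothetical ViT with 384x384 input and patch size 16; the actual model's sizes may differ). The paper's Table 12 comparison mentioned above addresses the accuracy side; this only counts tokens.

```python
# Hypothetical ViT: 384x384 input, patch size 16.
patch = 16
tokens_384 = (384 // patch) ** 2  # tokens per regular-size forward

# S2 with scales [1x, 2x]: the 384 image plus four 384 crops of the
# 768 image -> 5 forwards, each over a short 576-token sequence.
s2_tokens = 5 * tokens_384

# Interpolating the input to 768x768 instead: one forward over a single
# long sequence; attention cost grows quadratically in sequence length,
# so this forward is far more expensive despite fewer total tokens.
interp_tokens = (768 // patch) ** 2

print(s2_tokens, interp_tokens)  # 2880 2304
```

So splitting processes slightly more tokens in total, but in short parallel sequences at the ViT's native resolution, whereas interpolation requires one long sequence plus positional-embedding interpolation outside the ViT's training distribution.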