bfshi / scaling_on_scales

When do we not need larger vision models?
MIT License

Hello, nice work! Just wondering: if S2 is used in LLaVA, can the ViT be trained? #9

Open MonolithFoundation opened 2 months ago

MonolithFoundation commented 2 months ago

I'm curious: if S2 is used in LLaVA, can we unfreeze the ViT for training?

Given the way S2 handles inputs, it seems the result might not be very good if ViT training is unfrozen. What do you think? Have you run any experiments on this?

bfshi commented 2 months ago

Hi,

In the paper the ViT is frozen. I've also tried unfreezing ViT and it's slightly better (which of course depends on your training data). I don't think S2 will have any negative effect when unfreezing ViT.

MonolithFoundation commented 2 months ago

Hi, can you share how much better it could be?

Also, if the inputs are extended to video, how can we handle that?


bfshi commented 2 months ago

Hi. Unfreezing the ViT can give a 2-3% improvement on some benchmarks, but it can also hurt others. The improvement seems larger for smaller LLMs.

For videos, you can just sample a few frames and extract features with S2 on each frame separately. If the total context length gets too large, you can try pooling each frame's features.
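A minimal sketch of that per-frame scheme, for reference. Note this is an illustration, not the s2wrapper API: `extract_s2_features` is a placeholder standing in for a real multi-scale ViT forward pass, and the shapes are made up.

```python
import numpy as np

def extract_s2_features(frame: np.ndarray, num_tokens: int = 16, dim: int = 8) -> np.ndarray:
    """Placeholder for a multi-scale (S2) ViT forward pass on one frame.

    A real implementation would run the ViT at several image scales and
    concatenate token features channel-wise; here we just return a
    deterministic dummy (num_tokens, dim) token grid."""
    rng = np.random.default_rng(int(frame.sum()) % (2**32))
    return rng.standard_normal((num_tokens, dim))

def video_features(video: np.ndarray, num_frames: int = 4, pool: bool = True) -> np.ndarray:
    """Uniformly sample frames, run S2 on each frame separately, and
    optionally mean-pool each frame's tokens to keep context length small."""
    idx = np.linspace(0, len(video) - 1, num_frames).astype(int)
    per_frame = [extract_s2_features(video[i]) for i in idx]  # num_frames x (T, D)
    if pool:
        # One pooled token per frame -> (num_frames, D)
        return np.stack([f.mean(axis=0) for f in per_frame])
    # Full token sequence -> (num_frames * T, D)
    return np.concatenate(per_frame, axis=0)

video = np.zeros((32, 224, 224, 3))                      # 32 dummy frames
pooled = video_features(video, num_frames=4)             # shape (4, 8)
full = video_features(video, num_frames=4, pool=False)   # shape (64, 8)
```

The pooled variant trades per-frame spatial detail for a much shorter sequence, which is usually the right call when many frames must fit in the LLM context.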