Open MonolithFoundation opened 6 months ago
Hi,
In the paper the ViT is frozen. I've also tried unfreezing the ViT and it's slightly better (which of course depends on your training data). I don't think S2 will have any negative effect when unfreezing the ViT.
Hi, can you share how much better it could be?
Also, if the inputs are extended to video, how can we handle that?
Hi. Unfreezing the ViT can give a 2-3% improvement on some benchmarks, but probably has negative effects on others. The improvement seems larger for smaller LLMs.
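For reference, "unfreezing the ViT" just means re-enabling gradients on the vision tower's parameters. A minimal PyTorch sketch, where `vision_tower` and `llm` are placeholder module names standing in for the real LLaVA components:

```python
import torch.nn as nn

# Toy LLaVA-style model; the real vision tower would be a CLIP ViT.
class TinyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(8, 8)  # stand-in for the ViT
        self.llm = nn.Linear(8, 8)           # stand-in for the language model

model = TinyVLM()

# Default LLaVA-style recipe: freeze the ViT during instruction tuning.
for p in model.vision_tower.parameters():
    p.requires_grad = False

# To unfreeze, re-enable gradients (often trained with a smaller LR):
for p in model.vision_tower.parameters():
    p.requires_grad = True

print(all(p.requires_grad for p in model.vision_tower.parameters()))  # True
```

In practice you would also give the unfrozen ViT its own (lower) learning rate in the optimizer's parameter groups.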
For videos, you can just sample a few frames and extract features with S2 on each frame separately. You can try pooling each frame's features if the total context length is too large.
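The per-frame approach above can be sketched as follows. This is an illustration, not the repo's code: `frame_extractor` is a placeholder for running S2 (e.g. the multiscale forward over the ViT) on a batch of frames, and the shapes (576 tokens, dim 1024) are assumptions:

```python
import torch

def video_features(frames, frame_extractor, pool_tokens=True):
    """Per-frame S2 feature extraction for video.
    frames: (T, C, H, W) tensor of T sampled frames.
    frame_extractor: maps a frame batch to (T, N_tokens, D) features;
    in practice this would run S2 on the ViT, here it is a stand-in."""
    feats = frame_extractor(frames)                # (T, N, D)
    if pool_tokens:
        # average-pool tokens within each frame to shrink context length
        feats = feats.mean(dim=1, keepdim=True)    # (T, 1, D)
    return feats.flatten(0, 1)                     # token sequence for the LLM

# toy stand-in extractor: 576 tokens of dim 1024 per frame
toy = lambda x: torch.randn(x.shape[0], 576, 1024)
frames = torch.randn(8, 3, 336, 336)               # 8 sampled frames
print(video_features(frames, toy).shape)                     # (8, 1024)
print(video_features(frames, toy, pool_tokens=False).shape)  # (4608, 1024)
```

With pooling, each frame contributes a single token, so 8 frames cost only 8 positions of LLM context instead of 8 × 576.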
I am curious: if S2 is used in LLaVA, can we unfreeze the ViT for training?
It seems that, given the way S2 handles inputs, unfreezing the ViT during training might not give good results. What do you think? Any experiments on this?