Question about Visual Encoder

42Shawn / LLaVA-PruMerge

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

Apache License 2.0


Question about Visual Encoder #10

Open xiaokj37 opened 1 month ago

xiaokj37 commented 1 month ago

Thank you very much for open-sourcing the code of LLaVA-PruMerge. I have cloned it and confirmed its excellent performance. I would like to ask whether the Visual Encoder is frozen in your implementation, or whether I can customize it to use other pretrained encoder weights. I would be very grateful for your reply.

42Shawn commented 1 month ago

Thank you for your recognition of our work. Yes, you can apply our method to other LMMs with different visual encoders (but they must be ViT-based).

xiaokj37 commented 1 month ago

Thank you for your reply. I would also like to ask about the part related to Token Reduction. I found that the shape of image_features after Token Reduction differs from image to image. How does the subsequent projection handle this inconsistent input size?
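
To make the question concrete, here is a minimal sketch of one common way to handle a varying token count. It is not confirmed by the authors as what this repository does. It assumes the projector is applied to each token independently (as a linear or MLP projector would be), so the number of tokens does not have to be fixed, and that sequences are padded only when they need to be stacked into a batch. NumPy stands in for the real framework, and the names `project` and `pad_batch`, the dimensions, and the token counts 32 and 57 are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small made-up sizes, chosen only to keep the example light.
VISION_DIM, LLM_DIM = 16, 32

# Hypothetical two-layer per-token projector (stand-in for an MLP projector).
W1 = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02
W2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02


def project(tokens):
    """Map (num_tokens, VISION_DIM) to (num_tokens, LLM_DIM), token by token."""
    hidden = np.maximum(tokens @ W1, 0.0)  # ReLU used here for brevity
    return hidden @ W2


def pad_batch(seqs):
    """Right-pad variable-length sequences with zeros; also return a validity mask."""
    max_len = max(s.shape[0] for s in seqs)
    batch = np.zeros((len(seqs), max_len, seqs[0].shape[1]))
    mask = np.zeros((len(seqs), max_len), dtype=bool)
    for i, s in enumerate(seqs):
        batch[i, : s.shape[0]] = s
        mask[i, : s.shape[0]] = True
    return batch, mask


# Two images whose reduced token counts differ (made-up numbers).
image_features = [
    rng.standard_normal((32, VISION_DIM)),
    rng.standard_normal((57, VISION_DIM)),
]

# The projection works per image, whatever the token count is.
projected = [project(f) for f in image_features]

# Padding is needed only if the results must share one tensor.
batch, mask = pad_batch(projected)
print([p.shape for p in projected], batch.shape, mask.sum(axis=1))
```

Under these assumptions the matrix product only touches the last (feature) dimension, so inconsistent shapes matter only when several images are stacked into one array; the mask then marks which positions are real tokens.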