Open xiaokj37 opened 1 month ago
Thank you for your interest in our work. Yes, you can apply our method to other LMMs with different visual encoders, as long as the encoder is ViT-based.
Thank you for your reply. I would also like to ask about the Token Reduction part. I found that the shape of image_features after Token Reduction differs from image to image. How do you handle the varying input size in the subsequent projection?
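For anyone reading along, here is a general observation rather than a description of this repository's code: a projector that is a linear layer or MLP applied to each token separately does not depend on how many tokens there are, so a varying token count only matters when images are batched together. The sketch below illustrates that idea under those assumptions. Plain Python lists stand in for tensors so it runs without dependencies, and `project` and `pad_batch` are hypothetical helper names, not functions from the repository.

```python
import random


def project(tokens, weight, bias):
    """Apply the same linear map to every token (row) independently.

    tokens: list of N vectors of length in_dim (N may differ per image)
    weight: out_dim x in_dim matrix, bias: vector of length out_dim
    """
    return [
        [sum(x * w for x, w in zip(tok, row)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


def pad_batch(seqs, dim):
    """Zero-pad variable-length token sequences to a common length.

    Returns the padded batch and a mask (1 = real token, 0 = padding).
    """
    max_len = max(len(s) for s in seqs)
    padded, mask = [], []
    for s in seqs:
        n_pad = max_len - len(s)
        padded.append(s + [[0.0] * dim for _ in range(n_pad)])
        mask.append([1] * len(s) + [0] * n_pad)
    return padded, mask


random.seed(0)
in_dim, out_dim = 4, 3  # toy sizes, chosen only for illustration
weight = [[random.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
bias = [0.0] * out_dim

# Two images that kept a different number of tokens after reduction.
image_a = [[random.random() for _ in range(in_dim)] for _ in range(5)]
image_b = [[random.random() for _ in range(in_dim)] for _ in range(2)]

# The projection works on either sequence; the token count is preserved.
proj_a = project(image_a, weight, bias)
proj_b = project(image_b, weight, bias)

# Padding plus a mask is one common way to batch the projected sequences.
batch, mask = pad_batch([proj_a, proj_b], out_dim)
```

Whether this repository pads, processes images one at a time, or does something else is a question for the authors. The sketch only shows why a per-token projection by itself places no constraint on the number of tokens.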
Thank you very much for open-sourcing the code of LLaVA-PruMerge. I have cloned it and found that it performs very well. I would like to ask whether the visual encoder is frozen in your implementation, or whether I can customize it to use other pretrained encoder weights. I would be very grateful for your reply.