Closed by kenxxxxx 4 months ago
Hi, thank you for your question.
For the InstructBLIP and SEED-LLaMA integration, we train ViT-Lens with EVA-CLIP-g/14 (embedding dim: 1408), which is a different backbone from ViT-Lens-L (based on ViT-Large). Since this is the same ViT that InstructBLIP and SEED-LLaMA use, we plug the Lens and modality module in directly before the ViT layers.
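To illustrate why a matching backbone allows a direct plug-in, here is a minimal shape-bookkeeping sketch. It is not the ViT-Lens code: `Stage`, `check_chain`, the stage names, and `RAW_DIM` are hypothetical, and only the 1408 width comes from the reply above. The idea is that tokens leaving the Lens must already have the width the frozen ViT layers expect, otherwise an extra projection would be needed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    in_dim: int   # token width the stage accepts
    out_dim: int  # token width the stage emits


def check_chain(stages):
    """Return the final token width, or raise if adjacent stages disagree."""
    for prev, nxt in zip(stages, stages[1:]):
        if prev.out_dim != nxt.in_dim:
            raise ValueError(
                f"{prev.name} emits width {prev.out_dim}, "
                f"but {nxt.name} expects {nxt.in_dim}"
            )
    return stages[-1].out_dim


EVA_G_WIDTH = 1408  # embedding dim of EVA-CLIP-g/14, as stated in the reply
RAW_DIM = 6         # hypothetical raw feature size (e.g. xyz + rgb per point)

# Assumed layout: modality embedding -> Lens -> frozen ViT layers.
pipeline = [
    Stage("modality_embedding", RAW_DIM, EVA_G_WIDTH),
    Stage("lens", EVA_G_WIDTH, EVA_G_WIDTH),
    Stage("frozen_vit_layers", EVA_G_WIDTH, EVA_G_WIDTH),
]
final_width = check_chain(pipeline)
```

A Lens trained against a backbone of a different width would fail this check when placed in front of the 1408-wide ViT layers, which is the reason for training a separate ViT-Lens on EVA-CLIP-g/14.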
Thanks for your answer! Do you plan to upload the models used for integration?
Yes. Some checkpoints (3D) are currently available on Hugging Face. I am cleaning up the code for the integration pipeline and, given my limited bandwidth, plan to release it within one month.
Okay thanks!
Nice work! Is there any update on the code release?
The tensor output by ViT-Lens is 1×768 for each modality input, right? If so, where in InstructBLIP should I plug it in? Thanks!