TencentARC / ViT-Lens

[CVPR 2024] ViT-Lens: Towards Omni-modal Representations
https://ailab-cvc.github.io/seed/vitlens/

plug in problem #11

Closed kenxxxxx closed 4 months ago

kenxxxxx commented 4 months ago

The tensor output by ViT-Lens is 1×768 for each modality input, right? If so, where in InstructBLIP should I plug it in? Thanks!

StanLei52 commented 4 months ago

Hi, thank you for your question.

For the InstructBLIP and SEED-LLaMA integration, we train ViT-Lens with EVA-CLIP-g/14 (embedding dim: 1408), which is different from ViT-Lens-L (based on ViT-Large). Since this is the same ViT that InstructBLIP and SEED-LLaMA use, we plug the Lens and the modality module in directly, before the ViT layers.
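
One way to picture the answer above is a width check over the stage order it describes. This is an illustrative sketch only, not the authors' code: the `Stage` class, the stage names, and both helper functions are hypothetical. The only numbers taken from the thread are 1408 (EVA-CLIP-g/14) and 768 (the output size mentioned in the question).

```python
from dataclasses import dataclass
from typing import List, Optional

EVA_CLIP_G_WIDTH = 1408  # embedding dim stated in the thread for EVA-CLIP-g/14
VIT_LENS_L_OUT = 768     # per-input output size mentioned in the question


@dataclass(frozen=True)
class Stage:
    """Hypothetical description of one module: its name and token widths."""
    name: str
    in_dim: Optional[int]  # None means raw modality input with no fixed width
    out_dim: int


def integration_stages(width: int = EVA_CLIP_G_WIDTH) -> List[Stage]:
    """Assumed stage order: the new modules sit before the shared ViT layers."""
    return [
        Stage("modality_embedding", None, width),  # raw input -> tokens
        Stage("lens", width, width),               # tokens -> ViT input space
        Stage("vit_layers", width, width),         # same ViT as InstructBLIP
    ]


def output_width(stages: List[Stage]) -> int:
    """Check that consecutive stages agree on width; return the final width."""
    for prev, nxt in zip(stages, stages[1:]):
        if nxt.in_dim is not None and prev.out_dim != nxt.in_dim:
            raise ValueError(
                f"{prev.name} -> {nxt.name}: {prev.out_dim} != {nxt.in_dim}"
            )
    return stages[-1].out_dim


def matches_instructblip_vit(width: int) -> bool:
    """True if features have the width of the EVA-CLIP-g/14 ViT named above."""
    return width == EVA_CLIP_G_WIDTH


if __name__ == "__main__":
    print(matches_instructblip_vit(output_width(integration_stages())))
    print(matches_instructblip_vit(VIT_LENS_L_OUT))
```

Read this way, the 768-dim ViT-Lens-L output is not what gets plugged into InstructBLIP. The integration variant instead produces features at the ViT's own width, so the rest of InstructBLIP can consume them as it would image features.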

kenxxxxx commented 4 months ago

Thanks for your answer! Do you plan to upload the models used for integration?

StanLei52 commented 4 months ago

Yes. Some checkpoints (3D) are currently available on Hugging Face. I am cleaning up the code for the integration pipeline and, given limited bandwidth, plan to release it within one month.

kenxxxxx commented 4 months ago

Okay thanks!

cfeng16 commented 3 months ago

Nice work! Is there any update on the release of the code?