AILab-CVC / SEED

Official implementation of SEED-LLaMA (ICLR 2024).
https://ailab-cvc.github.io/seed

Questions about EVA-CLIP-G used in SEED. #18

Closed Haochen-Wang409 closed 8 months ago

Haochen-Wang409 commented 10 months ago

Impressive work!

I noticed that SEED uses a visual encoder pre-trained with EVA-CLIP-G. The original EVA-CLIP-G has 40 blocks, but SEED omits the last block (https://github.com/AILab-CVC/SEED/blob/main/models/seed_qformer/eva_vit.py#L467). Is there any special consideration behind this?

Haochen-Wang409 commented 10 months ago

I have downloaded the pre-trained backbone from https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth and found that it contains 40 blocks.
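For reference, here is a minimal sketch of how one might verify the block count in that checkpoint, assuming the file is a plain PyTorch state dict whose keys follow the usual `blocks.<idx>.` naming of EVA/timm vision transformers:

```python
# Minimal sketch: count the transformer blocks stored in eva_vit_g.pth.
# Assumes the file is a plain state dict with "blocks.<idx>." key prefixes.
import re
import torch

state_dict = torch.load("eva_vit_g.pth", map_location="cpu")

block_ids = {
    int(m.group(1))
    for key in state_dict
    if (m := re.match(r"blocks\.(\d+)\.", key)) is not None
}
print(f"blocks in checkpoint: {len(block_ids)}")  # 40 for this checkpoint
```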

sijeh commented 9 months ago

Sorry for the late reply. Following the settings of BLIP-2, we use the penultimate layer's features from EVA-CLIP-G, so there are only 39 blocks.
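For anyone following along, below is a minimal, self-contained sketch of this penultimate-layer convention, using a toy transformer so it runs standalone. The names here are illustrative, not SEED's actual API; the idea matches the linked eva_vit.py: build the encoder with one block fewer than the released 40-block checkpoint and load the weights non-strictly, so only the dropped last block's parameters are ignored.

```python
# Toy illustration of keeping only the penultimate layer's features:
# instantiate the encoder with depth=39 and load 40-block weights non-strictly.
import torch
import torch.nn as nn


class ToyViT(nn.Module):
    def __init__(self, embed_dim: int = 32, depth: int = 40) -> None:
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for blk in self.blocks:
            x = blk(x)
        return x


full = ToyViT(depth=40)         # mirrors the released 40-block checkpoint
penultimate = ToyViT(depth=39)  # mirrors SEED / BLIP-2: last block dropped

# Loading non-strictly discards only the weights of the 40th block (blocks.39.*).
missing, unexpected = penultimate.load_state_dict(full.state_dict(), strict=False)
print(sorted({k.split(".")[1] for k in unexpected}))  # ['39']
```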