Closed · Haochen-Wang409 closed this 8 months ago
I have downloaded the pre-trained backbone from https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth and found that it contains 40 blocks.
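For reference, a minimal sketch to reproduce that count, assuming the file stores a flat state dict with keys like `blocks.{i}....` (if a release wraps the weights under a `model` or `module` key, index into that first):

```python
import re
import torch

# Load the EVA-CLIP-g checkpoint (assumed here to be a plain state_dict).
state_dict = torch.load("eva_vit_g.pth", map_location="cpu")

# Count the distinct transformer block indices from keys
# such as "blocks.39.mlp.fc1.weight".
block_ids = {
    int(m.group(1))
    for k in state_dict
    if (m := re.match(r"blocks\.(\d+)\.", k))
}
print(f"number of blocks: {len(block_ids)}")  # expected: 40 (indices 0-39)
```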
Sorry for the late reply. Following the settings of BLIP-2, we use the penultimate-layer feature of EVA-CLIP-g, so there are only 39 layers.
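To illustrate why this works, here is a toy sketch (not the actual SEED/EVA code; `TinyViT` and its `Linear` "blocks" are simplified placeholders): instantiating the trunk with one block fewer and loading the 40-block checkpoint with `strict=False` produces exactly the penultimate-layer output of the full model.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Toy stand-in for the EVA ViT trunk: a stack of blocks.
    Illustrative only; real blocks are attention + MLP, not a Linear."""
    def __init__(self, depth: int, dim: int = 32):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

full = TinyViT(depth=40)   # stands in for the 40-block checkpoint

# Build the truncated encoder with 39 blocks and copy the matching weights;
# the last block's weights (blocks.39.*) are simply left unused.
trunc = TinyViT(depth=39)
missing, unexpected = trunc.load_state_dict(full.state_dict(), strict=False)
print(unexpected)  # ['blocks.39.weight', 'blocks.39.bias']

x = torch.randn(2, 32)
# Penultimate-layer feature of the full model == output of the truncated one.
pen = x
for blk in full.blocks[:-1]:
    pen = blk(pen)
assert torch.allclose(pen, trunc(x))
```

Dropping the last block at construction time, rather than running it and discarding its output, also saves the compute and memory of that block.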
Impressive work!
I noticed that SEED uses a visual encoder pre-trained with EVA-CLIP-G. The original EVA-CLIP-G has 40 blocks, but SEED omits the last block (https://github.com/AILab-CVC/SEED/blob/main/models/seed_qformer/eva_vit.py#L467). Is there any special consideration behind this?