dvlab-research / LLaMA-VID

Official Implementation for LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Apache License 2.0

is eva_vit_g.pth trained by yourself? #56

Closed Deaddawn closed 5 months ago

Deaddawn commented 5 months ago

Hi, I'm wondering whether eva_vit_g.pth (the visual encoder in the paper) was trained by yourselves, or whether it comes from this paper: https://arxiv.org/pdf/2211.07636.pdf?

dragen1860 commented 5 months ago

It comes from https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth and is kept frozen in all of stages 1, 2, and 3.
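
For reference, here is a minimal sketch of how that checkpoint could be loaded and frozen with plain PyTorch. The `load_frozen_eva_vit_g` helper is illustrative, not the repo's actual code, and it assumes the EVA ViT-g module itself is constructed elsewhere (e.g. by a LAVIS-style builder):

```python
# Minimal sketch (not the repo's exact code): download the BLIP-2 EVA ViT-g
# checkpoint linked above and keep the encoder frozen, as described in the reply.
import torch
import torch.nn as nn

EVA_VIT_G_URL = (
    "https://storage.googleapis.com/sfr-vision-language-research/"
    "LAVIS/models/BLIP2/eva_vit_g.pth"
)


def load_frozen_eva_vit_g(visual_encoder: nn.Module) -> nn.Module:
    """Load the pretrained EVA ViT-g weights into `visual_encoder` and freeze it.

    `visual_encoder` is assumed to be an EVA ViT-g module built elsewhere;
    this sketch only covers loading the checkpoint and freezing parameters.
    """
    state_dict = torch.hub.load_state_dict_from_url(
        EVA_VIT_G_URL, map_location="cpu"
    )
    # strict=False tolerates checkpoint keys (e.g. a classification head)
    # that the encoder architecture does not define.
    visual_encoder.load_state_dict(state_dict, strict=False)

    # Kept frozen across all three training stages, per the answer above.
    for param in visual_encoder.parameters():
        param.requires_grad = False
    return visual_encoder.eval()
```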