Ucas-HaoranWei / Vary

[ECCV 2024] Official code implementation of Vary: Scaling Up the Vision Vocabulary of Large Vision Language Models.
1.77k stars, 156 forks

Pretraining of the Vary-tiny part #62

Open SLTK1 opened 8 months ago

SLTK1 commented 8 months ago

This feels like a DONUT-style architecture, just with the components swapped for SAM weights and OPT-125M.

Ucas-HaoranWei commented 8 months ago

They're really not alike. DONUT uses cross-attention, while Vary-tiny uses an image-token prefix; these are the two main types of VLM today.
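To make the distinction concrete, here is a toy sketch in plain Python. It is not code from Vary or DONUT: the function names (`attend`, `cross_attention_step`, `prefix_step`, `make_tokens`) are hypothetical, and learned projections, multiple heads, and the decoder's own self-attention are all left out for brevity. It only shows where the image features enter the language model in each style.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


def cross_attention_step(text_states, image_feats):
    """Cross-attention style: text states are the queries, image features
    are the keys and values. The image never joins the text sequence, so
    the output has one vector per text token."""
    return attend(text_states, image_feats, image_feats)


def prefix_step(text_embeds, image_tokens):
    """Prefix style: image tokens are prepended to the text embeddings and
    the decoder runs causal self-attention over the joint sequence."""
    seq = image_tokens + text_embeds
    outputs = []
    for i in range(len(seq)):
        context = seq[: i + 1]  # causal mask: position i sees positions 0..i
        outputs.append(attend([seq[i]], context, context)[0])
    return outputs


def make_tokens(n, dim, rng):
    """Hypothetical stand-in for encoder outputs or token embeddings."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
image = make_tokens(4, 8, rng)  # 4 image tokens
text = make_tokens(3, 8, rng)   # 3 text tokens

print(len(cross_attention_step(text, image)))  # 3: text length only
print(len(prefix_step(text, image)))           # 7: image prefix + text
```

In the prefix style the decoder needs no extra cross-attention layers, since the image tokens are simply more positions in its input sequence.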

SLTK1 commented 8 months ago

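Indeed. In terms of usage, your Vary-tiny seems more flexible than DONUT-style models. I'm not sure whether this part of Vary-tiny is your original design or whether there was similar prior work.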