TRI-ML / prismatic-vlms

A flexible and efficient codebase for training visually-conditioned language models (VLMs)
MIT License

Add New LLM Backbones #27

Closed siddk closed 4 months ago

siddk commented 4 months ago

Adds Llama-2 Chat, Mistral v0.1, Mistral v0.1 Instruct, and Phi-2 LLM backbones. Note that these model configs match the structure of our paper (one-off changes on top of the `One_Stage` base configuration). All of these models can be further improved by training with the "Prism" configuration (extra data, DINO + SigLIP vision backbones, etc.).
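To illustrate the "one-off changes on top of a base configuration" pattern, here is a minimal sketch using plain Python dataclass inheritance. The class and field names (`OneStageConfig`, `llm_backbone_id`, etc.) are illustrative assumptions, not the actual Prismatic internals:

```python
from dataclasses import dataclass

# Hypothetical base config; field names are illustrative only.
@dataclass
class OneStageConfig:
    model_id: str = "one-stage+7b"
    llm_backbone_id: str = "vicuna-v15-7b"
    vision_backbone_id: str = "clip-vit-l-336px"
    finetune_epochs: int = 1

# A "one-off" experiment config overrides only the fields that differ
# from the base; everything else is inherited unchanged.
@dataclass
class MistralOneStageConfig(OneStageConfig):
    model_id: str = "one-stage+mistral-v0.1-7b"
    llm_backbone_id: str = "mistral-v0.1-7b"

cfg = MistralOneStageConfig()
```

This keeps each new LLM variant to a few lines, since the shared training recipe lives in the base class.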

Evaluation Results:

| Model | VQAv2 | GQA | VizWiz | TextVQA (Pure / OCR) | RefCOCO+ | OCID-Ref | VSR | POPE | TallyQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVa v1.5 7B (Base) | 76.54 | 61.58 | 54.25 | 46.13 / 58.25 | 49.47 | 35.07 | 51.47 | 86.57 | 62.06 |
| Llama-2 Chat 7B | 76.92 | 62.11 | 56.39 | 45.3 / 56.6 | 58.5 | 46.3 | 61.8 | 86.8 | 58.1 |
| Llama-2 Chat 13B | 78.0 | 63.60 | 56.43 | 57.2 / 58.4 | 62.9 | 44.9 | 71.4 | 86.8 | 58.9 |
| Mistral v0.1 7B | 77.30 | 63.30 | 55.32 | 44.4 / 49.3 | 65.1 | 48.8 | 58.5 | 87.1 | 61.7 |
| Mistral Instruct v0.1 7B | 77.13 | 62.71 | 54.35 | 44.1 / 50.5 | 64.9 | 48.0 | 57.8 | 87.5 | 64.5 |
| Phi-2 3B | 41.47 | 33.38 | 12.18 | 6.6 / 31.0 | 5.7 | 1.5 | 48.7 | 48.2 | 20.2 |
| Llama-2 (Best LLM from Paper) | 77.08 | 62.44 | 55.98 | 44.92 / 55.24 | 59.47 | 43.89 | 63.67 | 86.74 | 59.22 |
| Prism DINOSigLIP 7B (Controlled) | 79.05 | 64.16 | 59.82 | 51.78 / 58.69 | 67.85 | 50.56 | 66.28 | 88.28 | 65.07 |

Note that the Phi-2 results are fairly poor; it would be good to dig into this (perhaps something is off with the prompting scheme).

Hopefully, this PR also serves as a template for folks looking to add their own LLMs to Prismatic -- low-hanging fruit includes adding the Gemma, Llama-3, and Phi-3 LLMs!
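For contributors using this PR as a template, the core of adding a new LLM backbone is mapping a backbone ID to the information needed to load the model and build prompts. The sketch below is a hypothetical registry; the names (`LLMBackboneSpec`, `register_llm`, the `prompt_style` strings) are illustrative assumptions rather than the actual Prismatic API, though the Hugging Face checkpoint IDs are real:

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical spec for one LLM backbone (fields are illustrative).
@dataclass
class LLMBackboneSpec:
    hf_model_id: str   # Hugging Face Hub checkpoint to load
    prompt_style: str  # which prompt builder / chat template to use

# Backbones added in this PR, keyed by an internal backbone ID.
LLM_REGISTRY: Dict[str, LLMBackboneSpec] = {
    "llama2-7b-chat": LLMBackboneSpec("meta-llama/Llama-2-7b-chat-hf", "llama2-chat"),
    "mistral-v0.1-7b": LLMBackboneSpec("mistralai/Mistral-7B-v0.1", "plain"),
    "phi-2-3b": LLMBackboneSpec("microsoft/phi-2", "phi"),
}

def register_llm(backbone_id: str, spec: LLMBackboneSpec) -> None:
    """Register a new backbone; training configs can then reference `backbone_id`."""
    if backbone_id in LLM_REGISTRY:
        raise ValueError(f"Backbone '{backbone_id}' is already registered!")
    LLM_REGISTRY[backbone_id] = spec

# e.g., a contributor adding Gemma might write:
register_llm("gemma-7b", LLMBackboneSpec("google/gemma-7b", "gemma"))
```

In this pattern, everything else (vision backbone, projector, training recipe) stays untouched, which is what keeps new-LLM PRs small.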


CC @shikhar-srivastava @zeyuanyin @Hannibal046 @RylanSchaeffer

Resolves #6 Resolves #25