dvlab-research / LLaMA-VID

Official Implementation for LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Apache License 2.0
622 stars, 39 forks

Why do stage 1 and stage 2 use different `--version` parameters (`plain_guided` and `imgsp_v1`)? #55

Closed: dragen1860 closed this issue 5 months ago

dragen1860 commented 5 months ago

Dear all: why do stage 1 and stage 2 use different `--version` parameters (`--version plain_guided` and `--version imgsp_v1`)? Thank you.

yanwei-li commented 5 months ago

Hi, this is because in stage 1 we follow LLaVA and do not add instructions before (or after) the image tokens passed to the LLM. In stage 2, we append the user instruction in each conversation turn.
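
To make the difference concrete, here is a minimal Python sketch of the two prompt styles described above. It is only an illustration, not the repository's actual conversation templates: the function names, the `<image>` placeholder string, and the `USER:` / `ASSISTANT:` role labels are all assumptions chosen for the example.

```python
from typing import List, Tuple

# Assumed placeholder that marks where image tokens are inserted (hypothetical).
IMAGE_TOKEN = "<image>"


def build_plain_prompt(caption: str) -> str:
    """Stage-1 style: the image placeholder is followed directly by the
    target caption, with no instruction before or after it."""
    return f"{IMAGE_TOKEN}{caption}\n"


def build_instruct_prompt(turns: List[Tuple[str, str]]) -> str:
    """Stage-2 style: every conversation turn carries a user instruction.
    The image placeholder is attached to the first turn only (an assumption)."""
    parts = []
    for i, (instruction, answer) in enumerate(turns):
        user_text = f"{IMAGE_TOKEN}\n{instruction}" if i == 0 else instruction
        parts.append(f"USER: {user_text} ASSISTANT: {answer}")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_plain_prompt("A dog running on grass."))
    print(build_instruct_prompt([
        ("What is shown in the image?", "A dog."),
        ("What color is it?", "Brown."),
    ]))
```

In this sketch, the stage 1 prompt contains no instruction text at all, while the stage 2 prompt repeats the instruction/answer pattern for every turn, which is why a different template would be selected through `--version` in each stage.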