dvlab-research / LLaMA-VID

Official Implementation for LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Apache License 2.0

About the json in stage2 and stage3 #79

Open liziming5353 opened 3 months ago

liziming5353 commented 3 months ago

Why does the data in stages 2 and 3 contain pure-text Q&A without images or videos?

Becomebright commented 3 months ago

According to DeepSeek-VL:

> Maintaining a significant proportion of language data—specifically, at least 70%—is essential to preserve the integrity of language knowledge within the model.
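If you want to check the text-to-multimodal ratio in the annotation files yourself, here is a minimal sketch. It assumes the stage-2/3 JSON follows the LLaVA-style convention, where each entry is a dict carrying an `"image"` or `"video"` key when the sample is multimodal and neither key when it is pure-text Q&A; the file name is a placeholder, so verify the keys against the actual file.

```python
import json
from collections import Counter

# Placeholder path; substitute the real stage-2/3 annotation file.
ANNOTATION_PATH = "llama_vid_stage2.json"

def modality_breakdown(path: str) -> Counter:
    """Count samples by modality in a LLaVA-style annotation list.

    Assumes multimodal entries carry an "image" or "video" key and
    pure-text entries carry neither (verify against the real file).
    """
    with open(path) as f:
        samples = json.load(f)
    counts = Counter()
    for sample in samples:
        if "video" in sample:
            counts["video"] += 1
        elif "image" in sample:
            counts["image"] += 1
        else:
            counts["text-only"] += 1
    return counts

if __name__ == "__main__":
    counts = modality_breakdown(ANNOTATION_PATH)
    total = sum(counts.values())
    for modality, n in counts.items():
        print(f"{modality}: {n} ({n / total:.1%})")
```

Printing the per-modality percentages makes it easy to compare the mix against the roughly 70% language-data proportion that DeepSeek-VL recommends.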