TinyLLaVA / TinyLLaVA_Factory

A Framework of Small-scale Large Multimodal Models
https://arxiv.org/abs/2402.14289
Apache License 2.0

About "conv_version" and PretrainTemplate #101

NyKxo1 closed this issue 3 months ago

NyKxo1 commented 3 months ago

I found that "conv_version" in pretrain.sh is "pretrain", so PretrainTemplate is used in TextPreprocess. While debugging, I found that PretrainTemplate drops the question, so the returned prompt and the resulting input_ids contain only the answer. May I ask what the purpose of this is?

YingHuTsing commented 3 months ago

Hi, sorry for the late reply. This is because the pretrain stage aims at image-text alignment. It is similar to training a CLIP model, where only an image and its corresponding caption are passed to the model. So in the pretrain stage, the only text input we need is the image caption, which is the answer, rather than the question and prompt.
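
To illustrate the difference, here is a minimal sketch of the two prompt styles. It is not the repository's actual code: the function names, the `USER:`/`ASSISTANT:` layout, and the exact placement of the `<image>` placeholder are assumptions chosen only to show that a pretrain-style template keeps the caption and discards the question.

```python
IMAGE_TOKEN = "<image>"  # assumed image placeholder string


def build_chat_prompt(question: str, answer: str) -> str:
    """Hypothetical fine-tuning-style prompt: keeps both question and answer."""
    return f"USER: {question} ASSISTANT: {answer}"


def build_pretrain_prompt(question: str, answer: str) -> str:
    """Hypothetical pretrain-style prompt for image-text alignment.

    The question is dropped on purpose; only the image placeholder and the
    caption (the "answer" field of the sample) remain.
    """
    del question  # intentionally unused in the pretrain stage
    return f"{IMAGE_TOKEN}{answer}\n"


if __name__ == "__main__":
    question = f"{IMAGE_TOKEN}\nDescribe the image briefly."
    caption = "A dog running on a beach."
    print(repr(build_chat_prompt(question, caption)))
    print(repr(build_pretrain_prompt(question, caption)))
```

With a template like the second one, tokenizing the prompt would naturally yield input_ids that cover only the caption, which matches what was observed while debugging.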