Closed NyKxo1 closed 3 months ago
Hi. sorry for late reply. This is because the pretrain stage aims to do image-text align, like, when you train a clip model , only image and its corresponding caption are passed to models. So in pretrain stage, for text input, we only need image captions, which is the answers, rather than questions and prompts.
I found that the "conv_version" in pretrain.sh is "pretrain", so PretrainTemplate is used in TextPreprocess. When I debug, I found that PretrainTemplate loses the question, and the returned prompt and subsequent input_ids are only answers.May I ask what the purpose of this is?