> Traditional OCR datasets can be transformed into instruction-following datasets. For example, in a traditional OCR dataset, a data sample is an image paired with its OCR ground truths.
> We can construct many question templates, e.g., "Please identify all the words in the image." Then, we convert the OCR ground truths into the answer, e.g., "The text in the image includes:\ntext1\ntext2\ntext3"
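The conversion described above can be sketched as follows. This is a minimal illustration, not the actual InternVL data pipeline; the field names (`image`, `words`) and the question templates are hypothetical.

```python
import random

# Hypothetical question templates for the OCR task; the quoted answer gives
# one example ("please identify all the words in the image").
OCR_QUESTIONS = [
    "Please identify all the words in the image.",
    "What text appears in this image?",
]

def ocr_to_instruction(sample: dict) -> dict:
    """Turn an OCR sample {"image": ..., "words": [...]} into a QA pair.

    The ground-truth word list becomes the answer text, following the
    "The text in the image includes:" format from the quoted reply.
    """
    answer = "The text in the image includes:\n" + "\n".join(sample["words"])
    return {
        "image": sample["image"],
        "question": random.choice(OCR_QUESTIONS),
        "answer": answer,
    }

example = ocr_to_instruction(
    {"image": "doc.png", "words": ["text1", "text2", "text3"]}
)
print(example["answer"])
```

Running this prints the answer in the exact format shown in the quote, with one ground-truth word per line.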
Thank you for your answer! I still couldn't tell whether you mix OCR samples in an instruction-following format with image captioning samples during pre-training. From the code of InternVL-Chat-1.5, I couldn't figure out if the pre-training dataset is also in the instruction-following format.
Originally posted by @mumtozee in https://github.com/OpenGVLab/InternVL/issues/49#issuecomment-2173300435