OpenGVLab / InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4V. A commercially usable open-source multimodal dialogue model approaching GPT-4V performance.
https://internvl.github.io/
MIT License

Pretrain OCR datasets structure #306


toshiks commented 1 week ago
> Traditional OCR datasets can be transformed into instruction-following datasets. For example, in the traditional OCR dataset, a data sample is an image with OCR ground truths.
>
> We can construct a lot of questions, e.g., "please identify all the words in the image." Then, convert the OCR ground truths to the answer, e.g., "The text in the image includes:\ntext1\ntext2\ntext3"

Thank you for your answer! I couldn't tell whether you mix OCR samples in an instruction-following format with image-captioning samples during pre-training. From the code of InternVL-Chat-1.5, I couldn't figure out whether the pre-training dataset is also in the instruction-following format.

Originally posted by @mumtozee in https://github.com/OpenGVLab/InternVL/issues/49#issuecomment-2173300435

Weiyun1025 commented 3 days ago

Thank you for your interest in our work!

The data used during the pre-training phase is in the instruction-following format.
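For reference, the recipe quoted above can be sketched in a few lines. This is a hypothetical helper, not code from the InternVL repository; the `conversations`-style JSON layout with `<image>` placeholders is an assumption modeled on common LLaVA-style instruction-tuning data, so adapt the field names to whatever format your training pipeline expects.

```python
import random

# Hypothetical question templates for the OCR-to-instruction conversion
# described above (not taken from the InternVL codebase).
QUESTION_TEMPLATES = [
    "Please identify all the words in the image.",
    "What text is visible in this image?",
    "Read and list all the text in the picture.",
]

def ocr_to_instruction(image_path, ocr_texts, seed=None):
    """Turn one traditional OCR sample (image + ground-truth strings)
    into an instruction-following sample."""
    rng = random.Random(seed)
    question = rng.choice(QUESTION_TEMPLATES)
    # Join the OCR ground truths into the answer string, as in the
    # example "The text in the image includes:\ntext1\ntext2\ntext3".
    answer = "The text in the image includes:\n" + "\n".join(ocr_texts)
    return {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": "<image>\n" + question},
            {"from": "gpt", "value": answer},
        ],
    }

sample = ocr_to_instruction("imgs/0001.jpg", ["text1", "text2", "text3"], seed=0)
print(sample["conversations"][1]["value"])
# prints:
# The text in the image includes:
# text1
# text2
# text3
```

Writing one such dict per line (e.g. with `json.dumps`) yields a JSONL file that mixes naturally with other instruction-format data such as captioning samples.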