X-PLUG / mPLUG-Owl

mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
https://www.modelscope.cn/studios/damo/mPLUG-Owl
MIT License

The training data in the first stage #25

Closed Richar-Du closed 1 year ago

Richar-Du commented 1 year ago

According to the paper, the training data in the first stage is 104 billion tokens. Since captions are short, assuming each caption has about 20 tokens gives 104B / 20 = 5,200M captions, which is an amazing amount. Maybe my calculation is wrong; would you mind explaining the number of captions you used during the first training stage? Thanks in advance.

MAGAer13 commented 1 year ago

Sorry, that is a typo in the paper: around 10.4 billion tokens were used. The average caption length is about 52 tokens, and we went through about 200M image-text pairs in the first training stage, so the total token count is 0.2B * 52 = 10.4B.
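
A quick back-of-envelope check of this figure (a minimal sketch; the 200M pairs and 52-token average are the numbers quoted above, and the variable names are purely illustrative):

```python
# Back-of-envelope check of the stage-1 token count described in the reply.
image_text_pairs = 200_000_000   # ~0.2B image-caption pairs in stage 1 (quoted above)
avg_caption_tokens = 52          # average caption length in tokens (quoted above)

total_tokens = image_text_pairs * avg_caption_tokens
print(f"{total_tokens / 1e9:.1f}B tokens")  # -> 10.4B tokens
```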