According to the paper, the training data in the first stage is 104 billion tokens. Since the captions are short, we can assume each caption has about 20 tokens, which gives 104B / 20 = 5,200M captions, an amazing number. Maybe my calculation is wrong; would you mind explaining how many captions were used during the first training stage? Thanks in advance.

Sorry, we have a typo there: around 10.4 billion tokens are used. The average caption length is about 52 tokens, and for the whole first-stage training data we went through about 200M image-text pairs, so the total token count is 0.2B * 52 = 10.4B.