Open · zheedong opened this issue 8 months ago
Thanks for your great work.

In the paper, you say the tokenizer's training data is 'CC3M, Unsplash, LAION-COCO, MS-COCO'. Did you use all of those datasets in full, or did you do some filtering? What is the total amount of training data used for tokenizer training?

Also, did you use the same training dataset in stage 1 and stage 2 of tokenizer training?

I looked through your code, but I cannot find the configs for the training dataset. Can you tell me more details about it? How many epochs do you train for?

Yes, we use all of 'CC3M, Unsplash, LAION-COCO, MS-COCO' for training the tokenizer in both stage 1 and stage 2. The total amount of training data is almost 500M.
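For readers trying to reproduce this setup, below is a minimal sketch (not the repo's actual data pipeline or config, which this thread notes is not published) of how the four image-text sources could be concatenated into one corpus and reused unchanged for both tokenizer training stages. All class names, directory layouts, and hyperparameters here are hypothetical placeholders.

```python
# Hypothetical sketch of building one mixed image-text corpus from the four
# sources named in the paper; paths and the folder layout are assumptions.
from pathlib import Path

from PIL import Image
from torch.utils.data import ConcatDataset, DataLoader, Dataset
from torchvision import transforms


class ImageCaptionFolder(Dataset):
    """Toy dataset: expects <root>/<id>.jpg paired with <root>/<id>.txt captions."""

    def __init__(self, root: str):
        self.images = sorted(Path(root).glob("*.jpg"))
        self.to_tensor = transforms.Compose(
            [transforms.Resize((256, 256)), transforms.ToTensor()]
        )

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img_path = self.images[idx]
        image = self.to_tensor(Image.open(img_path).convert("RGB"))
        caption = img_path.with_suffix(".txt").read_text().strip()
        return image, caption


# Hypothetical roots for the four sources; LAION-COCO would dominate the ~500M total.
sources = ["data/cc3m", "data/unsplash", "data/laion_coco", "data/mscoco"]
train_set = ConcatDataset([ImageCaptionFolder(r) for r in sources])

# Per the reply above, the same concatenated corpus would feed both stage-1 and
# stage-2 tokenizer training; batch size and workers here are illustrative only.
loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=8)
```

At the ~500M-sample scale described above, a sharded streaming format (e.g., WebDataset tar shards) would be more practical than a folder-of-files layout, but the mixing idea is the same: one combined corpus shared across both stages.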