AILab-CVC / SEED

Official implementation of SEED-LLaMA (ICLR 2024).
https://ailab-cvc.github.io/seed

Training Data of Tokenizer #27

Open zheedong opened 4 months ago

zheedong commented 4 months ago

Thanks for your great work.

In the paper, you say that the tokenizer's training data is 'CC3M, Unsplash, LAION-COCO, MS-COCO'. Did you use all four of those datasets in full, or did you do some filtering? What is the total amount of training data used for tokenizer training?

And did you use the same training dataset in stage 1 and stage 2 of tokenizer training?

geyuying commented 4 months ago

Yes, we use the full set of 'CC3M, Unsplash, LAION-COCO, MS-COCO' for training the tokenizer in both stage 1 and stage 2. The total amount of training data is almost 500M.
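
For a concrete picture, below is a minimal sketch of how these four corpora could be concatenated into a single training set with PyTorch. The `ImageTextDataset` class, the directory paths, and the JSON annotation format are all hypothetical placeholders for illustration, not the actual SEED data pipeline.

```python
import json
import os

from PIL import Image
from torch.utils.data import ConcatDataset, DataLoader, Dataset


class ImageTextDataset(Dataset):
    """Minimal image-text pair dataset; paths and annotation format are placeholders."""

    def __init__(self, root, annotation_file, transform=None):
        # Assumes a JSON list of {"image": <relative path>, "caption": <str>} entries.
        with open(annotation_file) as f:
            self.samples = json.load(f)
        self.root = root
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        entry = self.samples[idx]
        image = Image.open(os.path.join(self.root, entry["image"])).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, entry["caption"]


# Combine all four corpora into one training set, mirroring the reply above.
datasets = [
    ImageTextDataset("data/cc3m", "data/cc3m/annotations.json"),
    ImageTextDataset("data/unsplash", "data/unsplash/annotations.json"),
    ImageTextDataset("data/laion_coco", "data/laion_coco/annotations.json"),
    ImageTextDataset("data/mscoco", "data/mscoco/annotations.json"),
]
train_set = ConcatDataset(datasets)
loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=8)
```

At the ~500M-sample scale quoted above, a streaming shard-based loader (e.g., webdataset) would likely be more practical than in-memory annotation lists, but the concatenation idea is the same.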

zheedong commented 3 months ago

I looked through your code, but I cannot find the configs for the training dataset. Can you share more details about it? How many epochs do you train for?