SkyworkAI / Skywork

Skywork series models are pre-trained on 3.2TB of high-quality multilingual (mainly Chinese and English) and code data. We have open-sourced the model weights, training data, evaluation data, and evaluation methods.

Can we get more information about deduplication ratio? #10

Closed ax7e closed 10 months ago

ax7e commented 10 months ago

Thank you for your dedication to transparency in your work. Could you share more details about the deduplication ratio observed across your corpus-processing pipeline?

TianwenWei commented 10 months ago

Thank you for your interest in our work.

Our SkyPile corpus combines data from multiple sources, and we carried out both inter-source and intra-source deduplication multiple times. Because different sources have very different deduplication ratios (e.g., almost no duplication among books, but very high duplication among webpages), it is hard to give a single "overall" deduplication ratio for the entire pipeline.
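To make the per-source ratio concrete, here is a minimal illustrative sketch of intra-source exact deduplication via content hashing, with the ratio computed per source. This is only an assumption-laden toy example, not the actual SkyPile pipeline (which the authors have not published in detail); real web-scale pipelines typically also use near-duplicate methods such as MinHash.

```python
import hashlib

def dedup_exact(docs):
    """Keep the first occurrence of each document, comparing
    by a hash of lightly normalized text. Illustrative only --
    not the actual SkyPile deduplication code."""
    seen = set()
    kept = []
    for doc in docs:
        h = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

# Hypothetical per-source corpora: books show little duplication,
# webpages show a lot -- so each source gets its own ratio.
sources = {
    "books": ["Novel A", "Novel B", "Novel C"],
    "webpages": ["page one", "Page One ", "page one", "page two"],
}
for name, docs in sources.items():
    deduped = dedup_exact(docs)
    ratio = 1 - len(deduped) / len(docs)
    print(f"{name}: removed {ratio:.0%}")
```

Running this prints a different removal ratio for each source, which is why a single pipeline-wide number is hard to define.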

We have removed more than 90% of the webpage data due to low quality or duplication.