SkyworkAI / Skywork

Skywork series models are pre-trained on 3.2TB of high-quality multilingual (mainly Chinese and English) and code data. We have open-sourced the model parameters, training data, evaluation data, and evaluation methods.

Data question #70

Closed yajunDai closed 8 months ago

yajunDai commented 8 months ago

In the open-sourced data, what do head and middle stand for? Are they related to data quality?

TianwenWei commented 8 months ago

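They are the output of the ccnet classifier: documents are bucketed by score into head, middle, and tail. A higher score means the text is closer to wiki or to wiki's references, but it does not necessarily mean the text quality is higher.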
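
For illustration only, the bucketing described above could look roughly like the sketch below. It is not the actual ccnet or Skywork code: the function name `bucket_by_score`, the equal-thirds split, and the convention that a higher score means "closer to wiki" are all assumptions made for this example, and the real pipeline may use different cutoffs or a different scoring direction.

```python
from typing import Dict, List, Tuple


def bucket_by_score(
    docs: List[Tuple[str, float]],
    head_frac: float = 1 / 3,  # assumed fraction, not the real cutoff
    tail_frac: float = 1 / 3,  # assumed fraction, not the real cutoff
) -> Dict[str, List[str]]:
    """Split (doc_id, score) pairs into head / middle / tail buckets.

    Assumes a higher score means the document is closer to wiki-like text.
    """
    # Rank documents from highest to lowest score.
    ranked = sorted(docs, key=lambda d: d[1], reverse=True)
    n = len(ranked)
    head_end = round(n * head_frac)
    tail_start = n - round(n * tail_frac)
    return {
        "head": [doc_id for doc_id, _ in ranked[:head_end]],
        "middle": [doc_id for doc_id, _ in ranked[head_end:tail_start]],
        "tail": [doc_id for doc_id, _ in ranked[tail_start:]],
    }


# Hypothetical scores, invented for this example.
docs = [("a", 0.9), ("b", 0.1), ("c", 0.5), ("d", 0.8), ("e", 0.3), ("f", 0.6)]
buckets = bucket_by_score(docs)
print(buckets)
```

The bucket label only records where a document falls in the score ranking, which is why, as noted above, a head document is not guaranteed to be better written than a middle one.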