SkyworkAI / Skywork

Skywork series models are pre-trained on 3.2TB of high-quality multilingual (mainly Chinese and English) and code data. We have open-sourced the model weights, training data, evaluation data, and evaluation methods.

Question about the data used in the first and second pre-training stages #29

Closed zgctmac closed 10 months ago

zgctmac commented 10 months ago

The first pre-training stage used general data, and the second stage added some domain-specific data. In the second stage, was the domain-specific data trained together with the first stage's data, or was it trained on its own? If it was trained separately, wouldn't that cause forgetting of general knowledge? Looking forward to your reply!

zhao1iang commented 10 months ago

The second-stage training uses 20% domain-specific data and 80% general data.
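The 20/80 mix described above can be sketched as a simple sampling loop. This is an illustrative sketch only, not Skywork's actual data pipeline; the function name, the toy document lists, and the `domain_ratio` parameter are all assumptions for demonstration.

```python
import random

def mix_corpora(general_docs, domain_docs, domain_ratio=0.2, seed=0):
    """Interleave two corpora so that roughly `domain_ratio` of the
    resulting stream is drawn from the domain-specific corpus.
    Documents are sampled with replacement here; a real pre-training
    pipeline would shard, deduplicate, and shuffle on disk instead."""
    rng = random.Random(seed)
    total = len(general_docs) + len(domain_docs)
    mixed = []
    for _ in range(total):
        if rng.random() < domain_ratio:
            mixed.append(rng.choice(domain_docs))
        else:
            mixed.append(rng.choice(general_docs))
    return mixed

# Toy corpora: 800 general documents, 200 domain documents.
general = [f"general-{i}" for i in range(800)]
domain = [f"domain-{i}" for i in range(200)]
stream = mix_corpora(general, domain, domain_ratio=0.2)
frac = sum(d.startswith("domain") for d in stream) / len(stream)
print(f"domain fraction: {frac:.2f}")
```

Mixing the corpora (rather than training on domain data alone) is what avoids the forgetting problem raised in the question: every batch still sees mostly general-domain tokens.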

zgctmac commented 10 months ago

When running the pre-training code, I found that the vocabulary size read by the model is 0, while the saved checkpoint has 65519, so the model resizes the embedding dimension. Is this normal?

Log output:
11/07/2023 11:19:33 - INFO - main - Model vocab size: 0
11/07/2023 11:19:33 - INFO - main - len(tokenizer): 65519
11/07/2023 11:19:33 - INFO - main - Resize model vocab size to 65519
[INFO|modeling_utils.py:1617] 2023-11-07 11:19:33,953 >> You are resizing the embedding layer without providing a pad_to_multiple_of parameter. This means that the new embedding dimension will be 65519. This might induce some performance reduction as Tensor Cores will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
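The warning in that log refers to the `pad_to_multiple_of` argument of `resize_token_embeddings` in Hugging Face Transformers: rounding the vocabulary up to a multiple such as 64 keeps the embedding matrix's shape Tensor-Core friendly. A minimal sketch of the rounding arithmetic (the helper name is illustrative):

```python
def padded_vocab_size(vocab_size, pad_to_multiple_of=64):
    """Round the embedding row count up to the nearest multiple of
    `pad_to_multiple_of`, as resize_token_embeddings does when the
    argument is supplied. Multiples of 8 (fp16) or 64 are typical
    choices for Tensor Core eligibility."""
    m = pad_to_multiple_of
    return ((vocab_size + m - 1) // m) * m

print(padded_vocab_size(65519))  # -> 65536
```

With the tokenizer length from the log above, 65519 would be padded to 65536, avoiding the odd-shaped matrix multiply the warning describes.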

zhao1iang commented 10 months ago

That is probably just a display issue; our tests did not find anything abnormal. If you find that the results do not match expectations, please contact us and we will dig deeper into whether the demo has a bug.