SkyworkAI / Skywork

Skywork series models are pre-trained on 3.2TB of high-quality multilingual (mainly Chinese and English) and code data. We have open-sourced the model weights, training data, evaluation data, and evaluation methods.

Question about the data used in the first and second pre-training stages #29

Closed zgctmac closed 10 months ago

zgctmac commented 10 months ago

The first pre-training stage used general data, and the second stage added some domain-specific data. In the second stage, was the domain-specific data trained together with the first stage's data, or was it trained on its own? If it was trained separately, wouldn't that cause forgetting of general knowledge? Looking forward to your reply!

zhao1iang commented 10 months ago

The second-stage training uses 20% domain-specific data and 80% general data.
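The 20/80 mix described above can be sketched as a simple sampling loop. This is an illustrative sketch only, not Skywork's actual data pipeline; the function name, the toy document lists, and the `domain_ratio` parameter are all assumptions for demonstration.

```python
import random

def mix_corpora(general_docs, domain_docs, domain_ratio=0.2, seed=0):
    """Interleave two corpora so that roughly `domain_ratio` of the
    resulting stream is drawn from the domain-specific corpus.
    Documents are sampled with replacement here; a real pre-training
    pipeline would shard, deduplicate, and shuffle on disk instead."""
    rng = random.Random(seed)
    total = len(general_docs) + len(domain_docs)
    mixed = []
    for _ in range(total):
        if rng.random() < domain_ratio:
            mixed.append(rng.choice(domain_docs))
        else:
            mixed.append(rng.choice(general_docs))
    return mixed

# Toy corpora: 800 general documents, 200 domain documents.
general = [f"general-{i}" for i in range(800)]
domain = [f"domain-{i}" for i in range(200)]
stream = mix_corpora(general, domain, domain_ratio=0.2)
frac = sum(d.startswith("domain") for d in stream) / len(stream)
print(f"domain fraction: {frac:.2f}")
```

Mixing the corpora (rather than training on domain data alone) is what avoids the forgetting problem raised in the question: every batch still sees mostly general-domain tokens.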

zgctmac commented 10 months ago

When running the pre-training code, I found that the vocabulary size read by the model is 0, while the saved checkpoint has 65519, so the model resizes the embedding dimension. Is this normal?

Log output:
11/07/2023 11:19:33 - INFO - main - Model vocab size: 0
11/07/2023 11:19:33 - INFO - main - len(tokenizer): 65519
11/07/2023 11:19:33 - INFO - main - Resize model vocab size to 65519
[INFO|modeling_utils.py:1617] 2023-11-07 11:19:33,953 >> You are resizing the embedding layer without providing a pad_to_multiple_of parameter. This means that the new embedding dimension will be 65519. This might induce some performance reduction as Tensor Cores will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
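The warning in that log refers to the `pad_to_multiple_of` argument of `resize_token_embeddings` in Hugging Face Transformers: rounding the vocabulary up to a multiple such as 64 keeps the embedding matrix's shape Tensor-Core friendly. A minimal sketch of the rounding arithmetic (the helper name is illustrative):

```python
def padded_vocab_size(vocab_size, pad_to_multiple_of=64):
    """Round the embedding row count up to the nearest multiple of
    `pad_to_multiple_of`, as resize_token_embeddings does when the
    argument is supplied. Multiples of 8 (fp16) or 64 are typical
    choices for Tensor Core eligibility."""
    m = pad_to_multiple_of
    return ((vocab_size + m - 1) // m) * m

print(padded_vocab_size(65519))  # -> 65536
```

With the tokenizer length from the log above, 65519 would be padded to 65536, avoiding the odd-shaped matrix multiply the warning describes.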

zhao1iang commented 10 months ago

That is probably just a display issue; our tests did not find anything abnormal. If you find that the results do not match expectations, please contact us and we will dig deeper into whether the demo has a bug.