[Question] When training a vertical-domain model, how many tokens of incremental pretraining are needed to get good results?

baichuan-inc / Baichuan-7B

A large-scale 7B pretraining language model developed by BaiChuan-Inc.

https://huggingface.co/baichuan-inc/baichuan-7B

Apache License 2.0

5.67k stars 504 forks

[Question] When training a vertical-domain model, how many tokens of incremental pretraining are needed to get good results? #112

Open parkLGW opened 1 year ago

parkLGW commented 1 year ago

Required prerequisites

[X] I have read the documentation https://github.com/baichuan-inc/baichuan-7B/blob/HEAD/README.md.
[X] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
[X] Consider asking first in a Discussion.

Questions

When training a vertical-domain model, how many tokens of incremental pretraining are needed to get reasonably good results?

Checklist

[X] I have provided all relevant and necessary information above.
[X] I have chosen a suitable title for this issue.

hingkan commented 1 year ago

A question for the experts: train.py only takes tokenizer_path and is given no input_model_path, so how does it manage to do incremental pretraining?

parkLGW commented 1 year ago

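> A question for the experts: train.py only takes tokenizer_path and is given no input_model_path, so how does it manage to do incremental pretraining?

Aren't the model and the tokenizer both under the same path?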

hingkan commented 1 year ago

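In the md file I saw "Download the tokenizer model file tokenizer.model and place it in the project directory," so I assumed the model was being retrained based on tokenizer.model. At the time I figured the model either had a default path or was downloaded when modeling_baichuan.py was called. If the model and the tokenizer are loaded from the same folder by default, then it makes sense. Thanks for clearing that up!

While I'm here, one more question: which format should the pretraining data be in?

Format 1:
"""
doc1
doc2
doc3
...
"""

Format 2:
"""
{"text": "doc1"}
{"text": "doc2"}
{"text": "doc3"}
...
"""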
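To make the difference between the two candidate formats concrete, here is a minimal sketch of how each could be parsed into a list of documents. The thread does not say which format train.py expects, so this only illustrates the two layouts. The function names `iter_plain_lines` and `iter_jsonl` and the `key` parameter are hypothetical and are not taken from the Baichuan-7B repository.

```python
import json
from typing import Iterable, Iterator


def iter_plain_lines(lines: Iterable[str]) -> Iterator[str]:
    """Format 1: one raw document per line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield line


def iter_jsonl(lines: Iterable[str], key: str = "text") -> Iterator[str]:
    """Format 2: one JSON object per line, document stored under `key` (assumed)."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)[key]


# In-memory stand-ins for the two file layouts from the question.
format_one = ["doc1\n", "doc2\n", "doc3\n"]
format_two = ['{"text": "doc1"}\n', '{"text": "doc2"}\n', '{"text": "doc3"}\n']

docs_one = list(iter_plain_lines(format_one))
docs_two = list(iter_jsonl(format_two))
print(docs_one == docs_two)
```

Both helpers accept any iterable of lines, so an open file handle works as well as a list. One practical difference: a document in Format 2 can contain escaped newlines inside its JSON string, whereas Format 1 cannot represent a document that spans several lines.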