baichuan-inc/Baichuan-7B
A large-scale 7B pretraining language model developed by BaiChuan-Inc.
https://huggingface.co/baichuan-inc/baichuan-7B
Apache License 2.0
[Question] What exactly is the data quality scoring model, and how does it assign scores?
#78
Open
lvcc2018 opened 1 year ago
lvcc2018 commented 1 year ago
Required prerequisites
[X] I have read the documentation https://github.com/baichuan-inc/baichuan-7B/blob/HEAD/README.md.
[X] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
[X] Consider asking first in a Discussion.
Questions
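The README says: "Drawing on related data work, frequency and quality are the two dimensions given the most weight in data processing. We filter the raw dataset at document and sentence granularity using heuristic rules and quality-model scoring. On the full dataset, we use locality-sensitive hashing to deduplicate at document and sentence granularity."
I'm curious which "related data work" this refers to, and what the quality model was trained on.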
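For context on the deduplication step quoted from the README, here is a minimal sketch of near-duplicate removal with MinHash and banded locality-sensitive hashing. It is not Baichuan's implementation: the README does not state the hash family, shingle size, band layout, or similarity threshold, so `NUM_PERM`, `BANDS`, `SHINGLE`, the `threshold` default, and the function names below are all assumptions chosen for illustration. It says nothing about how the quality model itself works.

```python
import hashlib

NUM_PERM = 64   # assumed signature length (number of hash seeds)
BANDS = 16      # assumed number of LSH bands, 4 signature rows per band
SHINGLE = 5     # assumed character n-gram size


def shingles(text, n=SHINGLE):
    """Return the set of character n-grams of whitespace-normalized text."""
    text = " ".join(text.split())
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash(text):
    """MinHash signature: for each seed, the minimum hash over all shingles."""
    grams = [g.encode("utf-8") for g in shingles(text)]
    sig = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")  # the seed selects the hash function
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "little")
            for g in grams
        ))
    return tuple(sig)


def dedup(docs, threshold=0.8):
    """Keep the first document of each near-duplicate group, in input order."""
    rows = NUM_PERM // BANDS
    buckets = {}   # (band index, band values) -> indices of kept documents
    sigs = {}      # index of kept document -> its signature
    kept = []
    for idx, doc in enumerate(docs):
        sig = minhash(doc)
        bands = [(b, sig[b * rows:(b + 1) * rows]) for b in range(BANDS)]
        # Documents sharing at least one identical band are candidates.
        candidates = {j for key in bands for j in buckets.get(key, ())}
        # Confirm candidates with the signature-based Jaccard estimate.
        is_dup = any(
            sum(a == b for a, b in zip(sig, sigs[j])) / NUM_PERM >= threshold
            for j in candidates
        )
        if is_dup:
            continue
        kept.append(idx)
        sigs[idx] = sig
        for key in bands:
            buckets.setdefault(key, []).append(idx)
    return [docs[i] for i in kept]
```

The same routine can be run on sentences instead of whole documents to approximate the sentence-granularity pass the README mentions. Production pipelines usually rely on a tested library rather than hand-rolled hashing.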
Checklist
[X] I have provided all relevant and necessary information above.
[X] I have chosen a suitable title for this issue.
lichen914 commented 1 year ago
I have the same question. Did you ever find the related work?