DISC-Law-SFT-Pair specific information of each type of data

mawenju203 commented 1 year ago

What role does each type of data in DISC-Law-SFT-Pair play during model training? What information is each type mainly meant to give the model?

DISC-Law-SFT-Pair:

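| ID | Count | Category |
| --- | --- | --- |
| 'jud_doc_sum' | 8234 | Document summarization |
| 'jud_read_compre' | 38530 | Reading comprehension |
| 'leg_case_cls' | 20563 | Case classification |
| 'leg_ele_extra' | 32042 | Element extraction |
| 'leg_eve_detec' | 21289 | Event detection |
| 'op_sum' | 5251 | Public opinion summarization |
| 'exam' | 21054 | Judicial examination |
| 'sent_pred' | 11657 | Judgment prediction |
| 'sim_case_match' | 8138 | Similar case matching |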

Charlie-XIAO commented 1 year ago

Fine-tuning essentially lets the model "learn" the type of question and familiarizes it with relevant knowledge and answer formats. I'm not sure exactly what you are asking; the information each type of instruction provides is just as listed. Ideally, the tasks you feed the model are the tasks you want it to understand.

mawenju203 commented 1 year ago

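@Charlie-XIAO, here is the situation. I want to train a law-related model. The goal: a newly drafted regulation may relate to many existing legal documents and may conflict with existing law, and the model should propose revisions based on those conflicts (a syllogism-style approach seems better suited). At present I only have the legal documents themselves. The documents for each topic can be very long and exceed the token limit for model training, so we would prefer that the trained model know the content of all the laws under each topic. Then, when I supply a new regulation in a query, the model can retrieve on its own the legal content it has learned for the corresponding topic and give revision suggestions.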

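My training corpus design is as follows. First, pairs of the form (legal topic (input) ==== original text of the corresponding law, split into multiple chunks (output)); this is one-to-many, because the model's maximum token count is limited. Second, pairs of the form (original text of the law, split into multiple chunks (input) ==== legal topic (output)), similar to your training set ('leg_case_cls': 20563, case classification). Third, pairs of the form (new regulation (input) ==== suggestions based on existing legal documents (output)), also one-to-many, similar to your training set ('sent_pred': 11657, judgment prediction).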

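The logic of the data design is: classify the new regulation to obtain its topic, let the model use the topic to find the corresponding original legal text, and then give revision suggestions based on that text.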
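To make the three pair types concrete, here is a minimal Python sketch of how such a corpus could be assembled. Everything in it is an assumption for illustration: the field names `input` and `output`, the helper names, and the choice to chunk by character count rather than by tokenizer tokens. It is not the format used by DISC-Law-SFT.

```python
from typing import Dict, List


def chunk_text(text: str, max_chars: int) -> List[str]:
    """Split text into consecutive pieces of at most max_chars characters.

    Character count is a stand-in for a real token limit (assumption).
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def build_topic_pairs(topic: str, law_text: str, max_chars: int) -> List[Dict[str, str]]:
    """Build topic -> chunk and chunk -> topic pairs for one law."""
    pairs: List[Dict[str, str]] = []
    for chunk in chunk_text(law_text, max_chars):
        # One topic maps to many chunks (one-to-many).
        pairs.append({"input": topic, "output": chunk})
        # Reverse direction: a classification-style pair.
        pairs.append({"input": chunk, "output": topic})
    return pairs


def build_advice_pairs(new_rule: str, suggestions: List[str]) -> List[Dict[str, str]]:
    """Build new regulation -> suggestion pairs (also one-to-many)."""
    return [{"input": new_rule, "output": s} for s in suggestions]
```

With this layout, a law that splits into three chunks yields six topic pairs (three in each direction), and each revision suggestion for a new regulation becomes its own training pair.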

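That is why I am interested in the meaning and purpose of each category you designed, and in how to use them in my model for the corresponding learning and application.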

Charlie-XIAO commented 1 year ago

I have to admit that the proportion of data for each task in our model may not be optimal, so I would recommend trial and error here. Our technical report covers some details of the dataset design, for instance why DISC-Law-SFT-Triplet is needed. In general, though, the datasets are simply designed around the abilities we want the model to have.
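As a starting point for that kind of trial and error, the following Python sketch computes each task's share of the DISC-Law-SFT-Pair counts listed in the first comment, and shows one possible way to try a different mix by capping the larger tasks. The helper names and the capping strategy are assumptions for illustration, not something the project prescribes.

```python
from typing import Dict

# Per-task example counts as listed earlier in this thread.
COUNTS: Dict[str, int] = {
    "jud_doc_sum": 8234,
    "jud_read_compre": 38530,
    "leg_case_cls": 20563,
    "leg_ele_extra": 32042,
    "leg_eve_detec": 21289,
    "op_sum": 5251,
    "exam": 21054,
    "sent_pred": 11657,
    "sim_case_match": 8138,
}


def task_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each task's fraction of the total number of examples."""
    total = sum(counts.values())
    return {task: n / total for task, n in counts.items()}


def capped_mix(counts: Dict[str, int], cap: int) -> Dict[str, int]:
    """Hypothetical rebalancing: keep at most `cap` examples per task."""
    return {task: min(n, cap) for task, n in counts.items()}


if __name__ == "__main__":
    for task, share in sorted(task_shares(COUNTS).items(), key=lambda kv: -kv[1]):
        print(f"{task:16s} {share:6.1%}")
```

Varying `cap` (or replacing `capped_mix` with another sampling rule) gives a family of mixes to compare on a held-out evaluation set.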

mawenju203 commented 1 year ago

@Charlie-XIAO Thank you for your reply.

Charlie-XIAO commented 1 year ago

Closing as completed.

FudanDISC / DISC-LawLLM

DISC-Law-SFT-Pair specific information of each type of data #20