PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with a 🤗 awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂 Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis, etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

[Question] ernie-layout MRC: handling no-answer and multi-answer cases #3752

Closed hehuang139 closed 1 year ago

hehuang139 commented 1 year ago

How does ernie-layout run_mrc handle no-answer and multi-answer cases?

Can this be solved through data annotation, e.g. labeling all no-answer cases as 0 and splitting multi-answer cases into separate examples?

Also, is multi-span extraction supported, as in "A Multi-Type Multi-Span Network for Reading Comprehension that Requires Discrete Reasoning"?

linjieccc commented 1 year ago

Hi, @hehuang139

1) No-answer: see the sliding-window handling in postprocess_mrc at https://github.com/PaddlePaddle/PaddleNLP/blob/v2.4.2/model_zoo/ernie-layout/utils.py#L970; spans that cannot be found within a window are treated as negative examples.
2) Multi-answer: this requires changing the decoding. You can follow UIE's decoding logic and switch to sigmoid (https://github.com/PaddlePaddle/PaddleNLP/blob/v2.4.2/model_zoo/uie/model.py#L37), replacing the loss function with BCELoss accordingly.
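A minimal sketch of the suggested multi-answer decoding, assuming a generic span model that emits per-token start/end logits. The function name, threshold, and pairing heuristic here are illustrative, not PaddleNLP's actual API; the point is that sigmoid scoring lets several spans pass the cut-off instead of a single argmax span:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_multi_span(start_logits, end_logits, threshold=0.5, max_len=20):
    """UIE-style decoding: apply sigmoid per token and keep every
    start position above the threshold, pairing it with the nearest
    valid end position, so multiple spans can be returned."""
    start_prob = sigmoid(np.asarray(start_logits))
    end_prob = sigmoid(np.asarray(end_logits))
    starts = np.where(start_prob > threshold)[0]
    ends = np.where(end_prob > threshold)[0]
    spans = []
    for s in starts:
        # pair each start with the nearest end within max_len tokens
        valid = [e for e in ends if s <= e < s + max_len]
        if valid:
            spans.append((int(s), int(valid[0])))
    return spans

# Toy logits with two confident spans at [1, 2] and [6, 7]
start = np.array([-5.0, 4.0, -5.0, -5.0, -5.0, -5.0, 4.0, -5.0])
end   = np.array([-5.0, -5.0, 4.0, -5.0, -5.0, -5.0, -5.0, 4.0])
print(decode_multi_span(start, end))  # -> [(1, 2), (6, 7)]
```

Since each token position is now an independent binary decision, the training loss changes correspondingly from softmax cross-entropy over positions to binary cross-entropy (BCELoss) per token.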

hehuang139 commented 1 year ago


@linjieccc Thanks, I'll give it a try.

hehuang139 commented 1 year ago

@linjieccc I looked at the sliding-window logic. It takes 5 candidates, filters out problematic ones (e.g. start > end, over-length spans), and simply continues past anything it cannot extract; I don't see any negative prob being produced. I also noticed that prob is not a percentage: start_logit and end_logit don't look like probabilities, and the values I get are below roughly 20-30. My current approach is to treat start_logit + end_logit < 0 as no answer. Is that reasonable?

I also have a question about training. If I want the model to handle no-answer cases, not just at prediction time, does my training set need to include no-answer examples, or can the model be trained without them?

start_logits = preds[0][idx]
end_logits = preds[1][idx]

# Top-n candidate start/end positions ranked by logit value
start_indexes = self._get_best_indexes(start_logits, n_best_size)
end_indexes = self._get_best_indexes(end_logits, n_best_size)
token_is_max_context = self.features_cache["token_is_max_context"][idx]

for start_index in start_indexes:
    for end_index in end_indexes:
        # Skip spans whose start token does not have its maximum
        # context in this sliding window
        if not token_is_max_context.get(str(start_index), False):
            continue
        if end_index < start_index:
            continue
        length = end_index - start_index + 1
        if length > max_answer_length:
            continue
        prelim_predictions.append(
            self._PrelimPrediction(
                feature_index=idx,
                start_index=start_index,
                end_index=end_index,
                start_logit=start_logits[start_index],
                end_logit=end_logits[end_index]))

# Rank candidates by summed start/end logits
prelim_predictions = sorted(prelim_predictions,
                            key=lambda x: x.start_logit + x.end_logit,
                            reverse=True)
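The no-answer rule proposed above (treat start_logit + end_logit < 0 as no answer) could be bolted on after this ranking step. A hedged sketch, where the namedtuple stands in for the `_PrelimPrediction` fragment quoted here and the helper name and default threshold of 0.0 are illustrative:

```python
from collections import namedtuple

Prelim = namedtuple("Prelim", "start_index end_index start_logit end_logit")

def best_or_no_answer(prelim_predictions, null_threshold=0.0):
    """Return the top-ranked span, or None when even the best
    candidate's summed logits fall below the no-answer threshold."""
    if not prelim_predictions:
        return None
    best = max(prelim_predictions, key=lambda p: p.start_logit + p.end_logit)
    if best.start_logit + best.end_logit < null_threshold:
        return None
    return best

cands = [Prelim(3, 5, 2.1, 1.7), Prelim(8, 9, -4.0, -3.2)]
print(best_or_no_answer(cands))                       # the (3, 5) span
print(best_or_no_answer([Prelim(0, 0, -2.0, -1.5)]))  # None
```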
linjieccc commented 1 year ago

Using a threshold to decide no-answer works better; the threshold value needs to be tuned on real data. You can also construct no-answer examples for the training set.
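One way to tune the threshold empirically, as suggested, is a simple sweep over a labeled dev set. A minimal sketch with made-up scores and labels; the function name and candidate grid are illustrative:

```python
def tune_null_threshold(dev_scores, dev_has_answer, candidates):
    """Pick the no-answer threshold that maximizes dev accuracy:
    predict 'has answer' iff the best span score >= threshold."""
    def accuracy(th):
        preds = [s >= th for s in dev_scores]
        return sum(p == g for p, g in zip(preds, dev_has_answer)) / len(dev_scores)
    return max(candidates, key=accuracy)

# Toy dev set: summed start/end logits and whether an answer truly exists
scores = [5.2, 3.1, -0.4, -2.8, 4.0, -1.1]
labels = [True, True, False, False, True, False]
print(tune_null_threshold(scores, labels, candidates=[-2, -1, 0, 1, 2]))  # -> 0
```

In practice a finer grid (or the sorted dev scores themselves) would serve as the candidate thresholds.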

jyjy007 commented 1 year ago

A question: for plain-text MRC, which model is appropriate? DuReader seems like a fit, but support for it doesn't feel great and it's a bit cumbersome to use. Is there anything with a more complete Taskflow-style wrapper, like uie/docprompt?

hehuang139 commented 1 year ago

@linjieccc UIE-X's no-answer behavior doesn't seem very good. I have two entities, "项目编号" (project number) and "机构编号" (organization number); documents generally have a project number but no organization number. I built about 200 examples, including quite a few negative examples for 机构编号, but the model still extracts a 机构编号, and it uses the project number's value.

Here is my training F1. It is already 1.0 across the board, yet extraction is still wrong:

[2022-11-24 14:21:48,996] [INFO] - eval_loss: 0.00011530211486387998, eval_precision: 1.0, eval_recall: 1.0, eval_f1: 1.0, eval_runtime: 3.0387, eval_samples_per_second: 13.822, eval_steps_per_second: 1.975, epoch: 10.0
[2022-11-24 14:21:48,997] [INFO] - eval metrics
[2022-11-24 14:21:48,997] [INFO] - epoch = 10.0
[2022-11-24 14:21:48,997] [INFO] - eval_f1 = 1.0
[2022-11-24 14:21:48,997] [INFO] - eval_loss = 0.0001
[2022-11-24 14:21:48,997] [INFO] - eval_precision = 1.0
[2022-11-24 14:21:48,997] [INFO] - eval_recall = 1.0
[2022-11-24 14:21:48,997] [INFO] - eval_runtime = 0:00:03.03
[2022-11-24 14:21:48,997] [INFO] - eval_samples_per_second = 13.822
[2022-11-24 14:21:48,998] [INFO] - eval_steps_per_second = 1.975
[2022-11-24 14:21:49,000] [INFO] - Exporting inference model to ./checkpoint/model_best_ne/model
[2022-11-24 14:22:17,764] [INFO] - Inference model exported

Here is my extraction result. The 机构编号 probability is indeed a bit lower, but still around 90%:

{'机构编号': {'end': 53, 'probability': 0.902911906953598, 'start': 38, 'text': 'XMYD字025号'},
 '项目编号': {'end': 53, 'probability': 0.9910101268182814, 'start': 38, 'text': 'XMYD字025号'}}

Here is the source text, partially redacted: 2021年第三批*岸修复工程 项目编号:XMYD****字025号 二○二一年十一月2021年

Here is a negative example I constructed: {"content": "正文内容", "result_list": [], "prompt": "机构编号"}

Any suggestions for improving the no-answer behavior? Should I keep adding negative examples, or apply a threshold (and how should the threshold be chosen: by taking the maximum value)?
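For reference, negative examples in the format shown above can be generated programmatically. This sketch assumes the UIE training format from the comment (content / result_list / prompt); the helper name is hypothetical:

```python
import json

def make_negative_examples(texts, prompt):
    """Build UIE-style no-answer examples: same training format,
    but with an empty result_list so the model learns to extract
    nothing for this prompt on these texts."""
    return [{"content": t, "result_list": [], "prompt": prompt} for t in texts]

docs = ["2021年某修复工程 项目编号:XMYD字025号"]
for ex in make_negative_examples(docs, "机构编号"):
    print(json.dumps(ex, ensure_ascii=False))
```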

linjieccc commented 1 year ago


@jyjy007 For the DuReader usability issues, please open a separate feature request issue.

linjieccc commented 1 year ago


@hehuang139 How many examples do your training and dev sets each have at the moment?

hehuang139 commented 1 year ago

@linjieccc Currently train has 160 and dev has 40; among these, dev has about 8 no-answer negatives and train has 37. I even tried an extreme approach: putting all of dev into train and re-evaluating on examples the model had already seen in training, and it is still not ideal.

I suspect the semantics of these two fields are too close. From a human perspective, though, the two codes mean very different things: after all, one is a project's number and the other is an organization's number.

hehuang139 commented 1 year ago

@linjieccc Thanks for following up. I added about 10 more negative examples and the results suddenly became very good, with F1 around 97. So it seems a certain number of negative examples is needed to improve no-answer performance.

[2022-11-24 15:48:55,467] [INFO] - eval_loss: 0.00038081029197201133, eval_precision: 0.9777777777777777, eval_recall: 0.9777777777777777, eval_f1: 0.9777777777777777, eval_runtime: 3.7899, eval_samples_per_second: 13.721, eval_steps_per_second: 1.847, epoch: 10.0
[2022-11-24 15:48:55,468] [INFO] - eval metrics
[2022-11-24 15:48:55,468] [INFO] - epoch = 10.0
[2022-11-24 15:48:55,468] [INFO] - eval_f1 = 0.9778
[2022-11-24 15:48:55,468] [INFO] - eval_loss = 0.0004
[2022-11-24 15:48:55,468] [INFO] - eval_precision = 0.9778
[2022-11-24 15:48:55,468] [INFO] - eval_recall = 0.9778
[2022-11-24 15:48:55,468] [INFO] - eval_runtime = 0:00:03.78
[2022-11-24 15:48:55,468] [INFO] - eval_samples_per_second = 13.721
[2022-11-24 15:48:55,468] [INFO] - eval_steps_per_second = 1.847
[2022-11-24 15:48:55,471] [INFO] - Exporting inference model to ./checkpoint/model_best_ne/model
[2022-11-24 15:49:20,490] [INFO] - Inference model exported.