https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/information_extraction/label_studio.py: default construction of full negative examples for the validation and test sets does not take effect

PaddlePaddle / PaddleNLP

https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/information_extraction/label_studio.py: default construction of full negative examples for the validation and test sets does not take effect #4267

Closed AI-Mart closed 1 year ago

AI-Mart commented 1 year ago

Software environment

- paddlepaddle:2.4.1
- paddlepaddle-gpu: 
- paddlenlp: 2.4.7

Duplicate issues

[X] I have searched the existing issues

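Bug description

With https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/information_extraction/label_studio.py, the default construction of full negative examples for the validation and test sets does not take effect.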

Steps to reproduce & code

python ../label_studio.py \
    --label_studio_file ./data/label_studio.json \
    --save_dir ./data \
    --splits 0 1 0 \
    --task_type ext \
    --layout_analysis True

python ../label_studio.py \
    --label_studio_file ./data/label_studio.json \
    --save_dir ./data \
    --splits 1 0 0 \
    --task_type ext \
    --layout_analysis True

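Running the two commands above generates identical datasets, i.e. full negative examples are not constructed for the validation set, and the negative_ratio parameter has no effect either. Training currently needs data with full negative examples to get good results. How can I generate full-negative samples?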
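
One way to check whether a generated split contains any negative examples is to count the records whose result_list is empty. The following is a minimal sketch, assuming the output is one JSON object per line with the result_list and prompt fields mentioned later in this thread; the content field and the helper name count_negatives are assumptions for illustration, not part of label_studio.py.

```python
import json


def count_negatives(lines):
    """Return (total, negatives) for an iterable of JSON-lines records.

    A record counts as negative when its "result_list" is missing or empty.
    """
    total = 0
    negatives = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        total += 1
        if not record.get("result_list"):
            negatives += 1
    return total, negatives


# Hypothetical records standing in for the lines of a generated split file.
sample = [
    '{"content": "a", "result_list": [{"text": "x", "start": 0, "end": 1}], "prompt": "title"}',
    '{"content": "a", "result_list": [], "prompt": "date"}',
]
print(count_negatives(sample))
```

To run it on a real file, pass an open file handle instead of the sample list, for example the split file written under --save_dir. If both commands yield the same counts, the splits were processed identically.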

linjieccc commented 1 year ago

@AI-Mart Hi,

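Are you using the default dataset? The default dataset has only 10 entity labels, so negative_ratio may appear to have no effect because the number of negative examples that can be constructed is smaller than the value set for this parameter.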
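
The capping effect described here can be illustrated with a small sketch. It is not the code in label_studio.py; the function name, the seed parameter, and the sampling rule (negative_ratio times the number of positive labels, capped by the labels actually available) are assumptions chosen to show why a large negative_ratio changes nothing when the label set is small.

```python
import random


def sample_negative_labels(all_labels, positive_labels, negative_ratio, seed=0):
    """Pick negative labels for one example (illustrative, hypothetical helper).

    Candidates are the labels in the global label set that the example does
    not contain. The requested count is negative_ratio * len(positive_labels),
    but it can never exceed the number of candidates.
    """
    candidates = sorted(set(all_labels) - set(positive_labels))
    wanted = negative_ratio * len(positive_labels)
    k = min(wanted, len(candidates))
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return rng.sample(candidates, k)


labels = [f"label_{i}" for i in range(10)]  # a 10-label set, as in the default data
positives = labels[:3]

# Requesting 5 * 3 = 15 negatives, but only 7 candidate labels exist.
print(len(sample_negative_labels(labels, positives, negative_ratio=5)))
# Requesting 1 * 3 = 3 negatives, which fits within the 7 candidates.
print(len(sample_negative_labels(labels, positives, negative_ratio=1)))
```

Under these assumptions, any negative_ratio of 3 or more returns the same 7 negatives for this example, so raising the parameter further has no visible effect.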

You can verify the negative-example construction logic with the text relation extraction example: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/information_extraction/text

AI-Mart commented 1 year ago

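The default dataset and my own dataset behave the same: full negative examples are not constructed for the validation set. For example, an entity that does not exist should yield a record with an empty result. Text relation extraction cannot construct full negatives either. I previously used https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/uie/doccano.py, and that doccano script could construct full negatives on the validation set. Could this be fixed? Thanks.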

linjieccc commented 1 year ago

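I tried this with the develop code on my side and it appears to work correctly.

Could you share your label_studio.json so that we can reproduce the problem? Thanks.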

AI-Mart commented 1 year ago

data.zip

AI-Mart commented 1 year ago

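Cells that are empty in the table should also produce empty samples after conversion, to keep the training results accurate; these count as negative samples too. Currently the empty ones are not retained in the converted output. I also found that the samples built from the OCR results are not very accurate to begin with. For table recognition like mine, which is not very accurate, do you have any suggestions for improvement?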

linjieccc commented 1 year ago

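@AI-Mart Hi,

Negative-example construction depends on the label set defined at annotation time, and it needs at least two samples (when full negatives are constructed, the first sample gets negatives for the labels that the second sample contains but the first one lacks).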
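
The dependency on the label set and on having at least two samples can be sketched as follows. This is a simplified illustration, not the implementation in label_studio.py: the input layout (a content string plus a labels dict per sample) and the function name build_full_negatives are assumptions, and the label set is approximated as the union of labels seen across all samples.

```python
def build_full_negatives(samples):
    """Emit one record per (sample, label) pair over the whole label set.

    samples: list of {"content": str, "labels": {label: [span, ...]}}
    Labels a sample lacks produce records with an empty "result_list".
    """
    # Approximate the annotation label set by the labels observed in the data.
    label_set = sorted({label for s in samples for label in s["labels"]})
    records = []
    for s in samples:
        for label in label_set:
            records.append(
                {
                    "content": s["content"],
                    "prompt": label,
                    "result_list": s["labels"].get(label, []),
                }
            )
    return records


# Hypothetical annotations: the first sample lacks the "date" label.
samples = [
    {"content": "doc one", "labels": {"name": [{"text": "doc", "start": 0, "end": 3}]}},
    {
        "content": "doc two",
        "labels": {
            "name": [{"text": "doc", "start": 0, "end": 3}],
            "date": [{"text": "two", "start": 4, "end": 7}],
        },
    },
]

records = build_full_negatives(samples)
empty = [r for r in records if not r["result_list"]]
print(empty)
```

With both samples, the first one receives an empty "date" record. With only the first sample, the observed label set is just "name", so no empty record can be produced, which matches the two-sample requirement stated above.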

AI-Mart commented 1 year ago

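Thanks for the reply. I tried it and it still does not seem to work. By negative sample I mean a table cell with no value: a sample should still be generated for it, with an empty result, something like "result_list": [], "prompt": "文档标题", "image": . (The prompt 文档标题 means "document title".) When the image has no value for a field, the generated samples should include a record with an empty result, rather than omitting that record altogether.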

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale.