PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

[Bug]: UIE-X document extraction, array out of bounds (list index out of range) #6292

Open Viserion-nlper opened 1 year ago

Viserion-nlper commented 1 year ago

Software Environment

- paddlepaddle: latest
- paddlepaddle-gpu: latest
- paddlenlp: latest

Duplicate Issues

Error Description

Array index out of range.
Same as this issue: https://github.com/PaddlePaddle/PaddleNLP/issues/4656

Steps & Code to Reproduce

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/root/p4/conda/anaconda3/envs/paddle_yuyi/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/root/p4/conda/anaconda3/envs/paddle_yuyi/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/root/p4/conda/anaconda3/envs/paddle_yuyi/lib/python3.9/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 217, in _thread_loop
    batch = self._dataset_fetcher.fetch(indices,
  File "/root/p4/conda/anaconda3/envs/paddle_yuyi/lib/python3.9/site-packages/paddle/fluid/dataloader/fetcher.py", line 125, in fetch
    data.append(self.dataset[idx])
  File "/root/p4/conda/anaconda3/envs/paddle_yuyi/lib/python3.9/site-packages/paddlenlp/datasets/dataset.py", line 260, in __getitem__
    return self._transform(self.new_data[idx]) if self._transform_pipline else self.new_data[idx]
  File "/root/p4/conda/anaconda3/envs/paddle_yuyi/lib/python3.9/site-packages/paddlenlp/datasets/dataset.py", line 252, in _transform
    data = fn(data)
  File "/home/cuizhiqiang/test/PaddleNLP/applications/information_extraction/document/utils.py", line 272, in convert_example
    offset_bias = offset_mapping[q_sep_index - 1][-1] + 1
IndexError: list index out of range
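The failing statement is the last frame above: offset_mapping[q_sep_index - 1] is indexed out of range inside convert_example. A minimal, hypothetical guard (not the official fix, and using only the variable names visible in the traceback) that turns the crash into a clearer error could look like this:

```python
# Hypothetical guard around the failing line in convert_example
# (applications/information_extraction/document/utils.py).
# Only q_sep_index, offset_mapping and offset_bias are taken from the
# traceback; the check itself is a sketch, not the library's actual fix.
if not (0 < q_sep_index <= len(offset_mapping)) or not offset_mapping[q_sep_index - 1]:
    # Likely an empty or garbled sample (see the discussion below), which
    # leaves offset_mapping shorter than the computed separator position.
    raise ValueError("Malformed example: offset_mapping is too short to compute offset_bias")
offset_bias = offset_mapping[q_sep_index - 1][-1] + 1
```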

Viserion-nlper commented 1 year ago

Please deal with this as soon as possible, thanks.

Viserion-nlper commented 1 year ago

Could someone please be assigned to handle this? Thanks.

Viserion-nlper commented 1 year ago

Is anyone assigned to this?

whisky-12 commented 1 year ago

What a strange problem!

Viserion-nlper commented 1 year ago

Could someone please be assigned to handle this? Thanks.

Viserion-nlper commented 1 year ago

Could someone please be assigned to handle this? Thanks.

Viserion-nlper commented 1 year ago

Could someone please be assigned to handle this? Is nobody maintaining this repository?

Viserion-nlper commented 1 year ago

Could someone please be assigned to handle this? Is nobody maintaining this repository?

linjieccc commented 1 year ago

@Viserion-nlper Hi, please provide a way to reproduce this, and we will try to reproduce the problem.

Viserion-nlper commented 1 year ago

@linjieccc Thanks. Path: PaddleNLP/applications/information_extraction/document
Run:
python3 finetune.py \
    --device gpu \
    --logging_steps 5 \
    --save_steps 25 \
    --eval_steps 25 \
    --seed 42 \
    --model_name_or_path uie-x-base \
    --output_dir ./checkpoint/model_best \
    --train_path data/train.txt \
    --dev_path data/dev.txt \
    --max_seq_len 512 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --num_train_epochs 100 \
    --learning_rate 1e-5 \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir ./checkpoint/model_best \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end True \
    --save_total_limit

Data: https://pan.baidu.com/s/1nsZPdipEHAf3fR9j0eReRw?pwd=8888 Extraction code: 8888 (shared via Baidu Netdisk)

Error: (screenshot of the traceback)

Viserion-nlper commented 1 year ago

When preparing the data I enabled --layout_analysis True (https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/information_extraction/document#:~:text=%2D%2Dlayout_analysis%20True). Fine-tuning with the resulting train.txt then fails with the error above; without layout_analysis, fine-tuning works fine. I suspect layout_analysis leaves garbled characters in train.txt, which makes the tokenizer alignment inconsistent and triggers the error. Same as this issue: https://github.com/PaddlePaddle/PaddleNLP/issues/4656
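A quick way to test this suspicion is to scan the generated train.txt for empty or whitespace-only fields before fine-tuning. The snippet below is only a sketch: it assumes each line is a JSON object with "content" and "prompt" keys, as in the UIE-X document examples, so adjust the keys if your export uses a different schema.

```python
# Sketch: scan train.txt produced with --layout_analysis True for suspicious
# samples before fine-tuning. Assumes each line is a JSON object containing
# "content" and "prompt" fields (the exact schema of your export may differ).
import json

with open("data/train.txt", "r", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            print(f"line {lineno}: blank line")
            continue
        sample = json.loads(line)
        if not str(sample.get("content", "")).strip() or not str(sample.get("prompt", "")).strip():
            print(f"line {lineno}: empty content/prompt, likely to break convert_example")
```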

linjieccc commented 1 year ago

@Viserion-nlper I couldn't reproduce your problem on my side; training runs fine. Could you try again with the develop version of paddlenlp?

Viserion-nlper commented 1 year ago

I'll update to the develop version and try again.

Viserion-nlper commented 1 year ago

@linjieccc After updating, the problem is still there. I switched to a different dataset, which reproduces the problem; please look into it, thanks. Link: https://pan.baidu.com/s/1m-a1Z8YjskPYquZwOMiuhg?pwd=8888 Extraction code: 8888 (shared via Baidu Netdisk)

Viserion-nlper commented 1 year ago

@linjieccc Could you take a look?

linjieccc commented 1 year ago

@Viserion-nlper The data contains some empty text; you can filter it out at the data-processing stage: https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/information_extraction/document/utils.py#L151
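As a concrete illustration of that suggestion, a filter of roughly this shape could be applied to the examples before they are handed to convert_example. The field names reuse the assumption from the earlier snippet and may need adapting to the actual reader in utils.py.

```python
# Sketch of the suggested filtering step: drop examples with empty text before
# they reach convert_example. Field names ("content", "prompt") are assumed,
# not taken from utils.py, so adapt them to the real data structure if needed.
def filter_empty_examples(examples):
    kept, dropped = [], 0
    for example in examples:
        if str(example.get("content", "")).strip() and str(example.get("prompt", "")).strip():
            kept.append(example)
        else:
            dropped += 1
    if dropped:
        print(f"Dropped {dropped} empty example(s) before training")
    return kept
```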

tianchiguaixia commented 1 year ago

For cases like this, just wrap it in a try; after all, it is only a handful of unusual samples that cause it.
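One hedged way to apply that "just try" approach without editing utils.py is to wrap the transform and skip samples that still fail, for example:

```python
# Sketch of the try/except approach: wrap convert_example and skip samples
# that still raise IndexError instead of crashing the whole training run.
# convert_example is the function from the traceback; the wrapper itself is
# only an illustration, and any sample mapped to None still has to be
# filtered out of the dataset afterwards.
def safe_convert_example(example, *args, **kwargs):
    try:
        return convert_example(example, *args, **kwargs)
    except IndexError:
        print(f"Skipping malformed example: {str(example)[:80]}")
        return None
```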