PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

[Bug]: UIE-X document extraction, array out of bounds (list index out of range) #6292

Open Viserion-nlper opened 1 year ago

Viserion-nlper commented 1 year ago

Software Environment

- paddlepaddle: latest
- paddlepaddle-gpu: latest
- paddlenlp: latest

Duplicate Issues

Error Description

Array index out of range.
Same as this issue: https://github.com/PaddlePaddle/PaddleNLP/issues/4656

Steps & Code to Reproduce

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/root/p4/conda/anaconda3/envs/paddle_yuyi/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/root/p4/conda/anaconda3/envs/paddle_yuyi/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/root/p4/conda/anaconda3/envs/paddle_yuyi/lib/python3.9/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 217, in _thread_loop
    batch = self._dataset_fetcher.fetch(indices,
  File "/root/p4/conda/anaconda3/envs/paddle_yuyi/lib/python3.9/site-packages/paddle/fluid/dataloader/fetcher.py", line 125, in fetch
    data.append(self.dataset[idx])
  File "/root/p4/conda/anaconda3/envs/paddle_yuyi/lib/python3.9/site-packages/paddlenlp/datasets/dataset.py", line 260, in __getitem__
    return self._transform(self.new_data[idx]) if self._transform_pipline else self.new_data[idx]
  File "/root/p4/conda/anaconda3/envs/paddle_yuyi/lib/python3.9/site-packages/paddlenlp/datasets/dataset.py", line 252, in _transform
    data = fn(data)
  File "/home/cuizhiqiang/test/PaddleNLP/applications/information_extraction/document/utils.py", line 272, in convert_example
    offset_bias = offset_mapping[q_sep_index - 1][-1] + 1
IndexError: list index out of range
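The failing statement is the last frame above: offset_mapping[q_sep_index - 1] is indexed out of range inside convert_example. A minimal, hypothetical guard (not the official fix, and using only the variable names visible in the traceback) that turns the crash into a clearer error could look like this:

```python
# Hypothetical guard around the failing line in convert_example
# (applications/information_extraction/document/utils.py).
# Only q_sep_index, offset_mapping and offset_bias are taken from the
# traceback; the check itself is a sketch, not the library's actual fix.
if not (0 < q_sep_index <= len(offset_mapping)) or not offset_mapping[q_sep_index - 1]:
    # Likely an empty or garbled sample (see the discussion below), which
    # leaves offset_mapping shorter than the computed separator position.
    raise ValueError("Malformed example: offset_mapping is too short to compute offset_bias")
offset_bias = offset_mapping[q_sep_index - 1][-1] + 1
```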

Viserion-nlper commented 1 year ago

Please deal with this as soon as possible, thanks.

Viserion-nlper commented 1 year ago

Could someone please be assigned to handle this? Thanks.

Viserion-nlper commented 1 year ago

Is anyone assigned to this?

whisky-12 commented 1 year ago

What a strange problem!

Viserion-nlper commented 1 year ago

Could someone please be assigned to handle this? Thanks.

Viserion-nlper commented 1 year ago

Could someone please be assigned to handle this? Thanks.

Viserion-nlper commented 1 year ago

Could someone please be assigned to handle this? Is nobody maintaining this repository?

Viserion-nlper commented 1 year ago

Could someone please be assigned to handle this? Is nobody maintaining this repository?

linjieccc commented 1 year ago

@Viserion-nlper Hi, please provide a way to reproduce this, and we will try to reproduce the problem.

Viserion-nlper commented 1 year ago

@linjieccc Thanks. Path: PaddleNLP/applications/information_extraction/document
Run:
python3 finetune.py \
    --device gpu \
    --logging_steps 5 \
    --save_steps 25 \
    --eval_steps 25 \
    --seed 42 \
    --model_name_or_path uie-x-base \
    --output_dir ./checkpoint/model_best \
    --train_path data/train.txt \
    --dev_path data/dev.txt \
    --max_seq_len 512 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --num_train_epochs 100 \
    --learning_rate 1e-5 \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir ./checkpoint/model_best \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end True \
    --save_total_limit

Data: https://pan.baidu.com/s/1nsZPdipEHAf3fR9j0eReRw?pwd=8888 Extraction code: 8888 (shared via Baidu Netdisk)

Error: (screenshot of the traceback)

Viserion-nlper commented 1 year ago

When preparing the data I enabled --layout_analysis True (https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/information_extraction/document#:~:text=%2D%2Dlayout_analysis%20True). Fine-tuning with the resulting train.txt then fails with the error above; without layout_analysis, fine-tuning works fine. I suspect layout_analysis leaves garbled characters in train.txt, which makes the tokenizer alignment inconsistent and triggers the error. Same as this issue: https://github.com/PaddlePaddle/PaddleNLP/issues/4656
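A quick way to test this suspicion is to scan the generated train.txt for empty or whitespace-only fields before fine-tuning. The snippet below is only a sketch: it assumes each line is a JSON object with "content" and "prompt" keys, as in the UIE-X document examples, so adjust the keys if your export uses a different schema.

```python
# Sketch: scan train.txt produced with --layout_analysis True for suspicious
# samples before fine-tuning. Assumes each line is a JSON object containing
# "content" and "prompt" fields (the exact schema of your export may differ).
import json

with open("data/train.txt", "r", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            print(f"line {lineno}: blank line")
            continue
        sample = json.loads(line)
        if not str(sample.get("content", "")).strip() or not str(sample.get("prompt", "")).strip():
            print(f"line {lineno}: empty content/prompt, likely to break convert_example")
```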

linjieccc commented 1 year ago

@Viserion-nlper I couldn't reproduce your problem on my side; training runs fine. Could you try again with the develop version of paddlenlp?

Viserion-nlper commented 1 year ago

I'll update to the develop version and try again.

Viserion-nlper commented 1 year ago

@linjieccc After updating, the problem is still there. I switched to a different dataset, which reproduces the problem; please look into it, thanks. Link: https://pan.baidu.com/s/1m-a1Z8YjskPYquZwOMiuhg?pwd=8888 Extraction code: 8888 (shared via Baidu Netdisk)

Viserion-nlper commented 1 year ago

@linjieccc Could you take a look?

linjieccc commented 1 year ago

@Viserion-nlper The data contains some empty text; you can filter it out at the data-processing stage: https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/information_extraction/document/utils.py#L151
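As a concrete illustration of that suggestion, a filter of roughly this shape could be applied to the examples before they are handed to convert_example. The field names reuse the assumption from the earlier snippet and may need adapting to the actual reader in utils.py.

```python
# Sketch of the suggested filtering step: drop examples with empty text before
# they reach convert_example. Field names ("content", "prompt") are assumed,
# not taken from utils.py, so adapt them to the real data structure if needed.
def filter_empty_examples(examples):
    kept, dropped = [], 0
    for example in examples:
        if str(example.get("content", "")).strip() and str(example.get("prompt", "")).strip():
            kept.append(example)
        else:
            dropped += 1
    if dropped:
        print(f"Dropped {dropped} empty example(s) before training")
    return kept
```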

tianchiguaixia commented 1 year ago

For cases like this, just wrap it in a try; after all, it is only a handful of unusual samples that cause it.
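One hedged way to apply that "just try" approach without editing utils.py is to wrap the transform and skip samples that still fail, for example:

```python
# Sketch of the try/except approach: wrap convert_example and skip samples
# that still raise IndexError instead of crashing the whole training run.
# convert_example is the function from the traceback; the wrapper itself is
# only an illustration, and any sample mapped to None still has to be
# filtered out of the dataset afterwards.
def safe_convert_example(example, *args, **kwargs):
    try:
        return convert_example(example, *args, **kwargs)
    except IndexError:
        print(f"Skipping malformed example: {str(example)[:80]}")
        return None
```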