PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

[Bug]: The FUNSD and xfund_zh datasets provided with the source code are missing fields, causing multiple errors when reproducing ERNIE-layout fine-tuning #6817

Open Mercurialzs opened 1 year ago

Mercurialzs commented 1 year ago

Software environment

- paddlepaddle: -
- paddlepaddle-gpu: 2.5.1
- paddlenlp: 2.6.0.post0

Duplicate issues

Error description


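I am reproducing ERNIE-layout fine-tuning by following the README. Because the machine is offline, I downloaded the FUNSD and xfund_zh datasets with wget.
So far three problems have come up with the FUNSD and xfund_zh datasets:
(1) Running the original code with the given command line raises this error: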
Traceback (most recent call last):
  File "run_ner.py", line 235, in <module>
    main(filename)
  File "run_ner.py", line 75, in main
    label_list, label_to_id = get_label_ld(train_ds["qas"], scheme=data_args.pattern.split("-")[1])
  File "/home/.../data/model/ERNIE-layout/utils.py", line 135, in get_label_ld
    for key in qa["question"]:
TypeError: list indices must be integers or slices, not str
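My current guess is that this is a dataset format problem, so I modified get_label_ld in utils.py (the code is in the next section), which works around it for now.
(2) After the change for (1), this error appears: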
Traceback (most recent call last):
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1347, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3474, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/.../data/model/ERNIE-layout/utils.py", line 270, in preprocess_ner
    packed_QA = zip(qas["question"], qas["answers"])
TypeError: list indices must be integers or slices, not str
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_ner.py", line 235, in <module>
    main(filename)
  File "run_ner.py", line 121, in main
    train_dataset = train_ds.map(
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3189, in map
    for rank, done, content in iflatmap_unordered(
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1387, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1387, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
TypeError: list indices must be integers or slices, not str
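My guess is again a data format problem, so I tried modifying preprocess_ner in utils.py (shown in the next section), which works around it for now.
(3) After the changes for the first two problems, running again raises: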
Traceback (most recent call last):
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1347, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3474, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/.../data/model/ERNIE-layout/utils.py", line 403, in preprocess_ner
    feature_id = examples["name"][example_idx] + "__" + str(examples["page_no"][example_idx])
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 270, in __getitem__
    value = self.data[key]
KeyError: 'page_no'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_ner.py", line 235, in <module>
    main(filename)
  File "run_ner.py", line 121, in main
    train_dataset = train_ds.map(
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3189, in map
    for rank, done, content in iflatmap_unordered(
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1387, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1387, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
KeyError: 'page_no'
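I checked the FUNSD and xfund_zh datasets and found that no sample has a 'page_no' field, so I suspect the datasets currently available for download are not the final versions usable for fine-tuning.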

Could you please check whether the datasets are faulty (the download URLs are https://bj.bcebos.com/paddlenlp/datasets/funsd.tar.gz and https://bj.bcebos.com/paddlenlp/datasets/xfund_zh.tar.gz), or provide the data preprocessing code? Thanks!
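
For anyone who wants to repeat the field check, here is a minimal sketch of how it could be done. It assumes the samples have already been loaded into a list of dicts; the on-disk layout of the archives is not shown in this issue, so loading is left out. The helper name `count_missing_fields` and the toy records are hypothetical. Only the field names `name` and `page_no` come from the traceback above.

```python
from collections import Counter


def count_missing_fields(records, required_fields):
    """Return {field: number of records that lack that field}."""
    missing = Counter()
    for record in records:
        for field in required_fields:
            if field not in record:
                missing[field] += 1
    return dict(missing)


# Toy records standing in for samples loaded from the extracted archive
# (hypothetical values; real samples carry more fields).
samples = [
    {"name": "doc_0", "qas": []},
    {"name": "doc_1", "qas": []},
]

# preprocess_ner reads both "name" and "page_no" (utils.py line 403 above).
report = count_missing_fields(samples, ["name", "page_no"])
print(report)
```

A field whose count equals the number of samples is absent from every record, which is the situation described above for 'page_no'.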

Steps to reproduce & code

Error 1 screenshot: image. Change 1: image. Error 2 screenshot: image. Change 2: image image. Error 3 screenshot: image.
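
The actual changes are only available as screenshots, so they cannot be reproduced here. As a rough illustration of the kind of workaround errors 1 and 2 point to, the sketch below assumes that `qas` arrives as a list of per-question dicts rather than the dict of lists that `qas["question"]` and `qas["answers"]` expect. That layout is an assumption; it is not confirmed anywhere in this issue. The helper name `to_columnar` and the sample values are hypothetical.

```python
def to_columnar(qas):
    """Return qas as a dict of lists, whichever of the two layouts it arrives in."""
    if isinstance(qas, dict):
        # Already columnar: {"question": [...], "answers": [...]}
        return qas
    # Row layout (assumed): [{"question": ..., "answers": ...}, ...]
    return {
        "question": [qa["question"] for qa in qas],
        "answers": [qa["answers"] for qa in qas],
    }


# Hypothetical sample in the assumed row layout.
row_layout = [
    {"question": "HEADER", "answers": [{"text": "Invoice"}]},
    {"question": "QUESTION", "answers": [{"text": "Date"}]},
]

columnar = to_columnar(row_layout)

# String indexing now works the way the two failing lines expect.
unique_keys = set(columnar["question"])
packed_qa = list(zip(columnar["question"], columnar["answers"]))
```

Normalizing once at the top of a function keeps the rest of its body unchanged, but it only hides a format mismatch; it does not supply fields such as 'page_no' that are missing from the data.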

PancakeAwesome commented 12 months ago

I ran into the same problem. Could the developers please help resolve it? Thanks. #7145