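Following the README, I am trying to reproduce ERNIE-layout fine-tuning. Because the machine is in an offline environment, I downloaded the FUNSD and xfund_zh datasets with wget.
Three problems currently occur on the FUNSD and xfund_zh datasets:
(1) Running the original code with the command-line instructions from the README raises this error: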
Traceback (most recent call last):
File "run_ner.py", line 235, in <module>
main(filename)
File "run_ner.py", line 75, in main
label_list, label_to_id = get_label_ld(train_ds["qas"], scheme=data_args.pattern.split("-")[1])
File "/home/.../data/model/ERNIE-layout/utils.py", line 135, in get_label_ld
for key in qa["question"]:
TypeError: list indices must be integers or slices, not str
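My current guess is that this is a dataset format problem, so I adjusted get_label_ld in utils.py (the change is shown in the screenshots section below), which works around the error for now.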
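One way to test the format guess is to inspect how the `qas` field of a single sample is laid out before changing `get_label_ld`. The sketch below is only an illustration under assumptions: the helper `describe_qas_layout` is hypothetical (it is not part of utils.py), and the two toy samples are made up rather than taken from FUNSD or xfund_zh. It distinguishes a dict of lists, which `qa["question"]` expects, from a list of dicts, which would produce the TypeError above.

```python
def describe_qas_layout(qas):
    """Return a short label describing how one sample's `qas` field is laid out."""
    if isinstance(qas, dict):
        # Columnar layout: qas["question"] works.
        return "dict-of-lists, keys=" + ",".join(sorted(qas))
    if isinstance(qas, list):
        if qas and all(isinstance(item, dict) for item in qas):
            # Row-wise layout: qas["question"] raises TypeError.
            keys = sorted({key for item in qas for key in item})
            return "list-of-dicts, keys=" + ",".join(keys)
        return "list (%d items)" % len(qas)
    return type(qas).__name__


# Toy samples, made up for illustration only.
columnar = {"question": ["Name"], "answers": [["Alice"]]}
row_wise = [{"question": "Name", "answers": ["Alice"]}]

print(describe_qas_layout(columnar))
print(describe_qas_layout(row_wise))
```

Calling it on `train_ds["qas"][0]` would show which layout the downloaded files actually produce.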
(2) After changing the code for (1), the following error occurs:
Traceback (most recent call last):
File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1347, in _write_generator_to_queue
for i, result in enumerate(func(**kwargs)):
File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3474, in _map_single
batch = apply_function_on_filtered_inputs(
File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "/home/.../data/model/ERNIE-layout/utils.py", line 270, in preprocess_ner
packed_QA = zip(qas["question"], qas["answers"])
TypeError: list indices must be integers or slices, not str
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "run_ner.py", line 235, in <module>
main(filename)
File "run_ner.py", line 121, in main
train_dataset = train_ds.map(
File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3189, in map
for rank, done, content in iflatmap_unordered(
File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1387, in iflatmap_unordered
[async_result.get(timeout=0.05) for async_result in async_results]
File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1387, in <listcomp>
[async_result.get(timeout=0.05) for async_result in async_results]
File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/multiprocess/pool.py", line 771, in get
raise self._value
TypeError: list indices must be integers or slices, not str
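I again suspect a data format problem, so I tried modifying preprocess_ner in utils.py (the change is shown in the screenshots section below), which works around the error for now.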
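A workaround of this kind could look roughly like the sketch below, which pairs questions with answers regardless of whether `qas` is a dict of lists or a list of dicts. This is a hedged illustration, not the actual change made to `preprocess_ner`: the helper name `iter_qa_pairs` is hypothetical, the field names `question` and `answers` are taken from the traceback, and the sample data is made up.

```python
def iter_qa_pairs(qas):
    """Yield (question, answers) pairs from either layout of `qas`."""
    if isinstance(qas, dict):
        # Columnar layout, as the original zip(qas["question"], qas["answers"]) expects.
        yield from zip(qas["question"], qas["answers"])
    else:
        # Row-wise layout: a list with one dict per question.
        for item in qas:
            yield item["question"], item["answers"]


# Toy data, made up for illustration only.
columnar = {"question": ["Name", "Date"], "answers": [["Alice"], ["2020"]]}
row_wise = [
    {"question": "Name", "answers": ["Alice"]},
    {"question": "Date", "answers": ["2020"]},
]

# Both layouts produce the same pairs.
print(list(iter_qa_pairs(columnar)) == list(iter_qa_pairs(row_wise)))
```

Masking the mismatch this way only hides the symptom; if the layout differs from what the code expects, other fields may differ too.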
(3) After modifying the code for the first two problems, running again raises this error:
Traceback (most recent call last):
File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1347, in _write_generator_to_queue
for i, result in enumerate(func(**kwargs)):
File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3474, in _map_single
batch = apply_function_on_filtered_inputs(
File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "/home/.../data/model/ERNIE-layout/utils.py", line 403, in preprocess_ner
feature_id = examples["name"][example_idx] + "__" + str(examples["page_no"][example_idx])
File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 270, in __getitem__
value = self.data[key]
KeyError: 'page_no'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "run_ner.py", line 235, in <module>
main(filename)
File "run_ner.py", line 121, in main
train_dataset = train_ds.map(
File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3189, in map
for rank, done, content in iflatmap_unordered(
File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1387, in iflatmap_unordered
[async_result.get(timeout=0.05) for async_result in async_results]
File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1387, in <listcomp>
[async_result.get(timeout=0.05) for async_result in async_results]
File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/multiprocess/pool.py", line 771, in get
raise self._value
KeyError: 'page_no'
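After checking the FUNSD and xfund_zh datasets, I found that none of the samples contain a 'page_no' field, so I suspect the datasets I downloaded are not the final versions intended for fine-tuning.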
Could you please check whether the datasets have a problem (the download URLs are https://bj.bcebos.com/paddlenlp/datasets/funsd.tar.gz and https://bj.bcebos.com/paddlenlp/datasets/xfund_zh.tar.gz), or provide the data preprocessing code? Thanks!
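The missing-field check described above can be made reproducible with a small counter over the loaded samples. The sketch below is an assumption-laden illustration: `count_missing_fields` is a hypothetical helper, the field names `name`, `qas`, and `page_no` are the ones that appear in the tracebacks, and the toy samples are invented. How the samples are read from the downloaded archives is left out, since that depends on their on-disk format.

```python
from collections import Counter


def count_missing_fields(samples, required):
    """Count, per required field, how many samples lack that field."""
    missing = Counter()
    for sample in samples:
        for key in required:
            if key not in sample:
                missing[key] += 1
    return missing


# Toy samples, made up for illustration only.
toy_samples = [
    {"name": "doc_0", "qas": []},
    {"name": "doc_1", "qas": [], "page_no": 0},
]

report = count_missing_fields(toy_samples, ["name", "qas", "page_no"])
print(dict(report))
```

A count equal to the dataset size for `page_no` would confirm that the field is absent everywhere rather than in a few broken samples.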
Steps to reproduce & code
[Screenshots: error 1, modification 1, error 2, modification 2, error 3]