Closed zcg-joker closed 1 year ago
Running the call on its own in a notebook does not raise an error; it returns normally.
output=tokenizer(text=['被害人'],text_pair=['经审理查明,2012年9月13日5时许,'],truncation=True,
max_seq_len=32,
pad_to_max_seq_len=True,
return_attention_mask=True,
return_position_ids=True,
return_dict=False,
return_offsets_mapping=True)
print(output)
[{'input_ids': [101, 41979, 102, 3845, 15803, 21925, 117, 43161, 1625, 130, 2334, 5681, 2243, 126, 2251, 4611, 117, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'offset_mapping': [(0, 0), (0, 3), (0, 0), (0, 1), (1, 3), (3, 5), (5, 6), (6, 10), (10, 11), (11, 12), (12, 13), (13, 15), (15, 16), (16, 17), (17, 18), (18, 19), (19, 20), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0)], 'position_ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}]
@zcg-joker Can you provide an example that reproduces the problem reliably? The description above is not enough to locate the actual cause.
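I hit this error while running the end-to-end question-answering example in pipelines. The environment is a fresh Anaconda virtual environment with python=3.7.13; paddlepaddle-gpu was installed with the conda-based command from the official website. Hardware and driver versions: cuda: 11.4 nvidia-driver: 470.141.03 GPU: RTX-3090
Python package versions in the environment: python=3.7.13 paddlepaddle-gpu=2.4.0.post112 paddlenlp=2.4.4 cudatoolkit=11.2.2 conda-forge cudnn=8.2.1.32 conda-forge
Running the command from section 3.3 ("one-click experience of the question-answering system") of https://github.com/PaddlePaddle/PaddleNLP/tree/develop/pipelines/examples/question-answering produces the following error: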
python examples/question-answering/dense_qa_example.py --device gpu
INFO - pipelines.document_stores.faiss - document_cnt:1318 embedding_cnt:1318
INFO - pipelines.utils.common_utils - Using devices: PLACE(GPU:0)
INFO - pipelines.utils.common_utils - Number of GPUs: 1
[2022-11-30 15:48:04,021] [ INFO] - Already cached /home/ps/.paddlenlp/models/rocketqa-zh-dureader-query-encoder/rocketqa_zh_dureader_query_encoder.pdparams
W1130 15:48:04.025661 954735 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.4, Runtime API Version: 11.2
W1130 15:48:04.029268 954735 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2022-11-30 15:48:07,330] [ INFO] - Already cached /home/ps/.paddlenlp/models/rocketqa-zh-dureader-query-encoder/rocketqa_zh_dureader_query_encoder.pdparams
[2022-11-30 15:48:08,133] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'rocketqa-zh-dureader-query-encoder'.
[2022-11-30 15:48:08,134] [ INFO] - Already cached /home/ps/.paddlenlp/models/rocketqa-zh-dureader-query-encoder/rocketqa-zh-dureader-vocab.txt
[2022-11-30 15:48:08,144] [ INFO] - tokenizer config file saved in /home/ps/.paddlenlp/models/rocketqa-zh-dureader-query-encoder/tokenizer_config.json
[2022-11-30 15:48:08,145] [ INFO] - Special tokens file saved in /home/ps/.paddlenlp/models/rocketqa-zh-dureader-query-encoder/special_tokens_map.json
[2022-11-30 15:48:08,145] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'rocketqa-zh-dureader-query-encoder'.
[2022-11-30 15:48:08,145] [ INFO] - Already cached /home/ps/.paddlenlp/models/rocketqa-zh-dureader-query-encoder/rocketqa-zh-dureader-vocab.txt
[2022-11-30 15:48:08,156] [ INFO] - tokenizer config file saved in /home/ps/.paddlenlp/models/rocketqa-zh-dureader-query-encoder/tokenizer_config.json
[2022-11-30 15:48:08,156] [ INFO] - Special tokens file saved in /home/ps/.paddlenlp/models/rocketqa-zh-dureader-query-encoder/special_tokens_map.json
INFO - pipelines.utils.logger - Logged parameters:
{'processor': 'TextSimilarityProcessor', 'tokenizer': 'NoneType', 'max_seq_len': '0', 'dev_split': '0.1'}
INFO - pipelines.utils.common_utils - Using devices: PLACE(GPU:0)
INFO - pipelines.utils.common_utils - Number of GPUs: 1
Loading Parameters from:rocketqa-zh-dureader-cross-encoder
[2022-11-30 15:48:08,157] [ INFO] - Already cached /home/ps/.paddlenlp/models/rocketqa-zh-dureader-cross-encoder/rocketqa_zh_dureader_cross_encoder.pdparams
[2022-11-30 15:48:08,933] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'rocketqa-zh-dureader-cross-encoder'.
[2022-11-30 15:48:08,933] [ INFO] - Already cached /home/ps/.paddlenlp/models/rocketqa-zh-dureader-cross-encoder/rocketqa-zh-dureader-vocab.txt
[2022-11-30 15:48:08,944] [ INFO] - tokenizer config file saved in /home/ps/.paddlenlp/models/rocketqa-zh-dureader-cross-encoder/tokenizer_config.json
[2022-11-30 15:48:08,944] [ INFO] - Special tokens file saved in /home/ps/.paddlenlp/models/rocketqa-zh-dureader-cross-encoder/special_tokens_map.json
INFO - pipelines.utils.common_utils - Using devices: PLACE(GPU:0)
INFO - pipelines.utils.common_utils - Number of GPUs: 1
[2022-11-30 15:48:08,946] [ INFO] - We are using <class 'paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering'> to load 'ernie-gram-zh-finetuned-dureader-robust'.
[2022-11-30 15:48:08,946] [ INFO] - Already cached /home/ps/.paddlenlp/models/ernie-gram-zh-finetuned-dureader-robust/model_state.pdparams
[2022-11-30 15:48:09,844] [ INFO] - We are using <class 'paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer'> to load 'ernie-gram-zh-finetuned-dureader-robust'.
[2022-11-30 15:48:09,844] [ INFO] - Already cached /home/ps/.paddlenlp/models/ernie-gram-zh-finetuned-dureader-robust/vocab.txt
[2022-11-30 15:48:09,854] [ INFO] - tokenizer config file saved in /home/ps/.paddlenlp/models/ernie-gram-zh-finetuned-dureader-robust/tokenizer_config.json
[2022-11-30 15:48:09,855] [ INFO] - Special tokens file saved in /home/ps/.paddlenlp/models/ernie-gram-zh-finetuned-dureader-robust/special_tokens_map.json
INFO - pipelines.utils.logger - Logged parameters:
{'processor': 'SquadProcessor', 'tokenizer': 'ErnieGramTokenizer', 'max_seq_len': '256', 'dev_split': '0'}
Traceback (most recent call last):
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/pipelines/base.py", line 464, in run
"component"]._dispatch_run(**node_input)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/nodes/base.py", line 130, in _dispatch_run
return self._dispatch_run_general(self.run, **kwargs)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/nodes/base.py", line 176, in _dispatch_run_general
output, stream = run_method(**run_inputs, **run_params)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/nodes/reader/base.py", line 109, in run
results = predict(query=query, documents=documents, top_k=top_k)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/nodes/reader/base.py", line 147, in wrapper
ret = fn(*args, **kwargs)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/nodes/reader/ernie_dureader.py", line 227, in predict
dicts, indices=indices, return_baskets=True)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/data_handler/processor.py", line 305, in dataset_from_dicts
indices)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/utils/tokenization.py", line 63, in tokenize_batch_question_answering
add_special_tokens=False)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 2484, in batch_encode
**kwargs,
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 1147, in _batch_encode_plus
**kwargs)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 1295, in _batch_prepare_for_model
**kwargs)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 2836, in prepare_for_model
token_offset_mapping = self.get_offset_mapping(text)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 1404, in get_offset_mapping
return self._get_bert_like_offset_mapping(text)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 1371, in _get_bert_like_offset_mapping
start = text[offset:].index(token) + offset
ValueError: substring not found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "examples/question-answering/dense_qa_example.py", line 132, in <module>
dense_qa_pipeline()
File "examples/question-answering/dense_qa_example.py", line 115, in dense_qa_pipeline
prediction = pipe.run(query="北京市有多少个行政区?", params=pipeline_params)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/pipelines/standard_pipelines.py", line 229, in run
output = self.pipeline.run(query=query, params=params, debug=debug)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/pipelines/base.py", line 468, in run
f"Exception while running node `{node_id}` with input `{node_input}`: {e}, full stack trace: {tb}"
Exception: Exception while running node `Reader` with input `{'documents': [<Document: {'content': '北京(Beijing),简称“京”,古称燕京、北平,是中华人民共和国的首都、直辖市、国家中心城市、超大城市,国务院批复确定的中国政治中心、文化中心、国际交往中心、科技创新中心,截至2020年,全市下辖16个区,总面积16410.54平方千米。2021年末,北京市常住人口2188.6万人,比上年末减少0.4万人。2021年,北京市全年实现地区生产总值40269.6亿元,按不变价格计算,比上年增长8.5%。全市人均地区生产总值为18.4万元。', 'content_type': 'text', 'score': 0.013789152, 'meta': {'name': '298.txt', 'vector_id': '1132'}, 'embedding': None, 'id': 'dcae69b2307e76f2e8f28f1ff4108a17', 'id_hash_keys': None, '__pydantic_initialised__': True, 'ann_score': 0.9838276330088003, 'rank_score': 0.013789152}>], 'root_node': 'Query', 'params': {'Retriever': {'top_k': 50}, 'Ranker': {'top_k': 1}, 'Reader': {'top_k': 1}}, 'query': '北京市有多少个行政区?', 'node_id': 'Reader'}`: substring not found, full stack trace: Traceback (most recent call last):
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/pipelines/base.py", line 464, in run
"component"]._dispatch_run(**node_input)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/nodes/base.py", line 130, in _dispatch_run
return self._dispatch_run_general(self.run, **kwargs)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/nodes/base.py", line 176, in _dispatch_run_general
output, stream = run_method(**run_inputs, **run_params)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/nodes/reader/base.py", line 109, in run
results = predict(query=query, documents=documents, top_k=top_k)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/nodes/reader/base.py", line 147, in wrapper
ret = fn(*args, **kwargs)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/nodes/reader/ernie_dureader.py", line 227, in predict
dicts, indices=indices, return_baskets=True)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/data_handler/processor.py", line 305, in dataset_from_dicts
indices)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/utils/tokenization.py", line 63, in tokenize_batch_question_answering
add_special_tokens=False)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 2484, in batch_encode
**kwargs,
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 1147, in _batch_encode_plus
**kwargs)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 1295, in _batch_prepare_for_model
**kwargs)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 2836, in prepare_for_model
token_offset_mapping = self.get_offset_mapping(text)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 1404, in get_offset_mapping
return self._get_bert_like_offset_mapping(text)
File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 1371, in _get_bert_like_offset_mapping
start = text[offset:].index(token) + offset
ValueError: substring not found
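For readers following the traceback: its last frame looks each token up literally in the remaining text with `str.index`. The sketch below is a simplified stand-in for that lookup, not PaddleNLP's actual code; the function name `naive_offset_mapping` and the sample tokens are assumptions for illustration. It shows one plausible way the error can arise: a token such as `[UNK]` (which does appear in a tokenization later in this thread) is not a substring of the input, so `index` raises.

```python
def naive_offset_mapping(text, tokens):
    """Simplified, hypothetical stand-in for the literal token lookup."""
    offsets = []
    offset = 0
    for token in tokens:
        # str.index raises ValueError when the token is not literally in the text.
        start = text[offset:].index(token) + offset
        end = start + len(token)
        offsets.append((start, end))
        offset = end
    return offsets


text = "台湾obu人民币"
# Assumed tokenization: "obu" is out of vocabulary and becomes "[UNK]".
tokens = ["台", "湾", "[UNK]", "人", "民", "币"]

error_message = None
try:
    naive_offset_mapping(text, tokens)
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)
```

Whether this is the exact trigger in the pipeline run above is not confirmed by the log; normalization differences between the text and the tokens could fail the same lookup.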
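Same problem here; any unusual character triggers the error.
The scenario in which I hit the problem is similar to this one.
I just ran into this as well: after upgrading to 2.4.4 the same error is raised, whereas 2.4.2 did not raise it.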
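@noobexplore @zcg-joker @Deeeerek I am currently opening a PR to fix this problem, but I am not sure it covers all of your scenarios. Could you please post your scenario data, with a short piece of code that describes the problem, like the code in https://github.com/PaddlePaddle/PaddleNLP/issues/3985 :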
from paddlenlp.transformers import BertModel, BertTokenizer, ErnieTokenizer
tokenizer = ErnieTokenizer.from_pretrained("ernie-3.0-base-zh")
sentence = "以σπδ来描述,包括一个σ键、两个π键和一个δ键"
print(tokenizer.get_offset_mapping(sentence))
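This PR is under urgent development. If you have any questions, feel free to discuss them here; if my testing confirms that this is a bug, I will fix it specifically. Looking forward to your replies.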
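When the failing text is buried in a larger dataset, a small helper can narrow it down to the inputs worth posting. This is only a sketch: `find_failing_inputs` and `fake_get_offset_mapping` are hypothetical names, and the stand-in function merely imitates the error so the example runs without PaddleNLP installed.

```python
def find_failing_inputs(texts, fn):
    """Return (text, error message) pairs for inputs on which fn raises ValueError."""
    failures = []
    for text in texts:
        try:
            fn(text)
        except ValueError as err:
            failures.append((text, str(err)))
    return failures


def fake_get_offset_mapping(text):
    """Hypothetical stand-in: pretends ASCII letters are out of vocabulary."""
    if any(ch.isascii() and ch.isalpha() for ch in text):
        raise ValueError("substring not found")
    return [(i, i + 1) for i in range(len(text))]


samples = ["北京市有多少个行政区?", "台湾obu人民币业务今日启航"]
failures = find_failing_inputs(samples, fake_get_offset_mapping)
for text, message in failures:
    print(repr(text), "->", message)
```

With a real tokenizer, `tokenizer.get_offset_mapping` (the method called in the snippet above) could be passed in place of the stand-in.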
@wj-Mcat
from paddlenlp.transformers import AutoTokenizer
test_str = "'台湾obu人民币业务今日启航'"
tokenizer = AutoTokenizer.from_pretrained("uie-base")
print(tokenizer.get_offset_mapping(test_str))
The above is my scenario data. The error is as follows:
Traceback (most recent call last):
File "/home/xtl/github_download/PaddleNLP/model_zoo/uie/finetune.py", line 292, in <module>
print(tokenizer.get_offset_mapping(test_str))
File "/home/xtl/github_download/PaddleNLP/paddlenlp/transformers/tokenizer_utils.py", line 1404, in get_offset_mapping
return self._get_bert_like_offset_mapping(text)
File "/home/xtl/github_download/PaddleNLP/paddlenlp/transformers/tokenizer_utils.py", line 1371, in _get_bert_like_offset_mapping
start = text[offset:].index(token) + offset
ValueError: substring not found
@noobexplore Using the code from the PR branch above, https://github.com/PaddlePaddle/PaddleNLP/pull/4003 , resolves your problem:
from paddlenlp.transformers import AutoTokenizer
test_str = "台湾obu人民币业务今日启航"
tokenizer = AutoTokenizer.from_pretrained("ernie-1.0-base-zh")
tokens = tokenizer.tokenize(test_str, add_special_tokens = False)
print(tokens)
# ['台', '湾', '[UNK]', '人', '民', '币', '业', '务', '今', '日', '启', '航']
offset_mappings = tokenizer.get_offset_mapping(test_str)
print(offset_mappings)
# [(0, 1), (1, 2), (2, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10), (10, 11), (11, 12), (12, 13), (13, 14)]
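To illustrate how a span such as `(2, 5)` can be assigned to `[UNK]`, here is a minimal pure-Python sketch. It is not the implementation in the PR; the function name and the heuristic (an unknown token covers the text up to the next token that can be located literally, and in a run of consecutive unknown tokens the first one takes the whole gap) are assumptions made for illustration.

```python
def tolerant_offset_mapping(text, tokens, unk_token="[UNK]"):
    """Hypothetical offset mapping that does not raise on unknown tokens."""

    def strip_prefix(token):
        # WordPiece continuation tokens carry a "##" prefix that is not in the text.
        return token[2:] if token.startswith("##") else token

    offsets = []
    offset = 0
    for i, token in enumerate(tokens):
        if token == unk_token:
            # The unknown span ends where the next locatable token begins,
            # or at the end of the text if there is none.
            end = len(text)
            for later in tokens[i + 1:]:
                if later != unk_token:
                    found = text.find(strip_prefix(later), offset)
                    if found != -1:
                        end = found
                    break
            start = offset
            # Trim surrounding whitespace from the unknown span.
            while start < end and text[start].isspace():
                start += 1
            while end > start and text[end - 1].isspace():
                end -= 1
        else:
            piece = strip_prefix(token)
            found = text.find(piece, offset)
            if found == -1:
                # Fall back to an empty span instead of raising.
                start = end = offset
            else:
                start, end = found, found + len(piece)
        offsets.append((start, end))
        offset = end
    return offsets


test_str = "台湾obu人民币业务今日启航"
tokens = ["台", "湾", "[UNK]", "人", "民", "币", "业", "务", "今", "日", "启", "航"]
result = tolerant_offset_mapping(test_str, tokens)
print(result)
```

On this input the sketch yields the same spans as the output quoted above; inputs with case folding or other normalization would need extra handling that is not shown here.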
Please follow the PR above for the latest updates.
OK thanks
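@wj-Mcat Is there currently a solution for this problem in version 2.4.4?
@Viserion-nlper Version 2.4.4 has already been released; this feature is part of version 2.4.5, which will be released this Friday.
So you can wait until Friday evening to download the latest version, or clone the develop branch directly to get the latest feature.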
This issue is stale because it has been open for 60 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Please state your question
The error is as follows: