[Question] 使用tokenizer报错

zcg-joker commented 1 year ago

请提出你的问题

使用：encoded_inputs = tokenizer(text=[example["prompt"]],
                           text_pair=[example["content"]],
                           truncation=True,
                           max_seq_len=max_seq_len,
                           pad_to_max_seq_len=True,
                           return_attention_mask=True,
                           return_position_ids=True,
                           return_dict=False,
                           return_offsets_mapping=True)

报错如下：

  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 2278, in __call__
    **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 2484, in batch_encode
    **kwargs,
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 1147, in _batch_encode_plus
    **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 1295, in _batch_prepare_for_model
    **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 2838, in prepare_for_model
    text_pair) if text_pair is not None else None
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 1404, in get_offset_mapping
    return self._get_bert_like_offset_mapping(text)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 1371, in _get_bert_like_offset_mapping
    start = text[offset:].index(token) + offset
ValueError: substring not found

zcg-joker commented 1 year ago

单独拿出来在notebook里使用就不会报错，能正常返回。

output=tokenizer(text=['被害人'],text_pair=['经审理查明,2012年9月13日5时许,'],truncation=True,
                               max_seq_len=32,
                               pad_to_max_seq_len=True,
                               return_attention_mask=True,
                               return_position_ids=True,
                               return_dict=False,
                               return_offsets_mapping=True)

print(output)

[{'input_ids': [101, 41979, 102, 3845, 15803, 21925, 117, 43161, 1625, 130, 2334, 5681, 2243, 126, 2251, 4611, 117, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'offset_mapping': [(0, 0), (0, 3), (0, 0), (0, 1), (1, 3), (3, 5), (5, 6), (6, 10), (10, 11), (11, 12), (12, 13), (13, 15), (15, 16), (16, 17), (17, 18), (18, 19), (19, 20), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0)], 'position_ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}]

wj-Mcat commented 1 year ago

@zcg-joker 你能给出稳定复现问题的 example 吗？通过上面的描述没办法找到具体问题所在。

ChristmasLatte commented 1 year ago

@zcg-joker 你能给出稳定复现问题的 example 吗？通过上面的描述没办法找到具体问题所在。

我是在执行pipelines的端到端智能问答时遇到了这个报错，环境是用Anaconda新建的虚拟环境，python=3.7.13, paddlepaddle-gpu的安装参考了官网的基于conda安装的命令，硬件版本： cuda: 11.4 nvidia-driver: 470.141.03 显卡：RTX-3090

环境中python包的版本： python=3.7.13 paddlepaddle-gpu=2.4.0.post112 paddlenlp=2.4.4 cudatoolkit=11.2.2 conda-forge cudnn=8.2.1.32 conda-forge

在执行快速https://github.com/PaddlePaddle/PaddleNLP/tree/develop/pipelines/examples/question-answering中的3.3一键体验问答系统的命令时，报错如下：

python examples/question-answering/dense_qa_example.py --device gpu
INFO - pipelines.document_stores.faiss -  document_cnt:1318 embedding_cnt:1318
INFO - pipelines.utils.common_utils -  Using devices: PLACE(GPU:0)
INFO - pipelines.utils.common_utils -  Number of GPUs: 1
[2022-11-30 15:48:04,021] [    INFO] - Already cached /home/ps/.paddlenlp/models/rocketqa-zh-dureader-query-encoder/rocketqa_zh_dureader_query_encoder.pdparams
W1130 15:48:04.025661 954735 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.4, Runtime API Version: 11.2
W1130 15:48:04.029268 954735 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2022-11-30 15:48:07,330] [    INFO] - Already cached /home/ps/.paddlenlp/models/rocketqa-zh-dureader-query-encoder/rocketqa_zh_dureader_query_encoder.pdparams
[2022-11-30 15:48:08,133] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'rocketqa-zh-dureader-query-encoder'.
[2022-11-30 15:48:08,134] [    INFO] - Already cached /home/ps/.paddlenlp/models/rocketqa-zh-dureader-query-encoder/rocketqa-zh-dureader-vocab.txt
[2022-11-30 15:48:08,144] [    INFO] - tokenizer config file saved in /home/ps/.paddlenlp/models/rocketqa-zh-dureader-query-encoder/tokenizer_config.json
[2022-11-30 15:48:08,145] [    INFO] - Special tokens file saved in /home/ps/.paddlenlp/models/rocketqa-zh-dureader-query-encoder/special_tokens_map.json
[2022-11-30 15:48:08,145] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'rocketqa-zh-dureader-query-encoder'.
[2022-11-30 15:48:08,145] [    INFO] - Already cached /home/ps/.paddlenlp/models/rocketqa-zh-dureader-query-encoder/rocketqa-zh-dureader-vocab.txt
[2022-11-30 15:48:08,156] [    INFO] - tokenizer config file saved in /home/ps/.paddlenlp/models/rocketqa-zh-dureader-query-encoder/tokenizer_config.json
[2022-11-30 15:48:08,156] [    INFO] - Special tokens file saved in /home/ps/.paddlenlp/models/rocketqa-zh-dureader-query-encoder/special_tokens_map.json
INFO - pipelines.utils.logger -  Logged parameters: 
 {'processor': 'TextSimilarityProcessor', 'tokenizer': 'NoneType', 'max_seq_len': '0', 'dev_split': '0.1'}
INFO - pipelines.utils.common_utils -  Using devices: PLACE(GPU:0)
INFO - pipelines.utils.common_utils -  Number of GPUs: 1
Loading Parameters from:rocketqa-zh-dureader-cross-encoder
[2022-11-30 15:48:08,157] [    INFO] - Already cached /home/ps/.paddlenlp/models/rocketqa-zh-dureader-cross-encoder/rocketqa_zh_dureader_cross_encoder.pdparams
[2022-11-30 15:48:08,933] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'rocketqa-zh-dureader-cross-encoder'.
[2022-11-30 15:48:08,933] [    INFO] - Already cached /home/ps/.paddlenlp/models/rocketqa-zh-dureader-cross-encoder/rocketqa-zh-dureader-vocab.txt
[2022-11-30 15:48:08,944] [    INFO] - tokenizer config file saved in /home/ps/.paddlenlp/models/rocketqa-zh-dureader-cross-encoder/tokenizer_config.json
[2022-11-30 15:48:08,944] [    INFO] - Special tokens file saved in /home/ps/.paddlenlp/models/rocketqa-zh-dureader-cross-encoder/special_tokens_map.json
INFO - pipelines.utils.common_utils -  Using devices: PLACE(GPU:0)
INFO - pipelines.utils.common_utils -  Number of GPUs: 1
[2022-11-30 15:48:08,946] [    INFO] - We are using <class 'paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering'> to load 'ernie-gram-zh-finetuned-dureader-robust'.
[2022-11-30 15:48:08,946] [    INFO] - Already cached /home/ps/.paddlenlp/models/ernie-gram-zh-finetuned-dureader-robust/model_state.pdparams
[2022-11-30 15:48:09,844] [    INFO] - We are using <class 'paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer'> to load 'ernie-gram-zh-finetuned-dureader-robust'.
[2022-11-30 15:48:09,844] [    INFO] - Already cached /home/ps/.paddlenlp/models/ernie-gram-zh-finetuned-dureader-robust/vocab.txt
[2022-11-30 15:48:09,854] [    INFO] - tokenizer config file saved in /home/ps/.paddlenlp/models/ernie-gram-zh-finetuned-dureader-robust/tokenizer_config.json
[2022-11-30 15:48:09,855] [    INFO] - Special tokens file saved in /home/ps/.paddlenlp/models/ernie-gram-zh-finetuned-dureader-robust/special_tokens_map.json
INFO - pipelines.utils.logger -  Logged parameters: 
 {'processor': 'SquadProcessor', 'tokenizer': 'ErnieGramTokenizer', 'max_seq_len': '256', 'dev_split': '0'}
Traceback (most recent call last):
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/pipelines/base.py", line 464, in run
    "component"]._dispatch_run(**node_input)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/nodes/base.py", line 130, in _dispatch_run
    return self._dispatch_run_general(self.run, **kwargs)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/nodes/base.py", line 176, in _dispatch_run_general
    output, stream = run_method(**run_inputs, **run_params)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/nodes/reader/base.py", line 109, in run
    results = predict(query=query, documents=documents, top_k=top_k)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/nodes/reader/base.py", line 147, in wrapper
    ret = fn(*args, **kwargs)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/nodes/reader/ernie_dureader.py", line 227, in predict
    dicts, indices=indices, return_baskets=True)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/data_handler/processor.py", line 305, in dataset_from_dicts
    indices)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/utils/tokenization.py", line 63, in tokenize_batch_question_answering
    add_special_tokens=False)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 2484, in batch_encode
    **kwargs,
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 1147, in _batch_encode_plus
    **kwargs)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 1295, in _batch_prepare_for_model
    **kwargs)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 2836, in prepare_for_model
    token_offset_mapping = self.get_offset_mapping(text)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 1404, in get_offset_mapping
    return self._get_bert_like_offset_mapping(text)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 1371, in _get_bert_like_offset_mapping
    start = text[offset:].index(token) + offset
ValueError: substring not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "examples/question-answering/dense_qa_example.py", line 132, in <module>
    dense_qa_pipeline()
  File "examples/question-answering/dense_qa_example.py", line 115, in dense_qa_pipeline
    prediction = pipe.run(query="北京市有多少个行政区？", params=pipeline_params)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/pipelines/standard_pipelines.py", line 229, in run
    output = self.pipeline.run(query=query, params=params, debug=debug)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/pipelines/base.py", line 468, in run
    f"Exception while running node `{node_id}` with input `{node_input}`: {e}, full stack trace: {tb}"
Exception: Exception while running node `Reader` with input `{'documents': [<Document: {'content': '北京（Beijing），简称“京”，古称燕京、北平，是中华人民共和国的首都、直辖市、国家中心城市、超大城市，国务院批复确定的中国政治中心、文化中心、国际交往中心、科技创新中心，截至2020年，全市下辖16个区，总面积16410.54平方千米。2021年末，北京市常住人口2188.6万人，比上年末减少0.4万人。2021年，北京市全年实现地区生产总值40269.6亿元，按不变价格计算，比上年增长8.5％。全市人均地区生产总值为18.4万元。', 'content_type': 'text', 'score': 0.013789152, 'meta': {'name': '298.txt', 'vector_id': '1132'}, 'embedding': None, 'id': 'dcae69b2307e76f2e8f28f1ff4108a17', 'id_hash_keys': None, '__pydantic_initialised__': True, 'ann_score': 0.9838276330088003, 'rank_score': 0.013789152}>], 'root_node': 'Query', 'params': {'Retriever': {'top_k': 50}, 'Ranker': {'top_k': 1}, 'Reader': {'top_k': 1}}, 'query': '北京市有多少个行政区？', 'node_id': 'Reader'}`: substring not found, full stack trace: Traceback (most recent call last):
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/pipelines/base.py", line 464, in run
    "component"]._dispatch_run(**node_input)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/nodes/base.py", line 130, in _dispatch_run
    return self._dispatch_run_general(self.run, **kwargs)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/nodes/base.py", line 176, in _dispatch_run_general
    output, stream = run_method(**run_inputs, **run_params)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/nodes/reader/base.py", line 109, in run
    results = predict(query=query, documents=documents, top_k=top_k)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/nodes/reader/base.py", line 147, in wrapper
    ret = fn(*args, **kwargs)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/nodes/reader/ernie_dureader.py", line 227, in predict
    dicts, indices=indices, return_baskets=True)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/data_handler/processor.py", line 305, in dataset_from_dicts
    indices)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/pipelines/utils/tokenization.py", line 63, in tokenize_batch_question_answering
    add_special_tokens=False)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 2484, in batch_encode
    **kwargs,
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 1147, in _batch_encode_plus
    **kwargs)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 1295, in _batch_prepare_for_model
    **kwargs)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 2836, in prepare_for_model
    token_offset_mapping = self.get_offset_mapping(text)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 1404, in get_offset_mapping
    return self._get_bert_like_offset_mapping(text)
  File "/home/ps/anaconda3/envs/paddlepaddle/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 1371, in _get_bert_like_offset_mapping
    start = text[offset:].index(token) + offset
ValueError: substring not found

Deeeerek commented 1 year ago

同遇到，异常字符都会报错

zcg-joker commented 1 year ago

我出现问题的场景与这个场景类似

noobexplore commented 1 year ago

我刚也遇到了升级到新的2.4.4之后报同样的错误，而之前的2.4.2并不报错

wj-Mcat commented 1 year ago

@noobexplore @zcg-joker @Deeeerek 我现在正在发起一个 pr 来修复这个问题，可是我不确定这个能否覆盖你们所有的场景，所以我想请你们把你们的场景数据贴出来，用一小段代码来描述问题，就像： https://github.com/PaddlePaddle/PaddleNLP/issues/3985 里面的代码：

from paddlenlp.transformers import BertModel, BertTokenizer, ErnieTokenizer

tokenizer = ErnieTokenizer.from_pretrained("ernie-3.0-base-zh")
sentence = "以σπδ来描述，包括一个σ键、两个π键和一个δ键"
print(tokenizer.get_offset_mapping(sentence))

这个 pr 目前正在紧急开发中，你们有任何问题都可以在这里沟通交流，如果我这边测试出来确实是 bug，我将会有针对性的修复。非常期待你们的回复。

noobexplore commented 1 year ago

@wj-Mcat

from paddlenlp.transformers import AutoTokenizer

test_str = "'台湾ｏｂｕ人民币业务今日启航'"
tokenizer = AutoTokenizer.from_pretrained("uie-base")
print(tokenizer.get_offset_mapping(test_str))

以上是我的场景数据报错如下：

Traceback (most recent call last):
  File "/home/xtl/github_download/PaddleNLP/model_zoo/uie/finetune.py", line 292, in <module>
    print(tokenizer.get_offset_mapping(test_str))
  File "/home/xtl/github_download/PaddleNLP/paddlenlp/transformers/tokenizer_utils.py", line 1404, in get_offset_mapping
    return self._get_bert_like_offset_mapping(text)
  File "/home/xtl/github_download/PaddleNLP/paddlenlp/transformers/tokenizer_utils.py", line 1371, in _get_bert_like_offset_mapping
    start = text[offset:].index(token) + offset
ValueError: substring not found

wj-Mcat commented 1 year ago

@noobexplore 使用上面的 PR分支 https://github.com/PaddlePaddle/PaddleNLP/pull/4003 的代码即可解决你的问题：

from paddlenlp.transformers import AutoTokenizer
test_str = "台湾ｏｂｕ人民币业务今日启航"
tokenizer = AutoTokenizer.from_pretrained("ernie-1.0-base-zh")

tokens = tokenizer.tokenize(test_str, add_special_tokens = False)
print(tokens)
# ['台', '湾', '[UNK]', '人', '民', '币', '业', '务', '今', '日', '启', '航']

offset_mappings = tokenizer.get_offset_mapping(test_str)
print(offset_mappings)
# [(0, 1), (1, 2), (2, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10), (10, 11), (11, 12), (12, 13), (13, 14)]

请关注上面的PR 最新动态。

noobexplore commented 1 year ago

OK thanks

Viserion-nlper commented 1 year ago

@wj-Mcat 请问目前2.4.4版本对该问题有解决方案吗？

wj-Mcat commented 1 year ago

@Viserion-nlper 2.4.4 版本已经发版本，这个 feature 是在 2.4.5 版本当中：这周五会发版。

所以你可以等周五晚上来下载最新版本，或者直接 clone develop 分支的代码来获得最新 feature。

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity. 当前issue 60天内无活动，被标记为stale。

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale. 当前issue 被标记为stale已有14天，即将关闭。

PaddlePaddle / PaddleNLP

[Question] 使用tokenizer报错 #3953

请提出你的问题