PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

[Question]: "failed to parse field [doc]" error when building the end-to-end semantic search system #6731

Open myailab opened 1 year ago

myailab commented 1 year ago

Please ask your question

While building the end-to-end semantic search system, step 3.4.2 (writing document data into the ANN index store) fails with the following error:

[2023-08-15 14:37:11,246] [ INFO] - We are using (<class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'>, False) to load 'rocketqa-zh-nano-query-encoder'.
[2023-08-15 14:37:11,247] [ INFO] - Already cached /root/.paddlenlp/models/rocketqa-zh-nano-query-encoder/ernie_3.0_nano_zh_vocab.txt
[2023-08-15 14:37:11,296] [ INFO] - tokenizer config file saved in /root/.paddlenlp/models/rocketqa-zh-nano-query-encoder/tokenizer_config.json
[2023-08-15 14:37:11,297] [ INFO] - Special tokens file saved in /root/.paddlenlp/models/rocketqa-zh-nano-query-encoder/special_tokens_map.json
[2023-08-15 14:37:12,117] [ INFO] - We are using (<class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'>, False) to load 'rocketqa-zh-nano-para-encoder'.
[2023-08-15 14:37:12,118] [ INFO] - Already cached /root/.paddlenlp/models/rocketqa-zh-nano-para-encoder/ernie_3.0_nano_zh_vocab.txt
[2023-08-15 14:37:12,180] [ INFO] - tokenizer config file saved in /root/.paddlenlp/models/rocketqa-zh-nano-para-encoder/tokenizer_config.json
[2023-08-15 14:37:12,181] [ INFO] - Special tokens file saved in /root/.paddlenlp/models/rocketqa-zh-nano-para-encoder/special_tokens_map.json
INFO - pipelines.document_stores.elasticsearch - Updating embeddings for all 1398 docs ...
Updating embeddings: 0%| | 0/1398 [00:31<?, ? Docs/s]
Traceback (most recent call last):
  File "/home/project/PaddleNLP/pipelines/utils/offline_ann.py", line 129, in <module>
    offline_ann(args.index_name, args.doc_dir)
  File "/home/project/PaddleNLP/pipelines/utils/offline_ann.py", line 98, in offline_ann
    document_store.update_embeddings(retriever)
  File "/usr/local/lib/python3.9/site-packages/pipelines/document_stores/elasticsearch.py", line 1499, in update_embeddings
    bulk(self.client, doc_updates, request_timeout=300, refresh=self.refresh_type, headers=headers)
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/helpers/actions.py", line 390, in bulk
    for ok, item in streaming_bulk(client, actions, *args, **kwargs):
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/helpers/actions.py", line 311, in streaming_bulk
    for data, (ok, info) in zip(
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/helpers/actions.py", line 247, in _process_bulk_chunk
    for item in gen:
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/helpers/actions.py", line 196, in _process_bulk_chunk_error
    raise error
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/helpers/actions.py", line 235, in _process_bulk_chunk
    resp = client.bulk("\n".join(bulk_actions) + "\n", *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/client/utils.py", line 152, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/client/__init__.py", line 455, in bulk
    return self.transport.perform_request(
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/transport.py", line 392, in perform_request
    raise e
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/transport.py", line 358, in perform_request
    status, headers_response, data = connection.perform_request(
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/connection/http_urllib3.py", line 269, in perform_request
    self._raise_error(response.status, raw_data)
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/connection/base.py", line 312, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
elasticsearch.exceptions.RequestError: RequestError(400, 'x_content_parse_exception', '[1:22] [UpdateRequest] failed to parse field [doc]')

Deployment guide I followed: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/pipelines/examples/semantic-search

System information and software versions:

Operating system: CentOS Stream release 9

Linux version 5.14.0-226.el9.x86_64 (mockbuild@x86-05.stream.rdu2.redhat.com) (gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4), GNU ld version 2.35.2-33.el9) #1 SMP PREEMPT_DYNAMIC Fri Dec 23 16:16:30 UTC 2022

python: 3.9.16

paddle-bfloat 0.1.7
paddle-pipelines 0.5.2
paddle2onnx 1.0.5
paddleaudio 1.1.0
paddlefsl 1.1.0
paddlenlp 2.6.0rc0
paddleocr 2.6.1.3
paddlepaddle 2.5.1
paddleslim 2.4.1
paddlespeech 1.4.1
paddlespeech-feat 0.1.0

Elasticsearch: 8.7.1

Summary: the data can be written to the ES database, but the error above is raised when the documents are updated.
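One way to narrow this down without a running cluster is to rebuild the update body locally and check whether it is strict JSON. The sketch below is a guess at a possible cause, not a confirmed diagnosis. It assumes the embedding field is named `embedding`, that the client serializes with compact separators, and that a non-finite value (NaN or inf) ended up in an embedding vector. Python's `json.dumps` writes such values as `NaN` or `Infinity` by default, which are not valid JSON. The helper names are hypothetical and are not part of PaddleNLP or the Elasticsearch client.

```python
import json
import math


def build_update_body(embedding, field="embedding"):
    """Serialize a partial-document update body of the form {"doc": {...}}.

    `field` is an assumed name for the embedding field. Compact separators
    are assumed so column offsets can be compared with the server message.
    """
    return json.dumps({"doc": {field: list(embedding)}}, separators=(",", ":"))


def first_non_finite(embedding):
    """Return the index of the first NaN/inf value, or None if all are finite."""
    for i, value in enumerate(embedding):
        if not math.isfinite(value):
            return i
    return None


def is_strict_json(text):
    """Return True if `text` parses as JSON without NaN/Infinity extensions."""

    def reject(token):
        # json.loads calls this hook for 'NaN', 'Infinity' and '-Infinity'.
        raise ValueError(f"non-standard JSON constant: {token}")

    try:
        json.loads(text, parse_constant=reject)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    samples = {
        "finite": [0.12, -0.5, 0.33],
        "with_nan": [float("nan"), 0.5, 0.33],
    }
    for name, emb in samples.items():
        body = build_update_body(emb)
        print(name, body, is_strict_json(body), first_non_finite(emb))
```

Under those assumptions, a NaN in the first vector position starts at column 22 of the body, the same position as the `[1:22]` in the error message. That is suggestive only; running `first_non_finite` over the real embeddings before `update_embeddings` is called would confirm or rule it out.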

Please help resolve this. Thank you!

w5688414 commented 1 year ago

What happens if you switch ES to 8.3.3?

gvalbio commented 12 months ago

Has this been resolved? I ran into it too, with ES deployed in Docker.

ErwinOVO commented 8 months ago

I ran into this problem too. What should I do?