Please ask your question
While building the end-to-end semantic search system, the following error is raised at step 3.4.2, "Write document data into the ANN index":
[2023-08-15 14:37:11,246] [ INFO] - We are using (<class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'>, False) to load 'rocketqa-zh-nano-query-encoder'.
[2023-08-15 14:37:11,247] [ INFO] - Already cached /root/.paddlenlp/models/rocketqa-zh-nano-query-encoder/ernie_3.0_nano_zh_vocab.txt
[2023-08-15 14:37:11,296] [ INFO] - tokenizer config file saved in /root/.paddlenlp/models/rocketqa-zh-nano-query-encoder/tokenizer_config.json
[2023-08-15 14:37:11,297] [ INFO] - Special tokens file saved in /root/.paddlenlp/models/rocketqa-zh-nano-query-encoder/special_tokens_map.json
[2023-08-15 14:37:12,117] [ INFO] - We are using (<class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'>, False) to load 'rocketqa-zh-nano-para-encoder'.
[2023-08-15 14:37:12,118] [ INFO] - Already cached /root/.paddlenlp/models/rocketqa-zh-nano-para-encoder/ernie_3.0_nano_zh_vocab.txt
[2023-08-15 14:37:12,180] [ INFO] - tokenizer config file saved in /root/.paddlenlp/models/rocketqa-zh-nano-para-encoder/tokenizer_config.json
[2023-08-15 14:37:12,181] [ INFO] - Special tokens file saved in /root/.paddlenlp/models/rocketqa-zh-nano-para-encoder/special_tokens_map.json
INFO - pipelines.document_stores.elasticsearch - Updating embeddings for all 1398 docs ...
Updating embeddings: 0%| | 0/1398 [00:31<?, ? Docs/s]
Traceback (most recent call last):
  File "/home/project/PaddleNLP/pipelines/utils/offline_ann.py", line 129, in <module>
    offline_ann(args.index_name, args.doc_dir)
  File "/home/project/PaddleNLP/pipelines/utils/offline_ann.py", line 98, in offline_ann
    document_store.update_embeddings(retriever)
  File "/usr/local/lib/python3.9/site-packages/pipelines/document_stores/elasticsearch.py", line 1499, in update_embeddings
    bulk(self.client, doc_updates, request_timeout=300, refresh=self.refresh_type, headers=headers)
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/helpers/actions.py", line 390, in bulk
    for ok, item in streaming_bulk(client, actions, *args, **kwargs):
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/helpers/actions.py", line 311, in streaming_bulk
    for data, (ok, info) in zip(
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/helpers/actions.py", line 247, in _process_bulk_chunk
    for item in gen:
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/helpers/actions.py", line 196, in _process_bulk_chunk_error
    raise error
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/helpers/actions.py", line 235, in _process_bulk_chunk
    resp = client.bulk("\n".join(bulk_actions) + "\n", *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/client/utils.py", line 152, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/client/__init__.py", line 455, in bulk
    return self.transport.perform_request(
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/transport.py", line 392, in perform_request
    raise e
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/transport.py", line 358, in perform_request
    status, headers_response, data = connection.perform_request(
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/connection/http_urllib3.py", line 269, in perform_request
    self._raise_error(response.status, raw_data)
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/connection/base.py", line 312, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
elasticsearch.exceptions.RequestError: RequestError(400, 'x_content_parse_exception', '[1:22] [UpdateRequest] failed to parse field [doc]')
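The error says the server could not parse the `doc` field of an update request, so one possible diagnostic is to inspect the update actions before they are sent. The sketch below is only an illustration, not a confirmed cause or fix: it assumes each action is a dict with a `doc` entry, as accepted by the `bulk` helper for `_op_type: "update"`, and the helper name `find_bad_update_actions` and the field name `embedding` are hypothetical.

```python
import json


def find_bad_update_actions(actions):
    """Return (index, reason) pairs for actions whose `doc` is not strict JSON.

    Hypothetical helper for debugging; it does not talk to Elasticsearch.
    """
    problems = []
    for i, action in enumerate(actions):
        doc = action.get("doc")
        if not isinstance(doc, dict):
            problems.append((i, "doc is not a JSON object"))
            continue
        try:
            # allow_nan=False rejects NaN/Infinity, which are not valid JSON
            json.dumps(doc, allow_nan=False)
        except (TypeError, ValueError) as exc:
            problems.append((i, str(exc)))
    return problems


# Illustrative actions only; the real ones are built inside update_embeddings.
sample_actions = [
    {"_op_type": "update", "_id": "a", "doc": {"embedding": [0.1, 0.2]}},
    {"_op_type": "update", "_id": "b", "doc": {"embedding": [float("nan")]}},
    {"_op_type": "update", "_id": "c", "doc": [0.1, 0.2]},
]
for index, reason in find_bad_update_actions(sample_actions):
    print(index, reason)
```

If every action passes this check, the payload itself is probably well-formed and the problem is more likely on the request or server side.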
Deployment guide followed: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/pipelines/examples/semantic-search
System information and software versions:
Operating system: CentOS Stream release 9
Linux version 5.14.0-226.el9.x86_64 (mockbuild@x86-05.stream.rdu2.redhat.com) (gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4), GNU ld version 2.35.2-33.el9) #1 SMP PREEMPT_DYNAMIC Fri Dec 23 16:16:30 UTC 2022
Python: 3.9.16
Paddle packages:
paddle-bfloat 0.1.7
paddle-pipelines 0.5.2
paddle2onnx 1.0.5
paddleaudio 1.1.0
paddlefsl 1.1.0
paddlenlp 2.6.0rc0
paddleocr 2.6.1.3
paddlepaddle 2.5.1
paddleslim 2.4.1
paddlespeech 1.4.1
paddlespeech-feat 0.1.0
Elasticsearch: 8.7.1
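One thing that may be worth ruling out is a major-version mismatch between the Python client and the Elasticsearch 8.7.1 server. The client version is not listed in this report; the module paths in the traceback (`elasticsearch/connection/...`, `elasticsearch/transport.py`) look like the 7.x client layout, but that is an inference, not something confirmed here. The sketch below only compares version strings; the function names are hypothetical, and the client string `"7.17.9"` is a made-up example rather than the installed version.

```python
def major_version(version: str) -> int:
    """Return the leading integer of a dotted version string such as '8.7.1'."""
    return int(version.split(".")[0])


def same_major(client_version: str, server_version: str) -> bool:
    """True when client and server share the same major version."""
    return major_version(client_version) == major_version(server_version)


# "8.7.1" is the server version from this report; "7.17.9" is illustrative only.
print(same_major("7.17.9", "8.7.1"))
print(same_major("8.7.1", "8.7.1"))
```

In practice the two strings could be read from `elasticsearch.__versionstr__` on the client side and `client.info()["version"]["number"]` on the server side.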
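Summary: documents can be written to the Elasticsearch index, but the error above is raised when the documents are then updated.
Please help resolve this. Thanks!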