PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

[Question]: TypeError when building the quick-start semantic search system in PaddleNLP/pipelines/examples/semantic-search #5163

Closed deepLwcg closed 6 months ago

deepLwcg commented 1 year ago

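Please ask your question

System environment:

Following the guide, the problem occurs at step 3.4.2, "Writing document data into the ANN index":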

# Build the ANN index, using the DuReader-Robust dataset as an example
python utils/offline_ann.py --index_name dureader_robust_query_encoder \
                            --doc_dir data/dureader_dev \
                            --search_engine elastic \
                            --embed_title True \
                            --delete_index

[2023-03-09 12:38:24,769] [ INFO] - Special tokens file saved in /home/nullht/.paddlenlp/models/rocketqa-zh-nano-para-encoder/special_tokens_map.json
INFO - pipelines.utils.logger - Logged parameters: {'processor': 'TextSimilarityProcessor', 'tokenizer': 'NoneType', 'max_seq_len': '0', 'dev_split': '0.1'}
INFO - pipelines.document_stores.elasticsearch - Updating embeddings for all 1398 docs ...
Updating embeddings: 0%| | 0/1398 [00:00<?, ? Docs/sException in thread Thread-3: | 0/1408 [00:00<?, ? Docs/s]
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/nullht/anaconda3/envs/nlp/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/nullht/anaconda3/envs/nlp/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/nullht/anaconda3/envs/nlp/lib/python3.9/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 217, in _thread_loop
    batch = self._dataset_fetcher.fetch(indices,
  File "/home/nullht/anaconda3/envs/nlp/lib/python3.9/site-packages/paddle/fluid/dataloader/fetcher.py", line 138, in fetch
    data = self.collate_fn(data)
  File "/home/nullht/anaconda3/envs/nlp/lib/python3.9/site-packages/pipelines/nodes/retriever/dense.py", line 280, in token_padding_inputs
    input_ids = Pad(axis=0, pad_val=self.passage_tokenizer.pad_token_id, dtype="int64")(input_ids)
  File "/home/nullht/anaconda3/envs/nlp/lib/python3.9/site-packages/paddlenlp/data/collate.py", line 150, in __call__
    ret[i] = arr
ValueError: setting an array element with a sequence.
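
Note on the error: the message pair in this log (a TypeError about int() receiving a list, followed by "setting an array element with a sequence") is what NumPy reports when an int64 array is filled from a sample whose elements are lists rather than integers. The sketch below reproduces that situation with plain NumPy. It is not the PaddleNLP Pad implementation, the names fill_row, flat_ids and nested_ids are hypothetical, and whether the retriever really produced nested input_ids in this run is an assumption based only on the error text.

```python
import numpy as np


def fill_row(batch, row, sample):
    """Copy one sample into a preallocated int64 batch; return the error, if any."""
    try:
        batch[row] = sample
    except (TypeError, ValueError) as err:
        return err
    return None


# Well-formed sample: a flat sequence of token ids.
flat_ids = np.asarray([101, 2769, 102])

# Hypothetical malformed sample: every position holds a list, not an int.
nested_ids = np.empty(3, dtype=object)
for i, item in enumerate([[101, 2769], [102], [0]]):
    nested_ids[i] = item

batch = np.zeros((2, 3), dtype="int64")
ok = fill_row(batch, 0, flat_ids)      # succeeds, returns None
bad = fill_row(batch, 1, nested_ids)   # fails, returns the exception

print("flat sample error:", ok)
print("nested sample error:", type(bad).__name__, "-", bad)
```

If this reading is right, the padding step itself is behaving as designed and the unexpected nesting comes from how the batch was tokenized upstream.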

w5688414 commented 1 year ago

Please ask your question

System environment:

  • python 3.9.16
  • paddlenlp 2.5.2
  • paddle-pipelines 0.5.0
  • paddlepaddle-gpu 2.4.2.post117
  • CUDA Version 11.7
  • NVIDIA Driver Version 515.43.04
  • Ubuntu 22.04 LTS

Following the guide, the problem occurs at step 3.4.2, "Writing document data into the ANN index".

Please install the latest develop version, or set the embedding_title parameter to False.
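
A caution when turning the title option off from the command line: if a script declares a boolean flag with argparse's type=bool, passing the string False still yields True, because any non-empty string is truthy. Whether utils/offline_ann.py declares --embed_title this way has not been verified here, so treat the sketch below as an illustration of the general argparse behavior only. The helpers str2bool and parse_flag are hypothetical and not part of PaddleNLP.

```python
import argparse


def str2bool(value: str) -> bool:
    """Hypothetical converter that parses common boolean spellings explicitly."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_flag(argv, converter):
    """Parse a stand-in --embed_title flag using the given type converter."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--embed_title", default=False, type=converter)
    return parser.parse_args(argv).embed_title


# bool("False") is True, so type=bool does not turn the flag off.
print(parse_flag(["--embed_title", "False"], bool))      # True
# An explicit converter reads the text as intended.
print(parse_flag(["--embed_title", "False"], str2bool))  # False
# Omitting the flag falls back to the default.
print(parse_flag([], bool))                              # False
```

Under that assumption, the safest way to disable the option is to drop the flag from the command entirely and rely on its default.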