PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

[Question]: End-to-end semantic retrieval system: the code example for building a semantic retrieval system on the DuReader-Robust dataset fails to run #4948

Closed: huangbo-bo closed this issue 6 months ago

huangbo-bo commented 1 year ago

Please state your question

Command run:

```
python examples/semantic-search/semantic_search_example.py --device cpu --search_engine faiss
```

Error output:

```
INFO - pipelines.utils.preprocessing - Converting data/dureader_dev/dureader_dev/dev250.txt
INFO - pipelines.utils.preprocessing - Converting data/dureader_dev/dureader_dev/dev716.txt
INFO - pipelines.utils.preprocessing - Converting data/dureader_dev/dureader_dev/dev62.txt
INFO - pipelines.utils.preprocessing - Converting data/dureader_dev/dureader_dev/dev1165.txt
INFO - pipelines.document_stores.faiss - document_cnt:0 embedding_cnt:0
Writing Documents: 2000it [00:04, 492.30it/s]
INFO - pipelines.utils.common_utils - Using devices: PLACE(CPU)
INFO - pipelines.utils.common_utils - Number of GPUs: 0
Traceback (most recent call last):
  File "examples/semantic-search/semantic_search_example.py", line 185, in <module>
    semantic_search_tutorial()
  File "examples/semantic-search/semantic_search_example.py", line 164, in semantic_search_tutorial
    retriever = get_faiss_retriever(use_gpu)
  File "examples/semantic-search/semantic_search_example.py", line 87, in get_faiss_retriever
    embed_title=False,
  File "/usr/local/python3/lib/python3.7/site-packages/pipelines/nodes/retriever/dense.py", line 154, in __init__
    share_parameters=share_parameters,
TypeError: __init__() got an unexpected keyword argument 'output_emb_size'
```

Environment: OS: CentOS Linux release 7.9.2009 (Core); Python: 3.7.16; pip: 23.0.1

Installed packages (`pip list`, wrapped):

```
aiohttp 3.8.4 aiosignal 1.3.1 altair 4.2.2 anyio 3.6.2 astor 0.8.1 async-timeout 4.0.2 asynctest 0.13.0 attrdict 2.0.1 attrs 22.2.0 Babel 2.11.0 backports.lzma 0.0.14 backports.zoneinfo 0.2.1 bce-python-sdk 0.8.79 beautifulsoup4 4.11.2 blinker 1.5 cachetools 5.3.0 certifi 2022.12.7 cffi 1.15.1 charset-normalizer 3.0.1 click 8.0.0 colorama 0.4.6 colorlog 6.7.0 cryptography 39.0.1 cssselect 1.2.0 cssutils 2.6.0 cycler 0.11.0 Cython 0.29.33 datasets 2.9.0 decorator 5.1.1 dill 0.3.4 elasticsearch 7.10.0 entrypoints 0.4 et-xmlfile 1.1.0 faiss-cpu 1.7.3 fastapi 0.92.0 filelock 3.9.0 fire 0.5.0 Flask 2.2.3 Flask-Babel 2.0.0 fonttools 4.38.0 frozenlist 1.3.3 fsspec 2023.1.0 future 0.18.3 gitdb 4.0.10 GitPython 3.1.31 greenlet 2.0.2 grpcio 1.47.2 grpcio-tools 1.47.2 h11 0.14.0 htbuilder 0.6.1 huggingface-hub 0.12.1 idna 3.4 imageio 2.25.1 imgaug 0.4.0 importlib-metadata 6.0.0 importlib-resources 5.12.0 itsdangerous 2.1.2 jieba 0.42.1 Jinja2 3.1.2 joblib 1.2.0 jsonschema 4.17.3 kiwisolver 1.4.4 langdetect 1.0.9 llvmlite 0.39.1 lmdb 1.4.0 lxml 4.9.2 Markdown 3.4.1 markdown-it-py 2.2.0 MarkupSafe 2.1.2 matplotlib 3.5.3 mdurl 0.1.2 mmh3 3.0.0 more-itertools 9.0.0 multidict 6.0.4 multiprocess 0.70.12.2 networkx 2.6.3 nltk 3.8.1 numba 0.56.4 numpy 1.21.6 opencv-contrib-python 4.6.0.66 opencv-contrib-python-headless 4.7.0.68 opencv-python 4.6.0.66 opencv-python-headless 4.7.0.68 openpyxl 3.1.1 opt-einsum 3.3.0 packaging 23.0 paddle-bfloat 0.1.7 paddle-pipelines 0.4.0 paddle2onnx 1.0.5 paddlefsl 1.1.0 paddlenlp 2.4.3 paddleocr 2.6.1.3 paddlepaddle 2.4.2 pandas 1.3.5 pdf2docx 0.5.6 pdfminer.six 20221105 pdfplumber 0.8.0 Pillow 9.4.0 pip 23.0.1 pkgutil_resolve_name 1.3.10 premailer 3.10.0 protobuf 3.20.0 pyarrow 11.0.0 pyclipper 1.3.0.post4 pycparser 2.21 pycryptodome 3.17 pydantic 1.10.5 pydeck 0.8.0 Pygments 2.14.0 pymilvus 2.2.2 Pympler 1.0.1 PyMuPDF 1.20.2 pyparsing 3.0.9 pyrsistent 0.19.3 python-dateutil 2.8.2 python-docx 0.8.11 python-multipart 0.0.5 pytz 2022.7.1
pytz-deprecation-shim 0.1.0.post0 PyWavelets 1.3.0 PyYAML 6.0 rapidfuzz 2.13.7 regex 2022.10.31 requests 2.28.2 responses 0.18.0 rich 13.3.1 scikit-image 0.19.3 scikit-learn 1.0.2 scipy 1.7.3 semver 2.13.0 sentencepiece 0.1.97 seqeval 1.2.2 setuptools 47.1.0 shapely 2.0.1 six 1.16.0 smmap 5.0.0 sniffio 1.3.0 soupsieve 2.4 SQLAlchemy 1.4.46 SQLAlchemy-Utils 0.40.0 st-annotated-text 3.0.0 starlette 0.25.0 streamlit 1.9.0 termcolor 2.2.0 threadpoolctl 3.1.0 tifffile 2021.11.2 toml 0.10.2 toolz 0.12.0 tornado 6.2 tqdm 4.64.1 typer 0.7.0 typing_extensions 4.5.0 tzdata 2022.7 tzlocal 4.2 ujson 5.4.0 urllib3 1.26.14 uvicorn 0.20.0 validators 0.20.0 visualdl 2.4.2 Wand 0.6.11 watchdog 2.2.1 Werkzeug 2.2.3 xxhash 3.2.0 yarl 1.8.2 zipp 3.14.0
```

w5688414 commented 1 year ago

Please upgrade paddlenlp to 2.5.1.
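For reference, the upgrade can be done with pip; the version pin below simply matches the maintainer's suggestion:

```shell
# Upgrade paddlenlp to the suggested version
pip install --upgrade "paddlenlp==2.5.1"

# Confirm which version is now installed
python -c "import paddlenlp; print(paddlenlp.__version__)"
```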

huangbo-bo commented 1 year ago


After upgrading paddlenlp to 2.5.1, a different error is raised:

```
INFO - pipelines.utils.preprocessing - Converting data/dureader_dev/dureader_dev/dev1165.txt
INFO - pipelines.document_stores.faiss - document_cnt:0 embedding_cnt:0
Writing Documents: 2000it [00:03, 518.72it/s]
INFO - pipelines.utils.common_utils - Using devices: PLACE(CPU)
INFO - pipelines.utils.common_utils - Number of GPUs: 0
[2023-02-23 09:31:59,589] [    INFO] - Model config ErnieConfig {
  "attention_probs_dropout_prob": 0.1,
  "enable_recompute": false,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 312,
  "initializer_range": 0.02,
  "intermediate_size": 1248,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 2048,
  "model_type": "ernie",
  "num_attention_heads": 12,
  "num_hidden_layers": 4,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "task_id": 0,
  "task_type_vocab_size": 16,
  "type_vocab_size": 4,
  "use_task_id": true,
  "vocab_size": 40000
}

[2023-02-23 09:31:59,589] [    INFO] - Found /root/.paddlenlp/models/rocketqa-zh-nano-query-encoder/rocketqa-zh-nano-query-encoder.pdparams
Traceback (most recent call last):
  File "examples/semantic-search/semantic_search_example.py", line 185, in <module>
    semantic_search_tutorial()
  File "examples/semantic-search/semantic_search_example.py", line 164, in semantic_search_tutorial
    retriever = get_faiss_retriever(use_gpu)
  File "examples/semantic-search/semantic_search_example.py", line 87, in get_faiss_retriever
    embed_title=False,
  File "/usr/local/python3/lib/python3.7/site-packages/pipelines/nodes/retriever/dense.py", line 154, in __init__
    share_parameters=share_parameters,
  File "/usr/local/python3/lib/python3.7/site-packages/paddlenlp/transformers/semantic_search/modeling.py", line 97, in __init__
    self.query_ernie = ErnieEncoder.from_pretrained(query_model_name_or_path, output_emb_size=output_emb_size)
  File "/usr/local/python3/lib/python3.7/site-packages/paddlenlp/transformers/model_utils.py", line 486, in from_pretrained
    pretrained_model_name_or_path, from_hf_hub=from_hf_hub, subfolder=subfolder, *args, **kwargs
  File "/usr/local/python3/lib/python3.7/site-packages/paddlenlp/transformers/model_utils.py", line 1346, in from_pretrained_v2
    support_conversion=support_conversion,
  File "/usr/local/python3/lib/python3.7/site-packages/paddlenlp/transformers/model_utils.py", line 1026, in _resolve_model_file_path
    weight_file_path = get_path_from_url_with_filelock(pretrained_model_name_or_path, cache_dir)
  File "/usr/local/python3/lib/python3.7/site-packages/paddlenlp/utils/downloader.py", line 167, in get_path_from_url_with_filelock
    result = get_path_from_url(url=url, root_dir=root_dir, md5sum=md5sum, check_exist=check_exist)
  File "/usr/local/python3/lib/python3.7/site-packages/paddlenlp/utils/downloader.py", line 134, in get_path_from_url
    if tarfile.is_tarfile(fullpath) or zipfile.is_zipfile(fullpath):
  File "/usr/local/python3/lib/python3.7/tarfile.py", line 2442, in is_tarfile
    t = open(name)
  File "/usr/local/python3/lib/python3.7/tarfile.py", line 1575, in open
    return func(name, "r", fileobj, **kwargs)
  File "/usr/local/python3/lib/python3.7/tarfile.py", line 1702, in xzopen
    t = cls.taropen(name, mode, fileobj, **kwargs)
  File "/usr/local/python3/lib/python3.7/tarfile.py", line 1623, in taropen
    return cls(name, mode, fileobj, **kwargs)
  File "/usr/local/python3/lib/python3.7/tarfile.py", line 1486, in __init__
    self.firstmember = self.next()
  File "/usr/local/python3/lib/python3.7/tarfile.py", line 2289, in next
    tarinfo = self.tarinfo.fromtarfile(self)
  File "/usr/local/python3/lib/python3.7/tarfile.py", line 1094, in fromtarfile
    buf = tarfile.fileobj.read(BLOCKSIZE)
  File "/usr/local/python3/lib/python3.7/lzma.py", line 206, in read
    return self._buffer.read(size)
  File "/usr/local/python3/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/local/python3/lib/python3.7/_compression.py", line 96, in read
    if self._decompressor.needs_input:
AttributeError: '_lzma.LZMADecompressor' object has no attribute 'needs_input'
```
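One possible contributing factor, not confirmed in this thread: the environment list above includes `backports.lzma 0.0.14`, which can shadow or patch the standard-library `lzma` module and produce exactly this kind of `needs_input` AttributeError. A quick round-trip check of the interpreter's `lzma` can rule this in or out:

```python
# Sanity-check the interpreter's lzma module. If a backport shadows or
# patches the stdlib implementation, this round-trip can fail the same
# way the traceback above does.
import lzma

payload = b"paddlenlp cache sanity check"
compressed = lzma.compress(payload)
assert lzma.decompress(compressed) == payload

# Shows which file the module was actually loaded from
print("lzma OK:", lzma.__file__)
```

If the round trip fails or `lzma.__file__` points somewhere unexpected, uninstalling the backport is worth trying before anything else.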

w5688414 commented 1 year ago

Try deleting the previously downloaded models:

```
rm -rf ~/.paddlenlp
```
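If wiping the entire cache is too aggressive, removing only the entry for the failing encoder may suffice. The path below is inferred from the `Found /root/.paddlenlp/models/...` line in the traceback above and may differ on other setups:

```shell
# Remove only the cached rocketqa-zh-nano-query-encoder model files
# (directory inferred from the log output; adjust if your cache lives elsewhere)
rm -rf ~/.paddlenlp/models/rocketqa-zh-nano-query-encoder
```

On the next run, the script should re-download a fresh copy of just that model.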