PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0
12.1k stars 2.94k forks

[Bug]: End-to-end semantic search system: building a semantic search system on the DuReader-Robust dataset raises sqlite3.OperationalError #6233

Closed suntao2015005848 closed 4 months ago

suntao2015005848 commented 1 year ago

Software environment

- paddlepaddle-gpu: 2.3.2.post112
- paddlenlp: 2.5.2
- linux: CentOS 7.6
- python: 3.7.5
- sqlite3: 3.7.17
Other environment:
Package                        Version
------------------------------ -------------
absl-py                        0.11.0
accelerate                     0.20.3
aiofiles                       23.1.0
aiohttp                        3.8.3
aiosignal                      1.3.1
altair                         5.0.1
amqp                           5.0.6
anyio                          3.6.2
apache-libcloud                3.3.1
appdirs                        1.4.4
asgiref                        3.3.4
astor                          0.8.1
async-timeout                  4.0.2
asynctest                      0.13.0
attrdict                       2.0.1
attrs                          22.2.0
auto-labeling-pipeline         0.1.15
Babel                          2.9.0
backports.zoneinfo             0.2.1
bce-python-sdk                 0.8.53
beautifulsoup4                 4.12.2
billiard                       3.6.4.0
blinker                        1.6.2
boilerpy3                      1.0.6
boto3                          1.17.55
botocore                       1.20.55
cached-property                1.5.2
cachetools                     4.2.1
celery                         5.0.5
certifi                        2020.12.5
cffi                           1.14.5
cfgv                           3.2.0
chardet                        4.0.0
charset-normalizer             2.1.1
click                          8.0.0
click-didyoumean               0.0.3
click-plugins                  1.1.1
click-repl                     0.1.6
cma                            2.7.0
colorama                       0.4.4
colorlog                       4.7.2
colour                         0.1.5
conllu                         4.4
coreapi                        2.3.3
coreschema                     0.0.4
cors                           1.0.1
cpm-kernels                    1.0.11
cryptography                   41.0.1
cssselect                      1.1.0
cssutils                       2.3.0
cycler                         0.10.0
Cython                         0.29.35
datasets                       2.11.0
ddparser                       0.1.2
decorator                      4.4.2
defusedxml                     0.7.1
dill                           0.3.4
distlib                        0.3.1
dj-database-url                0.5.0
dj-rest-auth                   2.1.4
Django                         3.2
django-celery-results          2.0.1
django-cors-headers            3.7.0
django-drf-filepond            0.3.0
django-filter                  2.4.0
django-polymorphic             3.0.0
django-rest-polymorphic        0.1.9
django-storages                1.11.1
djangorestframework            3.12.4
djangorestframework-csv        2.1.0
djangorestframework-xml        2.0.0
drf-yasg                       1.20.0
easydict                       1.9
ecdsa                          0.14.1
elasticsearch                  7.10.0
environs                       9.3.2
et-xmlfile                     1.0.1
Events                         0.4
faiss-cpu                      1.7.4
fastapi                        0.95.1
ffmpy                          0.3.0
filelock                       3.0.12
fire                           0.5.0
flake8                         3.8.4
Flask                          2.2.2
Flask-Babel                    2.0.0
Flask-Cors                     3.0.10
fonttools                      4.38.0
frozenlist                     1.3.3
fsspec                         2023.1.0
funcsigs                       1.0.2
furl                           2.1.2
future                         0.18.2
gast                           0.3.3
gevent                         22.10.2
gitdb                          4.0.7
GitPython                      3.1.18
glibc                          0.6.1
google-auth                    1.27.1
google-auth-oauthlib           0.4.3
gradio                         3.34.0
gradio_client                  0.2.6
graphviz                       0.16
greenlet                       2.0.2
grpcio                         1.53.0
gunicorn                       20.1.0
h11                            0.14.0
h5py                           3.3.0
htbuilder                      0.6.1
httpcore                       0.17.2
httpx                          0.24.1
huggingface-hub                0.15.1
identify                       2.1.0
idna                           2.10
imageio                        2.9.0
imgaug                         0.4.0
importlib-metadata             3.7.2
importlib-resources            5.12.0
inference                      0.1
inflection                     0.5.1
install                        1.3.5
itsdangerous                   2.1.2
itypes                         1.2.0
jieba                          0.42.1
Jinja2                         3.1.2
jmespath                       0.10.0
joblib                         1.0.1
jsonschema                     4.17.3
kiwisolver                     1.3.1
kombu                          5.0.2
LAC                            2.1.1
langdetect                     1.0.9
latex2mathml                   3.75.2
linkify-it-py                  2.0.2
llvmlite                       0.39.1
lmdb                           1.1.1
lml                            0.1.0
lxml                           4.6.3
Markdown                       3.3.4
markdown-it-py                 2.2.0
MarkupSafe                     2.1.2
marshmallow                    3.11.1
matplotlib                     3.3.4
mccabe                         0.6.1
mdit-py-plugins                0.3.3
mdtex2html                     1.2.0
mdurl                          0.1.2
mmh3                           4.0.0
more-itertools                 9.1.0
multidict                      6.0.4
multiprocess                   0.70.12.2
networkx                       2.5
nltk                           3.5
nodeenv                        1.5.0
numba                          0.56.4
numpy                          1.19.5
nvidia-cuda-nvrtc-cu11         11.7.99
nvidia-cuda-runtime-cu11       11.7.99
nvidia-cudnn-cu11              8.5.0.96
oauthlib                       3.1.0
objgraph                       3.5.0
openai                         0.26.5
opencv-contrib-python          4.4.0.46
opencv-contrib-python-headless 4.7.0.72
opencv-python                  4.6.0.66
openpyxl                       3.0.7
opt-einsum                     3.3.0
orderedmultidict               1.0.1
orjson                         3.9.1
packaging                      20.9
paddle-bfloat                  0.1.7
paddle-pipelines               0.5.3
paddle2onnx                    1.0.6
paddlefsl                      1.1.0
paddlehub                      2.1.0
paddlenlp                      2.5.2
paddleocr                      2.6.1.3
paddlepaddle-gpu               2.3.2.post112
pandas                         1.3.5
pathlib                        1.0.1
pdf2docx                       0.5.6
pdf2image                      1.16.3
pdfminer.six                   20221105
pdfplumber                     0.9.0
Pillow                         9.5.0
pip                            23.1.2
pipenv                         2020.11.15
pkgutil_resolve_name           1.3.10
pre-commit                     2.11.0
premailer                      3.10.0
prettytable                    2.1.0
prompt-toolkit                 3.0.18
protobuf                       3.20.0
psutil                         5.9.5
pyarrow                        11.0.0
pyasn1                         0.4.8
pyasn1-modules                 0.2.8
pyclipper                      1.2.1
pycodestyle                    2.6.0
pycparser                      2.20
pycryptodome                   3.10.1
pydantic                       1.10.7
pydeck                         0.8.1b0
pydub                          0.25.1
pyexcel                        0.6.6
pyexcel-io                     0.6.4
pyexcel-xlsx                   0.6.0
pyflakes                       2.2.0
Pygments                       2.15.1
PyJWT                          2.0.1
pymilvus                       2.2.12
Pympler                        1.0.1
PyMuPDF                        1.20.2
pyparsing                      2.4.7
pyrsistent                     0.19.3
PySocks                        1.7.1
python-dateutil                2.8.1
python-docx                    0.8.11
python-dotenv                  0.17.0
python-jose                    3.2.0
python-Levenshtein             0.12.2
python-multipart               0.0.6
python3-openid                 3.2.0
pytils                         0.3
pytz                           2021.1
pytz-deprecation-shim          0.1.0.post0
PyWavelets                     1.1.1
PyYAML                         5.4.1
pyzmq                          18.1.1
rapidfuzz                      3.1.1
rarfile                        4.0
regex                          2020.11.13
requests                       2.25.1
requests-file                  1.5.1
requests-oauthlib              1.3.0
responses                      0.18.0
rich                           13.3.4
rsa                            4.7.2
ruamel.yaml                    0.17.4
ruamel.yaml.clib               0.2.2
s3transfer                     0.4.0
scikit-image                   0.17.2
scikit-learn                   0.24.1
scipy                          1.3.1
semantic-version               2.10.0
semver                         3.0.1
sentencepiece                  0.1.95
seqeval                        1.2.2
setuptools                     68.0.0
Shapely                        1.7.1
shellcheck-py                  0.7.1.1
shortuuid                      1.0.1
six                            1.15.0
smmap                          4.0.0
sniffio                        1.3.0
social-auth-app-django         4.0.0
social-auth-core               4.1.0
soupsieve                      2.4.1
SQLAlchemy                     1.4.11
SQLAlchemy-Utils               0.41.1
sqlparse                       0.4.1
sseclient-py                   1.7.2
st-annotated-text              4.0.0
starlette                      0.26.1
streamlit                      1.11.1
tenacity                       8.2.2
tensorboard                    2.4.1
tensorboard-plugin-wit         1.8.0
termcolor                      2.3.0
texttable                      1.6.3
threadpoolctl                  2.1.0
tifffile                       2021.3.17
tldextract                     3.4.0
tokenizers                     0.13.3
toml                           0.10.2
tools                          0.1.9
toolz                          0.12.0
torch                          1.13.1
tornado                        6.2
tqdm                           4.65.0
transformers                   4.27.1
typer                          0.7.0
typing_extensions              4.5.0
tzdata                         2023.3
tzlocal                        4.3
uc-micro-py                    1.0.2
ujson                          5.7.0
unicodecsv                     0.14.1
uritemplate                    3.0.1
urllib3                        1.26.3
uvicorn                        0.21.1
validators                     0.20.0
vine                           5.0.0
virtualenv                     20.4.2
virtualenv-clone               0.5.4
visualdl                       2.1.1
waitress                       2.1.2
Wand                           0.6.11
watchdog                       3.0.0
wcwidth                        0.2.5
websockets                     11.0.3
Werkzeug                       2.2.2
wheel                          0.36.2
whitenoise                     5.2.0
wordcloud                      1.8.2.2
xxhash                         3.2.0
yapf                           0.26.0
yarl                           1.8.2
zipp                           3.4.1
zope.event                     4.6
zope.interface                 5.5.2

Duplicate issues

Error description

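Reference project: https://aistudio.baidu.com/aistudio/projectdetail/4442670?channelType=0&channel=0
When testing end-to-end semantic search, a sqlite3.OperationalError syntax error is raised. I tried upgrading paddle-pipelines from 0.1.0 to 0.5.3, but the problem persists. Running the quick-start example from the official git repository produces the same error. However, the bug does not occur when testing in the AI Studio environment.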

Steps to reproduce & code

On my machine, the error is raised when the code reaches the step that uses the retriever to extract text embeddings and write them into faiss, document_store.update_embeddings(retriever):

INFO - pipelines.document_stores.faiss -  Updating embeddings for 1398 docs...
Updating Embedding:   0%|          | 0/1398 [00:00<?, ? docs/s]
Traceback (most recent call last):
  File "/usr/local/python3/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1707, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/python3/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 716, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: near "(": syntax error

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./examples/semantic-search/semantic_search_example.py", line 185, in <module>
    semantic_search_tutorial()
  File "./examples/semantic-search/semantic_search_example.py", line 164, in semantic_search_tutorial
    retriever = get_faiss_retriever(use_gpu)
  File "./examples/semantic-search/semantic_search_example.py", line 91, in get_faiss_retriever
    document_store.update_embeddings(retriever)
  File "/usr/local/python3/lib/python3.7/site-packages/pipelines/document_stores/faiss.py", line 377, in update_embeddings
    for document_batch in batched_documents:
  File "/usr/local/python3/lib/python3.7/site-packages/pipelines/document_stores/base.py", line 673, in get_batches_from_generator
    x = tuple(islice(it, n))
  File "/usr/local/python3/lib/python3.7/site-packages/pipelines/document_stores/sql.py", line 327, in _query
    for i, row in enumerate(documents_query, start=1):
  File "/usr/local/python3/lib/python3.7/site-packages/pipelines/document_stores/sql.py", line 775, in _windowed_query
    for whereclause in self._column_windows(q.session, column, windowsize):
  File "/usr/local/python3/lib/python3.7/site-packages/pipelines/document_stores/sql.py", line 762, in _column_windows
    intervals = [id for id, in q]
  File "/usr/local/python3/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2827, in __iter__
    return self._iter().__iter__()
  File "/usr/local/python3/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2837, in _iter
    execution_options={"_sa_orm_load_options": self.load_options},
  File "/usr/local/python3/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1670, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "/usr/local/python3/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1521, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/usr/local/python3/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 314, in _execute_on_connection
    self, multiparams, params, execution_options
  File "/usr/local/python3/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1400, in _execute_clauseelement
    cache_hit=cache_hit,
  File "/usr/local/python3/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1750, in _execute_context
    e, statement, parameters, cursor, context
  File "/usr/local/python3/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1931, in _handle_dbapi_exception
    sqlalchemy_exception, with_traceback=exc_info[2], from_=e
  File "/usr/local/python3/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/usr/local/python3/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1707, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/python3/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 716, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) near "(": syntax error
[SQL: SELECT anon_1.document_id AS anon_1_document_id 
FROM (SELECT document.id AS document_id, row_number() OVER (ORDER BY document.id) AS rownum 
FROM document) AS anon_1 
WHERE rownum % 10000=1]
(Background on this error at: http://sqlalche.me/e/14/e3q8)
suntao2015005848 commented 1 year ago

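Testing so far shows that this is a SQLite version issue. With python3.7.4 and SQLite 3.30.0, the following statement executes successfully: cursor.execute('SELECT anon_1.document_id AS anon_1_document_id FROM (SELECT document.id AS document_id, row_number() OVER (ORDER BY document.id) AS rownum FROM document) AS anon_1 WHERE rownum % 10000=1'). With python3.7.5 and SQLite 3.7.17, the same SQL fails with the error shown above. I hope the developers will look into this bug, since switching environments is rather troublesome.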
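The difference between the two environments is consistent with SQLite's release history: window functions such as row_number() OVER (...) were added in SQLite 3.25.0, so 3.7.17 cannot parse the failing statement while 3.30.0 can. A minimal sketch of a check that can be run in the affected environment follows. It uses only the standard-library sqlite3 module; `MIN_WINDOW_VERSION` and `supports_window_functions` are hypothetical names chosen for illustration and are not part of paddle-pipelines.

```python
import sqlite3

# Assumption stated in the text above: window functions arrived in SQLite 3.25.0.
MIN_WINDOW_VERSION = (3, 25, 0)


def supports_window_functions(conn):
    """Return True if the SQLite library behind `conn` can run a row_number() query."""
    probe = "SELECT row_number() OVER (ORDER BY x) FROM (SELECT 1 AS x)"
    try:
        conn.execute(probe).fetchall()
        return True
    except sqlite3.OperationalError:
        # Older SQLite builds fail here with: near "(": syntax error
        return False


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # sqlite3.sqlite_version is the version of the linked SQLite library,
    # which is independent of the Python version.
    print("SQLite library version:", sqlite3.sqlite_version)
    print("Version is at least 3.25.0:", sqlite3.sqlite_version_info >= MIN_WINDOW_VERSION)
    print("Window functions available:", supports_window_functions(conn))
```

Note that the SQLite version is determined by the library Python was built or linked against, which is why two nearly identical Python releases can behave differently here.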

suntao2015005848 commented 1 year ago

w5688414 commented 1 year ago

Thank you for the feedback. You are welcome to contribute a PR with a fix.
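One possible direction for such a fix, sketched here under stated assumptions, is to page through document ids with keyset pagination (WHERE id > ? ORDER BY id LIMIT ?) instead of computing window boundaries with row_number(), since that form does not need window-function support. The sketch uses the standard-library sqlite3 module and only the `document` table and `id` column visible in the failing SQL. The function name `iter_document_id_batches` is hypothetical, and this is not the code in pipelines/document_stores/sql.py.

```python
import sqlite3


def iter_document_id_batches(conn, batch_size=10000):
    """Yield lists of document ids in ascending id order, batch_size at a time.

    Uses keyset pagination, so no window functions are required.
    """
    last_id = None
    while True:
        if last_id is None:
            cursor = conn.execute(
                "SELECT id FROM document ORDER BY id LIMIT ?", (batch_size,)
            )
        else:
            cursor = conn.execute(
                "SELECT id FROM document WHERE id > ? ORDER BY id LIMIT ?",
                (last_id, batch_size),
            )
        ids = [row[0] for row in cursor.fetchall()]
        if not ids:
            break
        yield ids
        # The next page starts strictly after the last id already returned.
        last_id = ids[-1]


if __name__ == "__main__":
    # Hypothetical in-memory data, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE document (id TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO document (id) VALUES (?)",
        [("doc-%04d" % i,) for i in range(25)],
    )
    for batch in iter_document_id_batches(conn, batch_size=10):
        print(len(batch), batch[0], batch[-1])
```

Whether this fits the existing SQLAlchemy query structure in the library would need to be checked against the actual source. The sketch only shows that the same batching can be expressed without row_number().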