deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

ZMQError: Address already in use when using the multiprocessing library and importing haystack modules #3625

Closed burtonrj closed 1 year ago

burtonrj commented 1 year ago

Describe the bug

There seems to be an issue between Python's multiprocessing library and haystack.

If I import the multiprocessing library and don't import any haystack modules, I can run the following code without any error:

from multiprocessing import Pool, cpu_count
from tqdm.notebook import tqdm
def func(x):
    return x*x

with Pool(cpu_count()) as pool:
    _ = list(tqdm(pool.imap(func, list(range(10000))), total=10000))

If, however, I import anything from haystack (say, the Document class) at any point in my notebook, like so:

from haystack.schema import Document
from multiprocessing import Pool, cpu_count
from tqdm.notebook import tqdm

Then the kernel dies and I get an error message: "ZMQError: Address already in use"
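
For reference, the complete failing cell is just the two snippets above combined; nothing here beyond what is already shown (per the report, any haystack import triggers it):

from haystack.schema import Document  # any haystack import, per the report
from multiprocessing import Pool, cpu_count
from tqdm.notebook import tqdm

def func(x):
    return x * x

with Pool(cpu_count()) as pool:
    _ = list(tqdm(pool.imap(func, list(range(10000))), total=10000))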

Error message

Traceback (most recent call last):
  File "/home/ross/anaconda3/envs/haystack/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ross/anaconda3/envs/haystack/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/ross/anaconda3/envs/haystack/lib/python3.10/site-packages/ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
  File "/home/ross/anaconda3/envs/haystack/lib/python3.10/site-packages/traitlets/config/application.py", line 981, in launch_instance
    app.initialize(argv)
  File "/home/ross/anaconda3/envs/haystack/lib/python3.10/site-packages/traitlets/config/application.py", line 110, in inner
    return method(app, *args, **kwargs)
  File "/home/ross/anaconda3/envs/haystack/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 666, in initialize
    self.init_sockets()
  File "/home/ross/anaconda3/envs/haystack/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 307, in init_sockets
    self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
  File "/home/ross/anaconda3/envs/haystack/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 244, in _bind_socket
    return self._try_bind_socket(s, port)
  File "/home/ross/anaconda3/envs/haystack/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 220, in _try_bind_socket
    s.bind("tcp://%s:%i" % (self.ip, port))
  File "/home/ross/anaconda3/envs/haystack/lib/python3.10/site-packages/zmq/sugar/socket.py", line 232, in bind
    super().bind(addr)
  File "zmq/backend/cython/socket.pyx", line 568, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 28, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use

Expected behavior

I want to use multiprocessing for other tasks outside the context of haystack, so I expect to still be able to use the multiprocessing library after importing haystack.

Additional context

This was performed in JupyterLab v3.5.0.

I created a separate virtual environment (called haystack) for my Jupyter kernel and installed haystack using the command:

pip install farm-haystack[gpu-all]

I had some issues with the install around FAISS and PyTorch:

Pip freeze of my environment:

aiohttp==3.8.3 aiorwlock==1.3.0 aiosignal==1.3.1 alembic==1.8.1 appdirs==1.4.4 astroid==2.12.13 asttokens==2.1.0 async-generator==1.10 async-timeout==4.0.2 attrs==22.1.0 audioread==3.0.0 azure-ai-formrecognizer==3.2.0 azure-common==1.1.28 azure-core==1.26.1 backcall==0.2.0 backoff==1.11.1 beautifulsoup4==4.11.1 beir==1.0.1 black==22.6.0 bleach==5.0.1 cattrs==22.2.0 certifi @ file:///croot/certifi_1665076670883/work/certifi cffi==1.15.1 cfgv==3.3.1 charset-normalizer==2.1.1 ci-sdr==0.0.2 click==8.0.4 cloudpickle==2.2.0 coloredlogs==15.0.1 ConfigArgParse==1.5.3 contourpy==1.0.6 coverage==6.5.0 ctc-segmentation==1.7.4 cycler==0.11.0 Cython==0.29.32 databind==1.5.3 databind.core==1.5.3 databind.json==1.5.3 databricks-cli==0.17.3 datasets==2.7.0 debugpy==1.6.3 decorator==5.1.1 defusedxml==0.7.1 Deprecated==1.2.13 dill==0.3.6 Distance==0.1.3 distlib==0.3.6 dnspython==2.2.1 docker==6.0.1 docopt==0.6.2 docspec==2.0.2 docspec-python==2.0.2 docstring-parser==0.11 einops==0.6.0 elasticsearch==7.9.1 entrypoints==0.4 espnet==202209 espnet-model-zoo==0.1.7 espnet-tts-frontend==0.0.3 exceptiongroup==1.0.4 executing==1.2.0 faiss-cpu==1.7.2 faiss-gpu==1.7.2 farm-haystack==1.11.0 fast-bss-eval==0.1.3 fastjsonschema==2.16.2 filelock==3.8.0 Flask==2.2.2 flatbuffers==22.10.26 fonttools==4.38.0 frozenlist==1.3.3 fsspec==2022.11.0 g2p-en==2.1.0 ghp-import==2.1.0 gitdb==4.0.9 GitPython==3.1.29 greenlet==2.0.1 grpcio==1.37.1 grpcio-tools==1.37.1 gunicorn==20.1.0 h11==0.14.0 h5py==3.7.0 huggingface-hub==0.11.0 humanfriendly==10.0 identify==2.5.9 idna==3.4 importlib-metadata==4.13.0 inflect==6.0.2 iniconfig==1.1.1 ipykernel==6.17.1 ipython==8.6.0 ipywidgets==8.0.2 isodate==0.6.1 isort==5.10.1 itsdangerous==2.1.2 jaconv==0.3 jamo==0.4.1 jarowinkler==1.2.3 jedi==0.18.2 Jinja2==3.1.2 joblib==1.2.0 jsonschema==4.17.0 jupyter_client==7.4.7 jupyter_core==5.0.0 jupytercontrib==0.0.7 jupyterlab-pygments==0.2.2 jupyterlab-widgets==3.0.3 kaldiio==2.17.2 kiwisolver==1.4.4 langdetect==1.0.9 lazy-object-proxy==1.8.0 librosa==0.9.2 llvmlite==0.39.1 loguru==0.6.0 lxml==4.9.1 Mako==1.2.4 Markdown==3.3.7 MarkupSafe==2.1.1 matplotlib==3.6.2 matplotlib-inline==0.1.6 mccabe==0.7.0 mergedeep==1.3.4 mistune==2.0.4 mkdocs==1.4.2 mlflow==2.0.1 mmh3==3.0.0 monotonic==1.6 more-itertools==9.0.0 mpmath==1.2.1 msgpack==1.0.4 msrest==0.7.1 multidict==6.0.2 multiprocess==0.70.14 mypy==0.991 mypy-extensions==0.4.3 nbclient==0.7.0 nbconvert==7.2.5 nbformat==5.7.0 nest-asyncio==1.5.6 networkx==2.8.8 nltk==3.7 nodeenv==1.7.0 nr.util==0.8.12 num2words==0.5.12 numba==0.56.4 numpy==1.23.5 oauthlib==3.2.2 onnx==1.12.0 onnxruntime-gpu==1.13.1 onnxruntime-tools==1.7.0 opensearch-py==2.0.0 outcome==1.2.0 packaging==21.3 pandas==1.5.1 pandocfilters==1.5.0 parso==0.8.3 pathspec==0.10.2 pdf2image==1.16.0 pexpect==4.8.0 pickleshare==0.7.5 Pillow==9.3.0 pinecone-client==2.0.13 platformdirs==2.5.4 pluggy==1.0.0 pooch==1.6.0 posthog==2.2.0 pre-commit==2.20.0 prompt-toolkit==3.0.33 protobuf==3.20.1 psutil==5.9.4 psycopg2-binary==2.9.5 ptyprocess==0.7.0 pure-eval==0.2.2 py==1.11.0 py-cpuinfo==9.0.0 py3nvml==0.2.7 pyarrow==10.0.0 pycparser==2.21 pydantic==1.10.2 pydoc-markdown==4.6.4 pydub==0.25.1 Pygments==2.13.0 PyJWT==2.6.0 pylint==2.15.6 pymilvus==2.0.2 pyparsing==3.0.9 pypinyin==0.44.0 pyrsistent==0.19.2 PySocks==1.7.1 pytesseract==0.3.10 pytest==7.2.0 pytest-custom-exit-code==0.3.0 python-dateutil==2.8.2 python-docx==0.8.11 python-dotenv==0.21.0 python-magic==0.4.27 python-multipart==0.0.5 pytorch-wpe==0.0.1 pytrec-eval==0.5 pytz==2022.6 pyworld==0.3.2 
PyYAML==5.4.1 pyyaml_env_tag==0.1 pyzmq==24.0.1 quantulum3==0.7.11 querystring-parser==1.2.4 rapidfuzz==2.7.0 ray==1.13.0 rdflib==6.2.0 regex==2022.10.31 requests==2.28.1 requests-cache==0.9.7 requests-oauthlib==1.3.1 resampy==0.4.2 responses==0.18.0 s3cmd==2.3.0 scikit-learn==1.1.3 scipy==1.9.3 selenium==4.6.0 sentence-transformers==2.2.2 sentencepiece==0.1.97 seqeval==1.2.2 shap==0.41.0 six==1.16.0 slicer==0.0.7 smmap==5.0.0 sniffio==1.3.0 sortedcontainers==2.4.0 soundfile==0.11.0 soupsieve==2.3.2.post1 SPARQLWrapper==2.0.0 SQLAlchemy==1.4.44 SQLAlchemy-Utils==0.38.3 sqlparse==0.4.3 stack-data==0.6.1 sympy==1.11.1 tabulate==0.9.0 threadpoolctl==3.1.0 tika==1.24 tinycss2==1.2.1 tokenize-rt==5.0.0 tokenizers==0.12.1 toml==0.10.2 tomli==2.0.1 tomli_w==1.0.0 tomlkit==0.11.6 torch==1.13.0+cu116 torch-complex==0.4.3 torchaudio==0.13.0+cu116 torchvision==0.14.0+cu116 tornado==6.2 tox==3.27.1 tqdm==4.64.1 traitlets==5.5.0 transformers==4.21.2 trio==0.22.0 trio-websocket==0.9.2 typeguard==2.13.3 typing_extensions==4.4.0 ujson==5.1.0 Unidecode==1.3.6 url-normalize==1.4.3 urllib3==1.26.12 validators==0.18.2 virtualenv==20.16.7 watchdog==2.1.9 wcwidth==0.2.5 weaviate-client==3.9.0 webdriver-manager==3.8.5 webencodings==0.5.1 websocket-client==1.4.2 Werkzeug==2.2.2 widgetsnbextension==4.0.3 wrapt==1.14.1 wsproto==1.2.0 xmltodict==0.13.0 xxhash==3.1.0 yapf==0.32.0 yarl==1.8.1 zipp==3.10.0


burtonrj commented 1 year ago

I also tried reinstalling haystack using the instructions on GitHub, i.e. cloning the repo and installing from source, but I'm still getting the same error when I run list(tqdm(pool.imap(func, list(range(10000))), total=10000)) after importing anything from haystack.

masci commented 1 year ago

Hi @burtonrj, I tried your snippet with the latest Haystack 1.11 in a fresh virtualenv, on my MacBook Pro with M1 (inside a Jupyter server) and on two different Ubuntu boxes, but I couldn't reproduce the error. Given that Haystack doesn't use ZeroMQ directly, I think this might depend on something in your environment that I don't have, but it's very hard to guess what.

Something you could try is to see which process is using the same ZMQ port as the shell_port of your Jupyter server when you import Haystack:
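
The exact commands aren't preserved in this copy of the thread, but a rough Python equivalent of the suggested check might look like the sketch below (burtonrj's reply further down runs the same check with cat and lsof; psutil is assumed available, and it does appear in the pip freeze above; the connection-file path is only an example, yours will differ):

import json
import psutil

# Kernel connection file under ~/.local/share/jupyter/runtime/
# (example filename -- the %connect_info magic prints the same JSON).
connection_file = "/home/ross/.local/share/jupyter/runtime/kernel-<id>.json"

with open(connection_file) as f:
    shell_port = json.load(f)["shell_port"]

# Print every process with a TCP socket touching the shell port,
# roughly what `lsof -i :<shell_port>` reports.
for conn in psutil.net_connections(kind="tcp"):
    ports = {addr.port for addr in (conn.laddr, conn.raddr) if addr}
    if shell_port in ports and conn.pid:
        print(conn.pid, psutil.Process(conn.pid).name(), conn.status)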

In my case I don't see anything suspicious; the Jupyter server is the process holding the port, but maybe we can spot something different in your env.

EDIT: sorry hit the wrong button didn't mean to close the issue :)

burtonrj commented 1 year ago

Hi @masci, thanks for picking this up. I've run the following as specified:

cat /home/ross/.local/share/jupyter/runtime/kernel-e52e49b1-53cc-4484-8617-532435b4963c.json

{
  "shell_port": 57219,
  "iopub_port": 40149,
  "stdin_port": 59363,
  "control_port": 53635,
  "hb_port": 47257,
  "ip": "127.0.0.1",
  "key": "1fb3c200-5e835d3b90afd86b8de1157b",
  "transport": "tcp",
  "signature_scheme": "hmac-sha256",
  "kernel_name": ""
}

lsof -i :57219

COMMAND     PID USER  FD  TYPE DEVICE SIZE/OFF NODE NAME
jupyter-l 74128 ross  30u IPv4 369706      0t0  TCP localhost:58662->localhost:57219 (ESTABLISHED)
python    74422 ross  10u IPv4 356981      0t0  TCP localhost:57219 (LISTEN)
python    74422 ross  17u IPv4 338800      0t0  TCP localhost:58658->localhost:57219 (ESTABLISHED)
python    74422 ross  19u IPv4 338801      0t0  TCP localhost:57219->localhost:58658 (ESTABLISHED)
python    74422 ross  52u IPv4 338806      0t0  TCP localhost:57219->localhost:58662 (ESTABLISHED)

I seem to have 4 other python processes using that port as well. If I restart the kernel, it's now using port 39023:

cat /home/ross/.local/share/jupyter/runtime/kernel-d0461869-1f7a-469d-9063-27c679aff82a.json

{
  "shell_port": 39023,
  "iopub_port": 60573,
  "stdin_port": 58031,
  "control_port": 60385,
  "hb_port": 55749,
  "ip": "127.0.0.1",
  "key": "3d718c0c-bf18003f7a8cb08aa4d223f0",
  "transport": "tcp",
  "signature_scheme": "hmac-sha256",
  "kernel_name": ""
}

And this time I don't import haystack at all; I literally just start the notebook up and run no code, and I still see the same 4 processes running:

lsof -i :39023

COMMAND     PID USER  FD  TYPE DEVICE SIZE/OFF NODE NAME
jupyter-l 74128 ross  25u IPv4 359767      0t0  TCP localhost:57336->localhost:39023 (ESTABLISHED)
python    75575 ross  10u IPv4 368843      0t0  TCP localhost:39023 (LISTEN)
python    75575 ross  17u IPv4 354805      0t0  TCP localhost:57332->localhost:39023 (ESTABLISHED)
python    75575 ross  18u IPv4 354806      0t0  TCP localhost:39023->localhost:57332 (ESTABLISHED)
python    75575 ross  51u IPv4 354810      0t0  TCP localhost:39023->localhost:57336 (ESTABLISHED)

Weird. Any thoughts? Sounds like it is just something funny happening on my machine. I'm happy for you to close this issue and I'll keep investigating and reopen it if necessary.

masci commented 1 year ago

@burtonrj first of all, I just learned about the %connect_info magic to get the same JSON data; wish I knew about it before, eheh.

The ports are assigned randomly at startup by default, so it's OK that they change. I have 6 processes myself; I think the Jupyter server starts its own worker pool, so that should be fine too. I'm curious to see whether just importing Haystack changes anything, i.e. whether you see an error or everything stays the same.

I'll keep trying to reproduce...

burtonrj commented 1 year ago

So I've been experimenting. When I published the original issue, I had multiple envs registered as ipykernel kernels in a base conda environment, from which I was launching JupyterLab, which, I'll admit, is a bit messy. I wondered if this was the problem, so I've tried to recreate the issue from within a single environment where both haystack and jupyter are installed, running the Jupyter server from within that env. But it just gets weirder.

I can use multiprocessing absolutely fine under two conditions:

  1. I run the above scenarios outside of Jupyter, so just within a Python shell: absolutely fine, no errors, no issues.
  2. I run the above scenarios inside Jupyter BUT import multiprocessing first: again, no errors.

As soon as I run it inside Jupyter BUT import haystack first, the kernel dies. WEIRD. Ultimately, I think this might be a Jupyter issue within my Ubuntu build and not a haystack issue.
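
In other words, the only difference between the working and failing Jupyter cells appears to be the import order. A minimal sketch of the ordering reported to work, using the same toy code from the original report:

from multiprocessing import Pool, cpu_count  # reported to work: multiprocessing first
from haystack.schema import Document         # ...then haystack
from tqdm.notebook import tqdm

# Per the experiments above, swapping the first two imports (haystack before
# multiprocessing) is what kills the kernel inside Jupyter.

def func(x):
    return x * x

with Pool(cpu_count()) as pool:
    _ = list(tqdm(pool.imap(func, list(range(10000))), total=10000))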

P.S. I've tried this with venv, hatch, and poetry for env management, and get the same issue every time.

masci commented 1 year ago

Hi @burtonrj can we close this issue or are you still facing that error?

burtonrj commented 1 year ago

Happy holidays @masci! This issue has resolved itself; I'm still not sure what caused it, but it must be machine specific. Feel free to close.

masci commented 1 year ago

Thanks @burtonrj happy holidays to you too!