deepset-ai / hayhooks

Deploy Haystack pipelines behind a REST API.
https://haystack.deepset.ai
Apache License 2.0

Issue with Unstructured Document Converter (Related to Asyncio) #26

Open karbasia opened 3 months ago

karbasia commented 3 months ago

I'm testing a pipeline that uses the Unstructured Converter component to process PDFs. The pipeline works when I run it locally, but fails when it is served through Hayhooks.

The error is as follows:

2024-06-07 14:41:27 
Converting files to Haystack Documents: 0it [00:00, ?it/s]Unstructured could not process file /data/file.pdf. Error: Traceback (most recent call last):
2024-06-07 14:41:27   File "/opt/venv/lib/python3.12/site-packages/haystack_integrations/components/converters/unstructured/converter.py", line 198, in _partition_file_into_elements
2024-06-07 14:41:27     elements = partition_via_api(
2024-06-07 14:41:27                ^^^^^^^^^^^^^^^^^^
2024-06-07 14:41:27   File "/opt/venv/lib/python3.12/site-packages/unstructured/partition/api.py", line 70, in partition_via_api
2024-06-07 14:41:27     sdk = UnstructuredClient(api_key_auth=api_key, server_url=base_url)
2024-06-07 14:41:27           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-06-07 14:41:27   File "/opt/venv/lib/python3.12/site-packages/unstructured_client/sdk.py", line 54, in __init__
2024-06-07 14:41:27     self.sdk_configuration = SDKConfiguration(
2024-06-07 14:41:27                              ^^^^^^^^^^^^^^^^^
2024-06-07 14:41:27   File "<string>", line 13, in __init__
2024-06-07 14:41:27   File "/opt/venv/lib/python3.12/site-packages/unstructured_client/sdkconfiguration.py", line 38, in __post_init__
2024-06-07 14:41:27     self._hooks = SDKHooks()
2024-06-07 14:41:27                   ^^^^^^^^^^
2024-06-07 14:41:27   File "/opt/venv/lib/python3.12/site-packages/unstructured_client/_hooks/sdkhooks.py", line 15, in __init__
2024-06-07 14:41:27     init_hooks(self)
2024-06-07 14:41:27   File "/opt/venv/lib/python3.12/site-packages/unstructured_client/_hooks/registration.py", line 28, in init_hooks
2024-06-07 14:41:27     split_pdf_hook = SplitPdfHook()
2024-06-07 14:41:27                      ^^^^^^^^^^^^^^
2024-06-07 14:41:27   File "/opt/venv/lib/python3.12/site-packages/unstructured_client/_hooks/custom/split_pdf_hook.py", line 73, in __init__
2024-06-07 14:41:27     nest_asyncio.apply()
2024-06-07 14:41:27   File "/opt/venv/lib/python3.12/site-packages/nest_asyncio.py", line 19, in apply
2024-06-07 14:41:27     _patch_loop(loop)
2024-06-07 14:41:27   File "/opt/venv/lib/python3.12/site-packages/nest_asyncio.py", line 193, in _patch_loop
2024-06-07 14:41:27     raise ValueError('Can\'t patch loop of type %s' % type(loop))
2024-06-07 14:41:27 ValueError: Can't patch loop of type <class 'uvloop.Loop'>

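The call that fails is nest_asyncio.apply(), which the Unstructured client runs while setting up its SplitPdfHook. Under Hayhooks the event loop is a uvloop.Loop, and nest_asyncio refuses to patch a loop of that type.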
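To illustrate why the patch is rejected, here is a minimal standard-library sketch. It assumes, going by the error message above, that nest_asyncio only accepts loops derived from asyncio.BaseEventLoop. `FakeUvloopLoop` and `can_patch` are hypothetical names standing in for uvloop.Loop and for the type check inside nest_asyncio; they are not real APIs of either library.

```python
import asyncio


class FakeUvloopLoop(asyncio.AbstractEventLoop):
    """Hypothetical stand-in for uvloop.Loop.

    Like uvloop's loop, it implements the abstract event loop interface
    but does not inherit from asyncio.BaseEventLoop.
    """


def can_patch(loop: asyncio.AbstractEventLoop) -> bool:
    """Assumed equivalent of the type check that nest_asyncio performs."""
    return isinstance(loop, asyncio.BaseEventLoop)


std_loop = asyncio.new_event_loop()  # stdlib loop under the default policy
try:
    print("stdlib loop patchable:", can_patch(std_loop))
    print("uvloop-like loop patchable:", can_patch(FakeUvloopLoop()))
finally:
    std_loop.close()
```

If that assumption holds, any setup that runs the server on a standard asyncio loop instead of uvloop should avoid the ValueError.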

After some research, I fixed the issue for myself by updating the CLI code to the following. I'm not an expert here, so I wanted to know whether this approach is fine.

import asyncio
import os
import sys

import click
import uvicorn


@click.command()
@click.option('--host', default="localhost")
@click.option('--port', default=1416)
@click.option('--pipelines-dir', default=os.environ.get("HAYHOOKS_PIPELINES_DIR"))
@click.option('--additional-python-path', default=os.environ.get("HAYHOOKS_ADDITIONAL_PYTHONPATH"))
def run(host, port, pipelines_dir, additional_python_path):
    # Fall back to the default pipelines directory when none is configured.
    if not pipelines_dir:
        pipelines_dir = "pipelines.d"
    os.environ["HAYHOOKS_PIPELINES_DIR"] = pipelines_dir

    # Make extra modules importable by the pipelines.
    if additional_python_path:
        sys.path.append(additional_python_path)

    # Create a standard asyncio event loop (under the default policy) and
    # drive the server on it directly, so the server does not run on uvloop,
    # which nest_asyncio cannot patch.
    loop = asyncio.new_event_loop()
    config = uvicorn.Config("hayhooks.server:app", host=host, port=port, loop=loop)
    server = uvicorn.Server(config)
    loop.run_until_complete(server.serve())
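
An untested alternative sketch: instead of creating the loop by hand, ask uvicorn to use its standard asyncio loop. This assumes uvicorn.run accepts loop="asyncio" (the counterpart of the `--loop asyncio` CLI option) and that this is enough to keep uvloop out of the picture. `uvicorn_kwargs` and `serve` are hypothetical helper names, not part of Hayhooks, and uvicorn is imported lazily only so the settings can be inspected without starting a server.

```python
def uvicorn_kwargs(host: str = "localhost", port: int = 1416) -> dict:
    """Keyword arguments for uvicorn.run that request the stdlib asyncio loop."""
    # loop="asyncio" asks uvicorn not to pick uvloop, even when it is installed.
    return {"host": host, "port": port, "loop": "asyncio"}


def serve(host: str = "localhost", port: int = 1416) -> None:
    """Start the Hayhooks app on the standard asyncio event loop."""
    import uvicorn  # requires uvicorn to be installed

    uvicorn.run("hayhooks.server:app", **uvicorn_kwargs(host, port))
```

Inside the `run` command above, the last four lines would then reduce to a single `serve(host, port)` call.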