Unstructured-IO / unstructured-python-client

A Python client for the Unstructured hosted API
MIT License

bug/Can't patch loop of type <class 'uvloop.Loop'> #154

Open sigridjineth opened 1 month ago

sigridjineth commented 1 month ago


Describe the bug

When attempting to use the UnstructuredClient to parse a PDF document, a ValueError is thrown due to an incompatibility with uvloop. This occurs when initializing the SplitPdfHook in the UnstructuredClient. The error suggests that nest_asyncio is unable to patch the uvloop.Loop.

Versions in use:

unstructured==0.15.1
unstructured-client==0.23.9

To Reproduce

from unstructured_client import UnstructuredClient
from langchain_community.document_loaders import UnstructuredAPIFileLoader

client = UnstructuredClient()

loader = UnstructuredAPIFileLoader(
    file_path="path/to/your/document.pdf",
    api_key="your-api-key",
    api_url="your-api-url"
)

# This line triggers the error
documents = loader.load_and_split()

Expected behavior

The UnstructuredClient should initialize successfully and be able to parse the PDF document without throwing a ValueError related to uvloop.

Environment Info

Please run `python scripts/collect_env.py` and paste the output here.
This will help us understand more about the environment in which the bug occurred.


Additional context

Traceback

Traceback (most recent call last):
  File "/app/pylon/core/document/unstructured.py", line 30, in parse_document_with_unstructuredio
    ).load_and_split()
      ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/langchain_core/document_loaders/base.py", line 64, in load_and_split
    docs = self.load()
           ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/langchain_core/document_loaders/base.py", line 30, in load
    return list(self.lazy_load())
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/langchain_community/document_loaders/unstructured.py", line 107, in lazy_load
    elements = self._get_elements()
               ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/langchain_community/document_loaders/unstructured.py", line 333, in _get_elements
    return get_elements_from_api(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/langchain_community/document_loaders/unstructured.py", line 261, in get_elements_from_api
    return partition_via_api(
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/api.py", line 69, in partition_via_api
    sdk = UnstructuredClient(api_key_auth=api_key, server_url=base_url)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured_client/sdk.py", line 54, in __init__
    self.sdk_configuration = SDKConfiguration(
                             ^^^^^^^^^^^^^^^^^
  File "<string>", line 13, in __init__
  File "/usr/local/lib/python3.11/site-packages/unstructured_client/sdkconfiguration.py", line 38, in __post_init__
    self._hooks = SDKHooks()
                  ^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured_client/_hooks/sdkhooks.py", line 15, in __init__
    init_hooks(self)
  File "/usr/local/lib/python3.11/site-packages/unstructured_client/_hooks/registration.py", line 28, in init_hooks
    split_pdf_hook = SplitPdfHook()
                     ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/unstructured_client/_hooks/custom/split_pdf_hook.py", line 73, in __init__
    nest_asyncio.apply()
  File "/usr/local/lib/python3.11/site-packages/nest_asyncio.py", line 18, in apply
    loop = loop or asyncio.get_event_loop()
                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/nest_asyncio.py", line 40, in _get_event_loop
    loop = events.get_event_loop_policy().get_event_loop()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/nest_asyncio.py", line 67, in get_event_loop
    _patch_loop(loop)
  File "/usr/local/lib/python3.11/site-packages/nest_asyncio.py", line 193, in _patch_loop
    raise ValueError('Can\'t patch loop of type %s' % type(loop))
ValueError: Can't patch loop of type <class 'uvloop.Loop'>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/pylon/routers/document.py", line 43, in create_documents_process
    document_info: DocumentsInfo = save_document(document, request.agent, request.organize)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/pylon/services/knowledge.py", line 67, in save_document
    parsed_document = parse_document(
                      ^^^^^^^^^^^^^^^
  File "/app/pylon/core/document/parser.py", line 126, in parse_document
    raise e
  File "/app/pylon/core/document/parser.py", line 119, in parse_document
    documents: list[LCDocument] = Parallel(n_jobs=-1, prefer='processes')(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/joblib/parallel.py", line 1918, in __call__
    return output if self.return_generator else list(output)
                                                ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/joblib/parallel.py", line 1847, in _get_sequential_output
    res = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/app/pylon/core/document/unstructured.py", line 38, in parse_document_with_unstructuredio
    raise FileParserAPIError(f'Failed to parse document from unstructured-io: {filename}. Error: {e!s}') from e
pylon.exceptions.custom_exceptions.FileParserAPIError: Failed to connect or communicate with the file parser server. details: Failed to parse document from unstructured-io. Error: Can't patch loop of type <class 'uvloop.Loop'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 406, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 75, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.11/site-packages/starlette/responses.py", line 162, in __call__
    await self.background()
  File "/usr/local/lib/python3.11/site-packages/starlette/background.py", line 45, in __call__
    await task()
  File "/usr/local/lib/python3.11/site-packages/starlette/background.py", line 30, in __call__
    await run_in_threadpool(self.func, *self.args, **self.kwargs)
  File "/usr/local/lib/python3.11/site-packages/starlette/concurrency.py", line 42, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 859, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/pylon/routers/document.py", line 46, in create_documents_process
    handle_exception(
  File "/app/pylon/exceptions/handlers.py", line 46, in handle_exception
    send_callback(callback_url, error_response)
  File "/usr/local/lib/python3.11/site-packages/tenacity/__init__.py", line 336, in wrapped_f
    return copy(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/tenacity/__init__.py", line 475, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/tenacity/__init__.py", line 376, in iter
    result = action(retry_state)
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/tenacity/__init__.py", line 418, in exc_check
    raise retry_exc.reraise()
          ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/tenacity/__init__.py", line 185, in reraise
    raise self.last_attempt.result()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.11/site-packages/tenacity/__init__.py", line 478, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/app/pylon/utils/callbacks.py", line 11, in send_callback
    response.raise_for_status()
  File "/usr/local/lib/python3.11/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)

Any guidance on resolving this issue or workarounds would be greatly appreciated.

awalker4 commented 1 month ago

Hi @sigridjineth, we have a fix for this error in 0.25.2. Can you upgrade and confirm that it resolves the issue? Unfortunately, the solution for now is to fall back to non-splitting mode in a uvloop context, but at least that prevents the error. Stay tuned for a better fix for splitting large PDFs in a nested event loop context.
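
For readers following along, the fallback idea can be sketched roughly as below. This is only an illustration, not the client's actual code: it assumes that nest_asyncio patches only loops derived from `asyncio.BaseEventLoop` and that `uvloop.Loop` does not derive from it. The names `loop_is_patchable`, `choose_pdf_mode`, and `NonPatchableLoop` are hypothetical, and `NonPatchableLoop` is a stand-in so the sketch runs without uvloop installed.

```python
import asyncio


def loop_is_patchable(loop: asyncio.AbstractEventLoop) -> bool:
    """Return True if the loop derives from asyncio.BaseEventLoop.

    Assumption: this mirrors the type check that raises
    "Can't patch loop of type ..." in the traceback above.
    """
    return isinstance(loop, asyncio.BaseEventLoop)


def choose_pdf_mode(loop: asyncio.AbstractEventLoop) -> str:
    """Hypothetical helper: split PDFs only when the loop can be patched."""
    return "split" if loop_is_patchable(loop) else "no-split"


class NonPatchableLoop(asyncio.AbstractEventLoop):
    """Stand-in for a uvloop-style loop that is not a BaseEventLoop."""


stdlib_loop = asyncio.new_event_loop()
try:
    print(choose_pdf_mode(stdlib_loop))         # standard asyncio loop
    print(choose_pdf_mode(NonPatchableLoop()))  # uvloop-like loop
finally:
    stdlib_loop.close()
```

Checking the loop type up front, rather than catching the ValueError afterwards, keeps client construction from failing in servers that install uvloop as the default loop.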

sigridjineth commented 1 month ago

@awalker4 thanks for checking it out!