Status: Open. Opened by cw5d 11 months ago.
I'm having the same issue when extracting images.
2024-06-17 14:05:24,476 uvicorn.error ERROR Exception in ASGI application
Traceback (most recent call last):
File "/home/notebook-user/.local/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/notebook-user/.local/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
return await self.app(scope, receive, send)
File "/home/notebook-user/.local/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/home/notebook-user/.local/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/home/notebook-user/.local/lib/python3.10/site-packages/fastapi/routing.py", line 193, in run_endpoint_function
return await run_in_threadpool(dependant.call, **values)
File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/concurrency.py", line 42, in run_in_threadpool
return await anyio.to_thread.run_sync(func, *args)
File "/home/notebook-user/.local/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
File "/home/notebook-user/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
return await future
File "/home/notebook-user/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 859, in run
result = context.run(func, *args)
File "/home/notebook-user/prepline_general/api/general.py", line 850, in general_partition
list(response_generator(is_multipart=False))[0]
File "/home/notebook-user/prepline_general/api/general.py", line 785, in response_generator
response = pipeline_api(
File "/home/notebook-user/prepline_general/api/general.py", line 440, in pipeline_api
elements = partition_pdf_splits(
File "/home/notebook-user/prepline_general/api/general.py", line 220, in partition_pdf_splits
return partition(
File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/auto.py", line 426, in partition
elements = _partition_pdf(
File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/documents/elements.py", line 593, in wrapper
elements = func(*args, **kwargs)
File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 626, in wrapper
elements = func(*args, **kwargs)
File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 582, in wrapper
elements = func(*args, **kwargs)
File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
elements = func(*args, **kwargs)
File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 192, in partition_pdf
return partition_pdf_or_image(
File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 288, in partition_pdf_or_image
elements = _partition_pdf_or_image_local(
File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/utils.py", line 249, in wrapper
return func(*args, **kwargs)
File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 676, in _partition_pdf_or_image_local
save_elements(
File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/pdf_image/pdf_image_utils.py", line 195, in save_elements
image_path = image_paths[page_number - 1]
IndexError: list index out of range
2024-06-17 14:05:24,478 unstructured_api INFO Backing off call_api(...) for 1.8s (fastapi.exceptions.HTTPException: 500: list index out of range)
2024-06-17 14:05:26,273 unstructured_api DEBUG pipeline_api input params: {"filename": "3a782d85-d311-45e4-a38a-f02f7d7ebce7.pdf", "response_type": "application/json", "coordinates": false, "encoding": "utf-8", "hi_res_model_name": null, "include_page_breaks": false, "ocr_languages": null, "pdf_infer_table_structure": true, "skip_infer_table_types": ["jpg,png"], "strategy": "auto", "xml_keep_tags": false, "languages": ["eng,deu,fas,ara,heb,fra"], "extract_image_block_types": ["Image"], "unique_element_ids": false, "chunking_strategy": null, "combine_under_n_chars": null, "max_characters": 2000, "multipage_sections": true, "new_after_n_chars": null, "overlap": 0, "overlap_all": false, "starting_page_number": 10}
I'm running the API in parallel mode, enabled through these environment variables:
UNSTRUCTURED_PARALLEL_MODE_ENABLED=true
UNSTRUCTURED_PARALLEL_MODE_SPLIT_SIZE=10
UNSTRUCTURED_PARALLEL_MODE_THREADS=10
UNSTRUCTURED_PARALLEL_MODE_URL=http://localhost:8000/general/v0/general
UNSTRUCTURED_PARALLEL_RETRY_ATTEMPTS=3
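The last frame of the traceback fails on `image_paths[page_number - 1]`, which can only raise `IndexError` when the page number points past the end of the list of rendered page images. The sketch below reproduces just that condition in isolation. It is not the library's code: the list contents, the page numbers, and the `safe_lookup` helper with its `first_page` parameter are hypothetical, and the idea that an element's page number may be offset by `starting_page_number` while `image_paths` only covers the current split is an unconfirmed guess.

```python
def lookup_image_path(image_paths, page_number):
    """Mirror the failing expression from the traceback: a 1-based page
    number indexes a 0-based list of rendered page images."""
    return image_paths[page_number - 1]


def safe_lookup(image_paths, page_number, first_page=1):
    """Hypothetical guard: rebase the page number onto the split's first
    page and return None instead of raising when it is out of range."""
    index = page_number - first_page
    if 0 <= index < len(image_paths):
        return image_paths[index]
    return None


# Hypothetical values: a split with 10 rendered pages, and an element whose
# page number (11) is larger than the number of images in the list.
image_paths = [f"page_{i}.png" for i in range(10)]

try:
    lookup_image_path(image_paths, 11)
    raised = False
except IndexError:
    raised = True

print("IndexError raised:", raised)
print("rebased lookup:", safe_lookup(image_paths, 11, first_page=10))
```

If the guess about the offset holds, logging `page_number` and `len(image_paths)` just before the failing line would show a page number larger than the list length for the affected split.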
Describe the bug: A strange one. I get
IndexError: list index out of range
when OCR'ing a portion of a PDF document, but it doesn't always happen; it depends on the split size. My guess is that the first page of the split matters.
Relevant stack trace:
To Reproduce: Use the attached file, 0uupv_Artisi+-+Brochure+-+FINAL06.06.23.pdf
With 10 pages per split, only the split covering pages 30-40 throws this error; the other splits are fine. A split size of 5 pages also has this issue, but other split sizes, such as 40, produce no errors.
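To narrow down which pages are involved, it may help to list the page ranges each split size produces and see which splits overlap pages 30-40. This is a minimal sketch in plain Python. The helper names are made up for illustration, the total page count of 48 is a placeholder (the real count of the attached PDF is not stated here), and it assumes splits are consecutive, 1-based, and aligned to the start of the document, which may not match how the API actually cuts the file.

```python
def split_ranges(total_pages, split_size):
    """Return 1-based inclusive (first_page, last_page) ranges, one per split."""
    return [
        (first, min(first + split_size - 1, total_pages))
        for first in range(1, total_pages + 1, split_size)
    ]


def splits_touching(total_pages, split_size, first_page, last_page):
    """Return the splits that overlap the page span [first_page, last_page]."""
    return [
        (lo, hi)
        for lo, hi in split_ranges(total_pages, split_size)
        if lo <= last_page and hi >= first_page
    ]


TOTAL_PAGES = 48  # placeholder; replace with the real page count of the PDF

for size in (5, 10, 40):
    print(size, splits_touching(TOTAL_PAGES, size, 30, 40))
```

Under these assumptions, a split size of 40 places pages 30-40 inside the first split, which starts at page 1. That would be consistent with the guess that the first page of the split matters, but it does not prove it.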
Expected behavior: Well, it shouldn't error out randomly depending on the split size :)
Screenshots: N/A
Environment Info: Python 3.11 on Mac; also seen on Ubuntu
Additional context: hi_res extraction; only encountered this error once, with this specific PDF file, as attached.