Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/IndexError on OCR for certain pdf pages after page split #2255

Open cw5d opened 9 months ago

cw5d commented 9 months ago

Describe the bug: A strange one.

IndexError: list index out of range is raised when OCR'ing a portion of a PDF document, but whether it happens depends on the split size. My guess is that the first page of the split matters.

Relevant stack trace:

.venv/lib/python3.11/site-packages/unstructured/partition/ocr.py:171: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

filename = '/var/folders/5w/hcnw_g8d3cn9j_373dxm6jrm0000gn/T/tmpglza5pmp', out_layout = <unstructured_inference.inference.layout.DocumentLayout object at 0x16fddcdd0>, is_image = False
infer_table_structure = True, ocr_languages = 'eng', ocr_mode = 'entire_page', pdf_image_dpi = 200

    def process_file_with_ocr(
        filename: str,
        out_layout: "DocumentLayout",
        is_image: bool = False,
        infer_table_structure: bool = False,
        ocr_languages: str = "eng",
        ocr_mode: str = OCRMode.FULL_PAGE.value,
        pdf_image_dpi: int = 200,
    ) -> "DocumentLayout":
        """
        Process OCR data from a given file and supplement the output DocumentLayout
        from unstructured-inference with ocr.

        Parameters:
        - filename (str): The path to the input file, which can be an image or a PDF.

        - out_layout (DocumentLayout): The output layout from unstructured-inference.

        - is_image (bool, optional): Indicates if the input data is an image (True) or not (False).
            Defaults to False.

        - infer_table_structure (bool, optional):  If true, extract the table content.

        - ocr_languages (str, optional): The languages for OCR processing. Defaults to "eng" (English).

        - ocr_mode (str, optional): The OCR processing mode, e.g., "entire_page" or "individual_blocks".
            Defaults to "entire_page". If choose "entire_page" OCR, OCR processes the entire image
            page and will be merged with the output layout. If choose "individual_blocks" OCR,
            OCR is performed on individual elements by cropping the image.

        - pdf_image_dpi (int, optional): DPI (dots per inch) for processing PDF images. Defaults to 200.

        Returns:
            DocumentLayout: The merged layout information obtained after OCR processing.
        """
        merged_page_layouts = []
        try:
            if is_image:
                with PILImage.open(filename) as images:
                    image_format = images.format
                    for i, image in enumerate(ImageSequence.Iterator(images)):
                        image = image.convert("RGB")
                        image.format = image_format
                        merged_page_layout = supplement_page_layout_with_ocr(
                            out_layout.pages[i],
                            image,
                            infer_table_structure=infer_table_structure,
                            ocr_languages=ocr_languages,
                            ocr_mode=ocr_mode,
                        )
                        merged_page_layouts.append(merged_page_layout)
                    return DocumentLayout.from_pages(merged_page_layouts)
            else:
                with tempfile.TemporaryDirectory() as temp_dir:
                    _image_paths = pdf2image.convert_from_path(
                        filename,
                        dpi=pdf_image_dpi,
                        output_folder=temp_dir,
                        paths_only=True,
                    )
                    image_paths = cast(List[str], _image_paths)
                    for i, image_path in enumerate(image_paths):
                        with PILImage.open(image_path) as image:
                            merged_page_layout = supplement_page_layout_with_ocr(
>                               out_layout.pages[i],
                                image,
                                infer_table_structure=infer_table_structure,
                                ocr_languages=ocr_languages,
                                ocr_mode=ocr_mode,
                            )
E                           IndexError: list index out of range

.venv/lib/python3.11/site-packages/unstructured/partition/ocr.py:161: IndexError
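
The trace shows out_layout.pages[i] being indexed by the position of each image that pdf2image produced, so the error implies that the conversion yielded more page images than the layout has pages. Below is a minimal sketch of a guard that makes such a mismatch explicit instead of failing mid-loop. It assumes only that the two sequences are meant to have the same length; pair_pages_with_images is a hypothetical helper, not part of unstructured.

```python
from typing import List, Sequence, Tuple, TypeVar

P = TypeVar("P")


def pair_pages_with_images(
    layout_pages: Sequence[P], image_paths: Sequence[str]
) -> List[Tuple[P, str]]:
    """Pair each layout page with its rendered image, failing loudly on a mismatch.

    Hypothetical helper for illustration; not an unstructured API.
    """
    if len(layout_pages) != len(image_paths):
        # Report both counts so the failing split is easy to diagnose.
        raise ValueError(
            f"page count mismatch: layout has {len(layout_pages)} page(s), "
            f"but {len(image_paths)} image(s) were rendered"
        )
    return list(zip(layout_pages, image_paths))
```

Calling this before the loop would turn the bare IndexError into a message that states both counts, which should make it easier to tell which side is wrong for a given split.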

To Reproduce: Partition the attached file, 0uupv_Artisi+-+Brochure+-+FINAL06.06.23.pdf, in page splits.

With 10 pages per split, only the split covering pages 30-40 throws this error; the other splits are fine. 5 pages per split also triggers it. Other split sizes, such as 40, produce no errors.
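
To narrow down which split fails, it can help to enumerate the page ranges that each split size produces and partition them one at a time. A small sketch follows; split_page_ranges is a hypothetical helper, and both the 1-based inclusive convention and the 45-page total are assumptions for illustration, not details taken from the attached file or from how the library splits internally.

```python
from typing import List, Tuple


def split_page_ranges(total_pages: int, split_size: int) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive (first_page, last_page) ranges, one per split.

    Hypothetical helper for illustration; not an unstructured API.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if split_size < 1:
        raise ValueError("split_size must be at least 1")
    return [
        # The last split may be shorter than split_size.
        (first, min(first + split_size - 1, total_pages))
        for first in range(1, total_pages + 1, split_size)
    ]


# Made-up 45-page document, 10 pages per split.
print(split_page_ranges(45, 10))
# [(1, 10), (11, 20), (21, 30), (31, 40), (41, 45)]
```

Comparing the ranges for split sizes 5, 10, and 40 shows which starting pages the failing sizes share and the passing size avoids.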

Expected behavior: It shouldn't error out seemingly at random depending on the split size :)

Screenshots: N/A

Environment Info: Python 3.11 on Mac; also seen on Ubuntu.

Additional context: hi_res extraction. I have only encountered this error once, with this specific PDF file (attached).

alimoezzi commented 3 months ago

I'm having the same issue when extracting images.

2024-06-17 14:05:24,476 uvicorn.error ERROR Exception in ASGI application
Traceback (most recent call last):
  File "/home/notebook-user/.local/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/notebook-user/.local/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/fastapi/routing.py", line 193, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/concurrency.py", line 42, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
    return await future
  File "/home/notebook-user/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 859, in run
    result = context.run(func, *args)
  File "/home/notebook-user/prepline_general/api/general.py", line 850, in general_partition
    list(response_generator(is_multipart=False))[0]
  File "/home/notebook-user/prepline_general/api/general.py", line 785, in response_generator
    response = pipeline_api(
  File "/home/notebook-user/prepline_general/api/general.py", line 440, in pipeline_api
    elements = partition_pdf_splits(
  File "/home/notebook-user/prepline_general/api/general.py", line 220, in partition_pdf_splits
    return partition(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/auto.py", line 426, in partition
    elements = _partition_pdf(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/documents/elements.py", line 593, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 626, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 582, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 192, in partition_pdf
    return partition_pdf_or_image(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 288, in partition_pdf_or_image
    elements = _partition_pdf_or_image_local(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/utils.py", line 249, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 676, in _partition_pdf_or_image_local
    save_elements(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/pdf_image/pdf_image_utils.py", line 195, in save_elements
    image_path = image_paths[page_number - 1]
IndexError: list index out of range
2024-06-17 14:05:24,478 unstructured_api INFO Backing off call_api(...) for 1.8s (fastapi.exceptions.HTTPException: 500: list index out of range)
2024-06-17 14:05:26,273 unstructured_api DEBUG pipeline_api input params: {"filename": "3a782d85-d311-45e4-a38a-f02f7d7ebce7.pdf", "response_type": "application/json", "coordinates": false, "encoding": "utf-8", "hi_res_model_name": null, "include_page_breaks": false, "ocr_languages": null, "pdf_infer_table_structure": true, "skip_infer_table_types": ["jpg,png"], "strategy": "auto", "xml_keep_tags": false, "languages": ["eng,deu,fas,ara,heb,fra"], "extract_image_block_types": ["Image"], "unique_element_ids": false, "chunking_strategy": null, "combine_under_n_chars": null, "max_characters": 2000, "multipage_sections": true, "new_after_n_chars": null, "overlap": 0, "overlap_all": false, "starting_page_number": 10}

I'm running the API in parallel mode, configured with these environment variables:

UNSTRUCTURED_PARALLEL_MODE_ENABLED=true 
UNSTRUCTURED_PARALLEL_MODE_SPLIT_SIZE=10 
UNSTRUCTURED_PARALLEL_MODE_THREADS=10 
UNSTRUCTURED_PARALLEL_MODE_URL=http://localhost:8000/general/v0/general 
UNSTRUCTURED_PARALLEL_RETRY_ATTEMPTS=3
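
One possible reading of this trace, not confirmed anywhere in the thread: the failing line indexes image_paths with page_number - 1, while the request carries "starting_page_number": 10. If page_number is counted from the start of the whole document but image_paths only holds the images of the current split, the index would overshoot. The sketch below shows the offset translation under that assumption; image_path_for_page is a hypothetical helper, not the library's code.

```python
from typing import Sequence


def image_path_for_page(
    image_paths: Sequence[str], page_number: int, starting_page_number: int = 1
) -> str:
    """Look up the image for a document-level page number within one split.

    Assumption (unverified): image_paths[0] corresponds to starting_page_number.
    Hypothetical helper for illustration; not an unstructured API.
    """
    local_index = page_number - starting_page_number
    if not 0 <= local_index < len(image_paths):
        raise IndexError(
            f"page {page_number} is outside this split "
            f"(starts at page {starting_page_number}, {len(image_paths)} image(s))"
        )
    return image_paths[local_index]
```

With a three-image split starting at page 10, page 11 maps to index 1, whereas indexing with page_number - 1 would ask for index 10 and fail the same way the trace does.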