Unstructured-IO / unstructured-api

Apache License 2.0

ERROR - Exception in ASGI application / Error in pixCreateHeader #333

Open zakariamehbi opened 6 months ago

zakariamehbi commented 6 months ago

Describe the bug

2023-12-23 13:47:01,827 41.250.50.106:53059 POST /general/v0/general HTTP/1.1 - 500 Internal Server Error
2023-12-23 13:47:01,827 uvicorn.error ERROR Exception in ASGI application
Traceback (most recent call last):
  File "/home/notebook-user/.local/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/notebook-user/.local/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/fastapi/applications.py", line 1106, in __call__
    await super().__call__(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/home/notebook-user/.local/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/fastapi/routing.py", line 274, in app
    raw_response = await run_endpoint_function(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/fastapi/routing.py", line 193, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/home/notebook-user/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/home/notebook-user/prepline_general/api/general.py", line 811, in pipeline_1
    list(response_generator(is_multipart=False))[0]
  File "/home/notebook-user/prepline_general/api/general.py", line 749, in response_generator
    response = pipeline_api(
  File "/home/notebook-user/prepline_general/api/general.py", line 434, in pipeline_api
    elements = partition(**partition_kwargs)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/auto.py", line 384, in partition
    elements = _partition_pdf(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/documents/elements.py", line 503, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 591, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 546, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/chunking/title.py", line 241, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 172, in partition_pdf
    return partition_pdf_or_image(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 279, in partition_pdf_or_image
    elements = _partition_pdf_or_image_local(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/utils.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 409, in _partition_pdf_or_image_local
    final_layout = process_data_with_ocr(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/ocr.py", line 82, in process_data_with_ocr
    merged_layouts = process_file_with_ocr(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/ocr.py", line 168, in process_file_with_ocr
    raise e
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/ocr.py", line 157, in process_file_with_ocr
    merged_page_layout = supplement_page_layout_with_ocr(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/ocr.py", line 190, in supplement_page_layout_with_ocr
    ocr_layout = get_ocr_layout_from_image(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/ocr.py", line 430, in get_ocr_layout_from_image
    ocr_regions = get_ocr_layout_tesseract(image, ocr_languages)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/ocr.py", line 465, in get_ocr_layout_tesseract
    ocr_df = unstructured_pytesseract.image_to_data(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 591, in image_to_data
    return {
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 593, in <lambda>
    Output.DATAFRAME: lambda: get_pandas_output(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 568, in get_pandas_output
    return pd.read_csv(BytesIO(run_and_get_output(*args)), **kwargs)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 347, in run_and_get_output
    run_tesseract(**kwargs)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 279, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
unstructured_pytesseract.pytesseract.TesseractError: (1, 'Error in pixCreateHeader: requested w = 34680, h = 48360, d = 32 Error in pixCreateHeader: requested bytes >= 2^31 Error in pixCreateNoInit: pixd not made Error in pixCreate: pixd not made Error in pixReadStreamPng: pix not made Error in pixReadStream: png: no pix returned Error in pixRead: pix not read Error during processing.')
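
The numbers in the error message can be checked by hand. The following is a minimal sketch of that arithmetic, not Leptonica's actual code: it assumes each image row is padded to whole 32-bit words and that the limit is the 2^31 bytes quoted in the message. The names pix_bytes, exceeds_limit and LEPTONICA_MAX_BYTES are made up for illustration.

```python
# Worked arithmetic for the "requested bytes >= 2^31" message above.
# Assumption: rows are padded to 32-bit words, so size = 4 * words_per_line * height.

LEPTONICA_MAX_BYTES = 2**31  # limit quoted in the error text


def pix_bytes(width: int, height: int, depth: int) -> int:
    """Approximate raster size in bytes for an image of the given dimensions."""
    words_per_line = (width * depth + 31) // 32  # round each row up to whole words
    return 4 * words_per_line * height


def exceeds_limit(width: int, height: int, depth: int) -> bool:
    """True when the requested raster would hit or pass the 2^31-byte limit."""
    return pix_bytes(width, height, depth) >= LEPTONICA_MAX_BYTES


# Values taken from the traceback: w = 34680, h = 48360, d = 32.
requested = pix_bytes(34680, 48360, 32)
print(requested)                          # 6708499200
print(exceeds_limit(34680, 48360, 32))    # True
```

Under these assumptions the rendered page needs about 6.7 GB, roughly 3.1 times the limit, so the failure is caused by the size of the image handed to Tesseract rather than by the PDF's contents.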

To Reproduce

// Assumption: fileInput is an <input type="file"> element on the page.
const fileInput = document.querySelector('input[type="file"]');

// Multipart form fields sent to the API.
const formdata = new FormData();
formdata.append("files", fileInput.files[0], "ΣΤ ΔΗΜΟΤΙΚΟΥ 3.pdf");
formdata.append("output_format", "application/json");
formdata.append("coordinates", "false");
formdata.append("encoding", "utf-8");
formdata.append("hi_res_model_name", "detectron2_onnx");
formdata.append("include_page_breaks", "false");
formdata.append("ocr_languages", "");
formdata.append("pdf_infer_table_structure", "true");
formdata.append("skip_infer_table_types", "jpg, png");
formdata.append("strategy", "hi_res");
formdata.append("xml_keep_tags", "true");

// No custom headers: the browser sets the multipart Content-Type itself.
const requestOptions = {
  method: "POST",
  headers: new Headers(),
  body: formdata,
  redirect: "follow",
};

fetch("http://my_hosted_api:8000/general/v0/general", requestOptions)
  .then((response) => response.text())
  .then((result) => console.log(result))
  .catch((error) => console.log("error", error));

Filetype

Environment:

awalker4 commented 6 months ago

Hi there, this error has hopefully been fixed in the unstructured library. We're a bit behind on the unstructured version pinned in this repo's requirements. Can you try pip install unstructured==0.11.6 and see if this is resolved?
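
To confirm which unstructured version is actually installed in an environment (for example inside a running container), something like the following sketch may help. It uses only the standard library. The helpers installed_version, version_tuple and is_at_least are hypothetical names, and the version comparison is deliberately naive: it looks only at leading numeric components and ignores pre-release suffixes.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version: str) -> Tuple[int, ...]:
    """Naive parse: keep leading numeric components, stop at the first non-numeric one."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_at_least(version: str, minimum: str) -> bool:
    """Compare two version strings by their numeric components only."""
    return version_tuple(version) >= version_tuple(minimum)


if __name__ == "__main__":
    found = installed_version("unstructured")
    if found is None:
        print("unstructured is not installed in this environment")
    else:
        status = "ok" if is_at_least(found, "0.11.6") else "older than 0.11.6"
        print(found, status)
```

Running this inside the container shows whether a rebuilt image really picked up the bumped dependency.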

zakariamehbi commented 6 months ago

Hello @awalker4, I can't, because I'm using the Docker image. Is there any other way? About the versions, is there a reason for lagging behind? Thank you

awalker4 commented 6 months ago

No good reason other than our Dependabot seems to be broken 😂 Hang tight, I'll bump the versions now to get a new image out.

zakariamehbi commented 6 months ago

Thank you @awalker4 😂

zakariamehbi commented 6 months ago

@awalker4 No new docker image?

awalker4 commented 6 months ago

Ah, seems this job just needs to finish: https://github.com/Unstructured-IO/unstructured-api/actions/runs/7310799243

zakariamehbi commented 6 months ago

@awalker4 I did another test with the new image and unfortunately, I got the same error.

awalker4 commented 4 months ago

Apologies, this bug slipped off the radar. Are you still seeing this issue?