Unstructured-IO / unstructured-api

Apache License 2.0

Error inferring tables on docker api (hi_res / pdf) #310

Closed jrcapicua closed 10 months ago

jrcapicua commented 11 months ago

When running table inference on PDF files through the API, the Docker container exits with a segmentation fault.

To reproduce, run the quay.io/unstructured-io/unstructured-api:latest image and send the following request:

import requests

url = 'http://localhost:8000/general/v0/general'

headers = {
    'accept': 'application/json',
    'unstructured-api-key': 'XXXX',
}

data = {
    "strategy": "hi_res",
    "pdf_infer_table_structure": "true",
    "skip_infer_table_types": ['jpg', 'png']
}

file_path = "IF10244.pdf"

# Open the PDF in a context manager so the handle is closed even if the request fails
with open(file_path, 'rb') as pdf_file:
    response = requests.post(url, headers=headers, data=data, files={'files': pdf_file})

json_response = response.json()

print(json_response)

Environment:

Additional context:

Log from the Docker image:

2023-11-15 23:42:13 2023-11-16 02:42:13,812 unstructured_api INFO Started Unstructured API
2023-11-15 23:42:13 2023-11-16 02:42:13,813 uvicorn.error INFO Started server process [7]
2023-11-15 23:42:13 2023-11-16 02:42:13,813 uvicorn.error INFO Waiting for application startup.
2023-11-15 23:42:13 2023-11-16 02:42:13,813 uvicorn.error INFO Application startup complete.
2023-11-15 23:42:13 2023-11-16 02:42:13,813 uvicorn.error INFO Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
2023-11-15 23:42:38 2023-11-16 02:42:38,601 unstructured_api DEBUG pipeline_api input params: {"filename": "wildfire_stats.pdf", "response_type": "application/json", "m_coordinates": [], "m_encoding": [], "m_hi_res_model_name": [], "m_include_page_breaks": [], "m_ocr_languages": null, "m_pdf_infer_table_structure": ["true"], "m_skip_infer_table_types": ["jpg", "png"], "m_strategy": ["hi_res"], "m_xml_keep_tags": [], "languages": null, "m_chunking_strategy": [], "m_multipage_sections": [], "m_combine_under_n_chars": [], "new_after_n_chars": [], "m_max_characters": []}
2023-11-15 23:42:38 2023-11-16 02:42:38,601 unstructured_api DEBUG filetype: application/pdf
2023-11-15 23:42:38 2023-11-16 02:42:38,603 unstructured_api DEBUG partition input data: {"content_type": "application/pdf", "strategy": "hi_res", "ocr_languages": null, "coordinates": false, "pdf_infer_table_structure": true, "include_page_breaks": false, "encoding": null, "model_name": null, "xml_keep_tags": false, "skip_infer_table_types": "jpg", "languages": null, "chunking_strategy": null, "multipage_sections": true, "combine_under_n_chars": 500, "new_after_n_chars": 1500, "max_characters": 1500}
2023-11-15 23:42:39 2023-11-16 02:42:39,603 unstructured_inference INFO Reading PDF for file: /tmp/tmp8464xqm3 ...
2023-11-15 23:42:40 2023-11-16 02:42:40,033 unstructured_inference INFO Detecting page elements ...
2023-11-15 23:42:41 2023-11-16 02:42:41,364 unstructured_inference INFO Detecting page elements ...
2023-11-15 23:42:42 2023-11-16 02:42:42,670 unstructured_inference INFO Detecting page elements ...
2023-11-15 23:42:44 2023-11-16 02:42:44,153 unstructured INFO Processing entire page OCR with tesseract...
2023-11-15 23:42:49 Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
2023-11-15 23:42:49 - This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
2023-11-15 23:42:49 - This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
2023-11-15 23:42:49 2023-11-16 02:42:49,324 unstructured_inference INFO padding image by 20 for structure detection
2023-11-15 23:42:49 scripts/app-start.sh: line 5:     7 Segmentation fault      uvicorn prepline_general.api.app:app --log-config logger_config.yaml --host 0.0.0.0
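
The final log line shows the uvicorn process (PID 7) being killed by a segmentation fault. As a minimal sketch for confirming this from the host, the snippet below decodes a process exit status into a signal name. It assumes the common shell convention that a process killed by signal N reports exit status 128 + N, so SIGSEGV (signal 11) would appear as 139. The helper name `describe_exit_code` is hypothetical, and the container's actual exit status was not reported in this issue.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a process exit status, decoding the 128 + N signal convention.

    Hypothetical helper for illustration; not part of unstructured-api.
    """
    if code > 128:
        try:
            # Statuses above 128 conventionally mean "killed by signal (code - 128)"
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            # The remainder does not map to a signal known on this platform
            return f"exited with status {code} (unknown signal {code - 128})"
    return f"exited with status {code}"


# 139 is the assumed status for a segfault: 128 + 11 (SIGSEGV)
print(describe_exit_code(139))
```

The status itself could be read from a stopped container with `docker inspect --format '{{.State.ExitCode}}' <container>`.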

PDF used: IF10244.pdf | wildfire_stats

awalker4 commented 10 months ago

Hi there, is this on a Mac? There is a segfault that happens on M1 chips. At the moment we aren't pursuing this because the hardware is unsupported. If this is the case for you, we recommend running on a cloud instance. See also https://github.com/Unstructured-IO/unstructured-api/issues/275
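
One quick way to answer the question above is to check the host platform from Python. This is a rough sketch, not part of the project: the function name `is_apple_silicon` is hypothetical, and it assumes the check runs on the host machine rather than inside the container (where the system reports as Linux). A Python interpreter running under Rosetta may also report `x86_64`, so a `False` result is not conclusive.

```python
import platform


def is_apple_silicon(system=None, machine=None):
    """Return True if the platform looks like an Apple Silicon (e.g. M1) Mac.

    Hypothetical helper; arguments default to the values reported by the
    running interpreter and can be overridden for testing.
    """
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    # macOS reports "Darwin"; Apple Silicon reports the "arm64" architecture
    return system == "Darwin" and machine == "arm64"


if is_apple_silicon():
    print("Host looks like an Apple Silicon Mac; the segfault may be the M1 issue.")
else:
    print("Host does not look like an Apple Silicon Mac.")
```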

awalker4 commented 10 months ago

Closing, but feel free to reopen if this isn't a Mac issue.