Unstructured-IO / unstructured-api

Chipperv2 outputs incorrect table structure and text #319

Closed six5532one closed 3 months ago

six5532one commented 9 months ago

Describe the bug

Attached documents: Yokebe.pdf, Bebevita.pdf, Mykoforte.pdf

To Reproduce

See the attached documents. A user called the hosted API with the chipperv2 model. They also tried setting "languages" to "['deu']" and "OCR_AGENT" to "paddle" but noticed no difference. Here is their code:

import requests

# API key elided in the original report
unstructured_api_key = '.............'
unstructured_api_headers = {
    "accept": "application/json",
    "unstructured-api-key": unstructured_api_key,
}

unstructured_api_url = "https://api.unstructured.io/general/v0/general"

data = {
    "strategy": "hi_res",
    "pdf_infer_table_structure": "true",
    "hi_res_model_name": "yolox",  # change to "chipperv2" to reproduce
    "languages": "['eng']",
}

file_path = "..............."

# Open the PDF in a context manager so the handle is closed after the upload
with open(file_path, 'rb') as f:
    response = requests.post(url=unstructured_api_url,
                             files={'files': f},
                             data=data,
                             headers=unstructured_api_headers)
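
To compare table output between "yolox" and "chipperv2", it may help to pull only the table elements out of each response. The sketch below is not part of the user's code. It assumes the API returns a JSON list of element dicts with a "type" field, a "text" field, and, for tables, an HTML rendering under metadata["text_as_html"]. The helper name `extract_tables` and the sample data are hypothetical and only illustrate that assumed shape.

```python
from typing import Any, Dict, List


def extract_tables(elements: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Collect the text and HTML of every element whose type is "Table"."""
    tables = []
    for element in elements:
        if element.get("type") != "Table":
            continue
        # metadata may be missing or None, so fall back to an empty dict
        metadata = element.get("metadata") or {}
        tables.append({
            "text": element.get("text", ""),
            "html": metadata.get("text_as_html", ""),
        })
    return tables


# Hypothetical sample mimicking the assumed response shape (not real API output)
sample_elements = [
    {"type": "Title", "text": "Product overview", "metadata": {}},
    {
        "type": "Table",
        "text": "Nutrient Amount",
        "metadata": {"text_as_html": "<table><tr><td>Nutrient</td><td>Amount</td></tr></table>"},
    },
]

tables = extract_tables(sample_elements)
for table in tables:
    print(table["text"])
    print(table["html"])
```

With a live response, the same helper could be called as `extract_tables(response.json())` once per model, and the two results compared side by side.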

MthwRobinson commented 3 months ago

Closing this because Chipper is only supported in the SaaS API.