Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/right2left_pdf_output #3232

Open DsDastgheib opened 1 week ago

DsDastgheib commented 1 week ago

Describe the bug

The output of the PDF partitioner for right-to-left languages is incorrect.

To Reproduce

I downloaded a sample PDF from this link, then ran the following code:

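from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

# Client setup was not shown in the original report; the API key is a placeholder.
client = UnstructuredClient(api_key_auth="YOUR_API_KEY")

filename = "Path_to_the_sample_pdf_file"

# Read the PDF as bytes and wrap it in the SDK's file model.
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(files=files)

try:
    resp = client.general.partition(req)
    print(resp)
except SDKError as e:
    print(e)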

I got the following output (truncated):

PartitionResponse(content_type='application/json', status_code=200, raw_response=<Response [200]>, elements=[{'type': 'Header', 'element_id': '4e8ada3c22ab6f719d3a16379b9d2ca5', 'text': 'See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/381042047', 'metadata': {'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'filename': 'drbarh.pdf'}}, {'type': 'Title', 'element_id': '193c5b2dbecb6826b3e4d0ad1a37e699', 'text': 'ﻲﻣدﺎﻛآ و هﺮﻣزور ﻲﮔﺪﻧز رد بﻮﺧ عوﺮﺷ ﻚﻳ هرﺎﺑرد', 'metadata': {'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'parent_id': '4e8ada3c22ab6f719d3a16379b9d2ca5', 'filename': 'drbarh.pdf'}}, {'type': 'NarrativeText', 'element_id': 'a632662d5c3182a47e0a547204c7a311', 'text': 'Article · June 2024', 'metadata': {'filetype': 'application/pdf',

Expected behavior

The text should read as follows (the output appears to be reversed):

دربارهٔ یک شروع خوب در زندگی روزمره و آکادمی

Additional context

The problem affects the text of the whole document, and changing the language setting does not help.
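
To illustrate what "reversed" means here, the following is a rough diagnostic sketch, not a proposed fix. It assumes the returned string is in visual order and uses Arabic presentation-form code points, so it reverses the characters and applies NFKC normalization with Python's standard `unicodedata` module. The helper name `visual_to_logical` is hypothetical and is not part of `unstructured`.

```python
import unicodedata


def visual_to_logical(text: str) -> str:
    """Reverse the character order, then fold presentation forms with NFKC.

    Hypothetical helper for diagnosis only. A plain reversal also flips
    any left-to-right runs (digits, Latin words), so it is not safe for
    mixed-direction text.
    """
    return unicodedata.normalize("NFKC", text[::-1])


# 'Title' text copied from the partition output in this report.
title = "ﻲﻣدﺎﻛآ و هﺮﻣزور ﻲﮔﺪﻧز رد بﻮﺧ عوﺮﺷ ﻚﻳ هرﺎﺑرد"
print(visual_to_logical(title))
```

If the assumption holds, the printed string should read close to the expected title, although it may still differ in details such as letter variants or diacritics that normalization does not restore.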

MthwRobinson commented 1 hour ago

Hi @DsDastgheib - thanks for the report, we'll take a look at this. cc: @leah1985