Open DsDastgheib opened 4 months ago
Describe the bug: The output of the PDF partitioner for right-to-left languages is incorrect.
To Reproduce: I downloaded a sample PDF from this link, then ran the following code:
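from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

# `client` is assumed to be an already-configured UnstructuredClient instance.
filename = "Path_to_the_sample_pdf_file"

# Read the PDF as bytes and wrap it for upload.
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(files=files)

try:
    resp = client.general.partition(req)
except SDKError as e:
    print(e)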
I got the following output (only part of it is shown):
PartitionResponse(content_type='application/json', status_code=200, raw_response=<Response [200]>, elements=[{'type': 'Header', 'element_id': '4e8ada3c22ab6f719d3a16379b9d2ca5', 'text': 'See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/381042047', 'metadata': {'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'filename': 'drbarh.pdf'}}, {'type': 'Title', 'element_id': '193c5b2dbecb6826b3e4d0ad1a37e699', 'text': 'ﻲﻣدﺎﻛآ و هﺮﻣزور ﻲﮔﺪﻧز رد بﻮﺧ عوﺮﺷ ﻚﻳ هرﺎﺑرد', 'metadata': {'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'parent_id': '4e8ada3c22ab6f719d3a16379b9d2ca5', 'filename': 'drbarh.pdf'}}, {'type': 'NarrativeText', 'element_id': 'a632662d5c3182a47e0a547204c7a311', 'text': 'Article · June 2024', 'metadata': {'filetype': 'application/pdf',
Expected behavior: The text should read as follows (the output appears to be reversed):
دربارهٔ یک شروع خوب در زندگی روزمره و آکادمی
Additional context: The problem affects the text of the whole document, and changing the language setting does not help.
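To illustrate the symptom, here is a minimal diagnostic sketch. It is not a fix in the library. It assumes the returned string is in visual (left-to-right) order and uses Arabic presentation-form code points, which is what the `Title` element above looks like. The helpers `has_presentation_forms` and `to_logical_order` are hypothetical names introduced here, not part of any SDK. Plain reversal is only a rough approximation: it would also reverse embedded Latin text or digits and misplace combining marks, and normalization will not restore Persian-specific letters or diacritics, so the result may not match the expected text exactly.

```python
import unicodedata


def has_presentation_forms(text: str) -> bool:
    """Hypothetical helper: True if any character lies in the Arabic
    Presentation Forms-A (U+FB50-U+FDFF) or -B (U+FE70-U+FEFF) blocks."""
    return any(
        0xFB50 <= ord(ch) <= 0xFDFF or 0xFE70 <= ord(ch) <= 0xFEFF
        for ch in text
    )


def to_logical_order(visual_text: str) -> str:
    """Hypothetical helper: reverse a visually ordered RTL string, then fold
    presentation forms back to base letters with NFKC normalization.

    Reversal happens first so that ligatures (one code point in the input)
    expand into letters that are already in logical order.
    """
    reversed_text = visual_text[::-1]
    return unicodedata.normalize("NFKC", reversed_text)


# Title text copied from the partial output above.
title = 'ﻲﻣدﺎﻛآ و هﺮﻣزور ﻲﮔﺪﻧز رد بﻮﺧ عوﺮﺷ ﻚﻳ هرﺎﺑرد'
print(has_presentation_forms(title))
print(to_logical_order(title))
```

If `has_presentation_forms` reports `True` for most text elements of a document, that suggests the characters were extracted in their rendered order rather than their logical order.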
Hi @DsDastgheib - thanks for the report, we'll take a look at this. cc: @leah1985