Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/Arabic OCR omits certain lines #3502

Open darrayes opened 1 month ago

darrayes commented 1 month ago

I am trying to parse an Arabic PDF, but the output skips certain lines of the text. I am attaching a page from the PDF; the lines that were missed are highlighted in boxes. [image]

christinestraub commented 1 month ago

Hi @darrayes

Can you provide the code snippet that produces this behavior?

darrayes commented 1 month ago

The code snippet is as follows:

import json
import os

from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared

os.environ["UNSTRUCTURED_API_KEY"] = "*****"
os.environ["UNSTRUCTURED_API_URL"] = "https://api.unstructured.io/general/v0/general"

client = UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
    server_url=os.getenv("UNSTRUCTURED_API_URL")
)

input_filepath = "test-arabic-llm-cleaned.pdf"
output_filepath = "text-file.txt"
with open(input_filepath, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=input_filepath
    )

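req = operations.PartitionRequest(
    # Pass the parameters by keyword; the request model expects `partition_parameters`.
    partition_parameters=shared.PartitionParameters(
        files=files,
        strategy=shared.Strategy.OCR_ONLY,
        languages=['ara'],
        chunking_strategy=shared.ChunkingStrategy.BY_TITLE,
        overlap=100,
    )
)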

try:
    res = client.general.partition(request=req)
    element_dicts = [element for element in res.elements]
    json_elements = json.dumps(element_dicts, indent=2)

    # Print the processed data.
    print(json_elements)

    # Write the processed data to a local file.
    with open(output_filepath, "w") as file:
        file.write(json_elements)
except Exception as e:
    print(e)
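
To narrow down which lines the OCR drops, one option is to compare the returned elements against lines copied by hand from the source page. The following is only a rough sketch, not part of the original report: `find_missing_lines`, `normalize`, and the sample data are hypothetical, and it assumes each element is a dict with a `"text"` key, as in the JSON printed above.

```python
import json
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC and collapse whitespace so equivalent strings compare equal."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def find_missing_lines(elements, expected_lines):
    """Return the expected lines that do not appear in the combined element text."""
    extracted = normalize(" ".join(el.get("text", "") for el in elements))
    return [line for line in expected_lines if normalize(line) not in extracted]


# Hypothetical sample data standing in for the API response and the source page.
sample_elements = [
    {"type": "NarrativeText", "text": "مرحبا بالعالم"},
    {"type": "NarrativeText", "text": "هذا سطر ثان"},
]
expected = ["مرحبا بالعالم", "هذا سطر ثان", "سطر مفقود"]

missing = find_missing_lines(sample_elements, expected)

# ensure_ascii=False keeps the Arabic readable instead of \uXXXX escapes.
print(json.dumps(missing, ensure_ascii=False, indent=2))
```

In the snippet above, `element_dicts` could be passed in place of `sample_elements`. The substring match is deliberately simple; OCR output with different diacritics or letter forms would need fuzzier matching.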