Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
8.66k stars 707 forks

fix: `partition_pdf()` removes spaces from the text #3106

Closed christinestraub closed 4 months ago

christinestraub commented 4 months ago

Closes #2896.

This PR aims to fix `partition_pdf()` so that it keeps spaces in the extracted text. When inferred and embedded elements are merged, the control character `\t` is now replaced with a space instead of being removed.
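To illustrate the difference, here is a minimal sketch of the two behaviors. The helper names and the regular expression are hypothetical and are not taken from the library's code; they only show why dropping `\t` fuses adjacent words while replacing it with a space keeps them apart.

```python
import re

# Hypothetical pattern: ASCII control characters, tab included, newline excluded.
# This is an illustrative assumption, not the pattern used by unstructured.
_CONTROL_CHARS = re.compile(r"[\x00-\x09\x0b-\x1f\x7f]")


def clean_dropping_tabs(text: str) -> str:
    """Remove every control character, tabs included (words can fuse)."""
    return _CONTROL_CHARS.sub("", text)


def clean_keeping_spaces(text: str) -> str:
    """Turn tabs into spaces first, then remove the remaining control characters."""
    return _CONTROL_CHARS.sub("", text.replace("\t", " "))


sample = "Total\tassets\x00"
print(repr(clean_dropping_tabs(sample)))   # 'Totalassets'
print(repr(clean_keeping_spaces(sample)))  # 'Total assets'
```

In this sketch the order matters: the tab must be converted before the generic control-character removal runs, otherwise the word boundary is lost.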

Testing

PDF: rok_20230930_1-1.pdf

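from unstructured.partition.pdf import partition_pdf

# Partition the test PDF with the hi_res strategy.
elements = partition_pdf(
    filename="rok_20230930_1-1.pdf",
    strategy="hi_res",
)

# Print the text of one element to check that spaces are preserved.
print(str(elements[20]))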

Results: