Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
8.66k stars 707 forks

feat/Retain text indentations in PDF files #3146

Closed ChiNoel-osu closed 4 months ago

ChiNoel-osu commented 4 months ago

Is your feature request related to a problem? Please describe.
All text indentation and some line breaks are lost when a PDF file is partitioned. If the PDF contains code blocks, the extracted elements can't represent them faithfully.

Describe the solution you'd like
Retain indentation and line breaks in documents.

Describe alternatives you've considered
I see that there's a related PR: https://github.com/Unstructured-IO/unstructured/pull/2428, but I'm not sure whether it addresses this.

Additional context
My test PDF is as simple as this: [image]

After partition_pdf(), the element's text looks like this: [image]
The output does not retain any of the formatting that might be crucial for understanding what a block of text means.
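To make the request concrete, here is a minimal sketch of what "retaining indentation" could mean in practice. It does not use unstructured itself. It assumes that each extracted line is available together with the x-coordinate of its left edge (in PDF points) and that the font is monospaced with a known character width. The function name `restore_indentation`, the `(left_x, text)` tuple shape, and the `char_width` parameter are all hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def restore_indentation(lines: List[Tuple[float, str]], char_width: float) -> str:
    """Rebuild indented text from (left_x, text) pairs.

    Assumption: left_x is the left edge of the line in PDF points and
    char_width is the width of one character in the same unit.
    """
    if not lines:
        return ""
    # Treat the leftmost line as having no indentation.
    left_margin = min(x for x, _ in lines)
    out = []
    for x, text in lines:
        # Convert the horizontal offset into a whole number of spaces.
        n_spaces = round((x - left_margin) / char_width)
        out.append(" " * n_spaces + text.strip())
    return "\n".join(out)


# Hypothetical line positions for a small code block in a PDF.
sample = [
    (72.0, "def greet(name):"),
    (96.0, "if name:"),
    (120.0, "print(name)"),
]
restored = restore_indentation(sample, char_width=6.0)
print(restored)
```

With proportional fonts the character width is not constant, so a real implementation would likely need a different way to map offsets to indentation.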

MthwRobinson commented 4 months ago

Hi @ChiNoel-osu - closing this one in favor of #711, which would track indent level as a metadata field.
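As a rough illustration of the metadata approach mentioned above, the sketch below assigns an integer indent level to each element instead of changing its text. It is not the design from #711 and not unstructured's API: the element shape (plain dicts with `text` and `left_x`), the `indent_level` key, and the `tolerance` parameter are assumptions made for this example.

```python
from typing import Dict, List


def add_indent_levels(elements: List[Dict], tolerance: float = 2.0) -> List[Dict]:
    """Attach an integer 'indent_level' to each element's metadata.

    Assumption: every element is a dict with 'text' and 'left_x' keys,
    where left_x is the left edge of the text in PDF points.
    """
    # Collect distinct left offsets, merging values within `tolerance` points.
    stops: List[float] = []
    for x in sorted(e["left_x"] for e in elements):
        if not stops or x - stops[-1] > tolerance:
            stops.append(x)

    result = []
    for e in elements:
        # The level is the index of the nearest indentation stop.
        level = min(range(len(stops)), key=lambda i: abs(stops[i] - e["left_x"]))
        result.append({"text": e["text"], "metadata": {"indent_level": level}})
    return result


# Hypothetical elements; the third line is slightly offset by extraction noise.
elements = [
    {"text": "for item in items:", "left_x": 72.0},
    {"text": "total += item", "left_x": 96.5},
    {"text": "print(total)", "left_x": 72.4},
]
tagged = add_indent_levels(elements)
levels = [e["metadata"]["indent_level"] for e in tagged]
print(levels)
```

Keeping the level in metadata leaves the element text untouched, so downstream consumers can decide whether and how to render the indentation.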