Is your feature request related to a problem? Please describe.
All text indentation and some line breaks are lost when a PDF file is partitioned. If the PDF contains code blocks, the extracted elements cannot represent them faithfully.
Describe the solution you'd like
Retain indentation and line breaks in the extracted element text.
Additional context
My test PDF is as simple as this:
After partition_pdf(), the element's text looks like this:
The text does not retain any of the formatting, which may be crucial for understanding what a block of text means.
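To illustrate why this matters for code blocks, here is a minimal sketch in plain Python. It does not call unstructured, and it is not the actual test PDF or the actual output; the sample text and both helper functions (collapse_whitespace, keep_layout) are hypothetical and only mimic the symptom and the requested behavior.

```python
import re

# Hypothetical sample: text as it might appear in a code block inside a PDF.
original = "def greet(name):\n    if name:\n        print(name)\n"


def collapse_whitespace(text: str) -> str:
    """Mimic the reported symptom: every run of whitespace becomes one space."""
    return re.sub(r"\s+", " ", text).strip()


def keep_layout(text: str) -> str:
    """Requested behavior: strip trailing spaces only, keep indentation and line breaks."""
    return "\n".join(line.rstrip() for line in text.splitlines())


flattened = collapse_whitespace(original)
preserved = keep_layout(original)

print(flattened)  # one line, block structure no longer recoverable
print(preserved)  # three lines, indentation intact
```

For an indentation-sensitive language like Python, the flattened form cannot be turned back into the original program, which is why keeping the layout matters.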
Describe alternatives you've considered
I see that there is a related PR: https://github.com/Unstructured-IO/unstructured/pull/2428, but I'm not sure whether it addresses this.