Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug: different PDF text extraction output from `unstructured` and `unstructured-inference` libraries #2164

Closed christinestraub closed 11 months ago

christinestraub commented 11 months ago

**Describe the bug**
PDF text extraction with `pdfminer` produces different output in the `unstructured` repo and the `unstructured-inference` repo.

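**To Reproduce**
PDF: algebra-graph-level1-1.pdf

- unstructured-inference repo

      # Import path assumed; the original report omitted the imports.
      from unstructured_inference.inference.layout import process_file_with_model

      layout = process_file_with_model(
          filename="algebra-graph-level1-1.pdf",
          model_name=None,
      )

      # Only the first page's elements are inspected.
      elements = layout.pages[0].elements

      print("\n\n".join([str(el) for el in elements]))
      print(f"n_of_elements: {len(elements)}")

- unstructured repo

      # Import path assumed; the original report omitted the imports.
      from unstructured.partition.pdf import partition_pdf

      elements = partition_pdf(
          filename="algebra-graph-level1-1.pdf",
          strategy="fast",
      )

      print("\n\n".join([str(el) for el in elements]))
      print(f"n_of_elements: {len(elements)}")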


**Expected behavior**
The PDF text extraction output should be the same.
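
One way to check this expectation is to diff the stringified elements from both runs. The following is a minimal sketch using only the standard library. The helper name `diff_element_texts` and the sample strings are hypothetical placeholders and are not taken from either repo or from the PDF above; in practice each list would be `[str(el) for el in elements]` from the corresponding snippet.

```python
import difflib


def diff_element_texts(inference_texts, unstructured_texts):
    """Return unified-diff lines between two lists of element strings.

    An empty list means both extractions produced identical text.
    """
    return list(
        difflib.unified_diff(
            inference_texts,
            unstructured_texts,
            fromfile="unstructured-inference",
            tofile="unstructured",
            lineterm="",
        )
    )


# Placeholder data standing in for [str(el) for el in elements] from each repo.
sample_inference = ["Heading", "first line second line"]
sample_unstructured = ["Heading", "first line", "second line"]

sample_diff = diff_element_texts(sample_inference, sample_unstructured)
print("\n".join(sample_diff) or "outputs match")
```

An empty result would indicate that both code paths extract the same text; any `-` or `+` lines point at the elements where they diverge.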

**Additional context**
The logic for extracting PDF text using `pdfminer` was implemented separately in both repositories.

christinestraub commented 11 months ago

Completed via #2158