print("\n\n".join([str(el) for el in elements]))
print(f"n_of_elements: {len(elements)}")
- unstructured repo
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="algebra-graph-level1-1.pdf",
    strategy="fast",
)
print("\n\n".join([str(el) for el in elements]))
print(f"n_of_elements: {len(elements)}")
**Expected behavior**
Both repositories should produce the same PDF text extraction output for the same file.
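A minimal sketch of how the two outputs could be compared, assuming each result has already been reduced to a list of strings (for example with `[str(el) for el in elements]`). `diff_element_texts` is a hypothetical helper introduced here for illustration; it is not part of either repository.

```python
import difflib


def diff_element_texts(inference_texts, unstructured_texts):
    """Return unified-diff lines between two lists of element texts.

    An empty list means both extractions produced identical elements
    in the same order.
    """
    return list(
        difflib.unified_diff(
            inference_texts,
            unstructured_texts,
            fromfile="unstructured-inference",
            tofile="unstructured",
            lineterm="",  # inputs are plain strings, not newline-terminated lines
        )
    )
```

Calling it with the element strings from both snippets would list every element that appears in only one output, which also makes a difference in `n_of_elements` easy to spot.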
**Additional context**
The logic for extracting PDF text using `pdfminer` was implemented separately in each repository.
**Describe the bug**
PDF text extraction by `pdfminer` works differently in the unstructured repo and the unstructured-inference repo.
**To Reproduce**
PDF: algebra-graph-level1-1.pdf
- unstructured-inference repo
from unstructured_inference.inference.layout import process_file_with_model

layout = process_file_with_model(
    filename="algebra-graph-level1-1.pdf",
    model_name=None,
)
elements = layout.pages[0].elements
print("\n\n".join([str(el) for el in elements]))
print(f"n_of_elements: {len(elements)}")