Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/Two Column PDF partition result in incorrect text. #3325

Open pfcharles opened 2 days ago

pfcharles commented 2 days ago

Describe the bug
When running partition on a two-column PDF, text extraction puts characters in the wrong position.

To Reproduce
two_col.pdf

Code snippet that reproduces the issue:
elements = partition("two_col.pdf", strategy="fast")

text attribute of elements[2] = '1. Exchange of Information. The parties agree to exchange Confidential Information for the purpose of (the evaluating a potential business "Purpose") in accordance with this Agreement.'
text attribute of elements[3] = 'relationship'

Actual text from the PDF = '1.Exchange of Information. The parties agree to exchange Confidential Information for the purpose of evaluating a potential business relationship (the "Purpose") in accordance with this Agreement.'

two_col.json

Expected behavior
The extracted text matches the actual text.
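To make the mismatch easier to see, here is a minimal sketch that compares the extracted text with the expected text word by word. It uses only the standard library and the sentence fragments quoted above. The helper names `normalize` and `word_diff` are made up for illustration and are not part of unstructured.

```python
import difflib

# Fragments copied from the report above (tail of the affected sentence).
EXTRACTED = ('for the purpose of (the evaluating a potential business '
             '"Purpose") in accordance with this Agreement.')
EXPECTED = ('for the purpose of evaluating a potential business relationship '
            '(the "Purpose") in accordance with this Agreement.')


def normalize(text: str) -> str:
    """Collapse runs of whitespace so that only word order is compared."""
    return " ".join(text.split())


def word_diff(extracted: str, expected: str) -> list:
    """Return ndiff lines ('- word' / '+ word') for words that differ."""
    a = normalize(extracted).split(" ")
    b = normalize(expected).split(" ")
    # Keep only removals and additions; drop unchanged words and '?' hint lines.
    return [line for line in difflib.ndiff(a, b) if line[0] in "+-"]


for line in word_diff(EXTRACTED, EXPECTED):
    print(line)
```

An empty result means the word order matches. For the fragments above the result is non-empty, and it includes the word "relationship", which ended up in a separate element.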

Screenshots image

Environment Info
OS version: macOS-14.5-arm64-arm-64bit
Python version: 3.9.6
unstructured version: 0.14.9
unstructured-inference version: 0.7.36
pytesseract version: 0.3.10
Torch version: 2.3.1
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.41 magic file from /usr/share/file/magic
LibreOffice version: ==> libreoffice: 24.2.4


pfcharles commented 2 days ago

Looks like this is a problem with the underlying use of the pdfminer library. The data returned by the get_text() method of the pdfminer.layout.LTTextBoxHorizontal object in pdf.py is wrong.
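One way to check this outside of unstructured is to ask pdfminer.six directly for its horizontal text boxes and print what get_text() returns for each. This is a rough sketch, assuming pdfminer.six is installed. The helper name `iter_text_boxes` is made up for illustration, and the code uses pdfminer's default layout parameters, which may differ from what unstructured configures.

```python
def iter_text_boxes(pdf_path):
    """Yield (page_number, bbox, text) for each horizontal text box pdfminer finds."""
    # Imported lazily so this sketch can be loaded without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextBoxHorizontal

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            if isinstance(element, LTTextBoxHorizontal):
                # bbox is (x0, y0, x1, y1) in PDF points.
                yield page_number, element.bbox, element.get_text()


# Example usage (needs the attached file next to the script):
# for page_number, bbox, text in iter_text_boxes("two_col.pdf"):
#     print(page_number, [round(v) for v in bbox], repr(text))
```

If the misplaced words already show up in this output, the problem is in pdfminer's layout analysis rather than in unstructured's post-processing.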

pfcharles commented 1 day ago

two_col_not_justified.pdf

This appears to be related to the document's text being justified, which leaves larger spaces between words. The issue appears to come from the implementation of find_neighbors in the pdfminer layout code. To some extent this can be controlled by the LAParams initialized in init_pdfminer. Other libraries such as PyPDF and (Java) PDFBox handle this document with no issue or special configuration.
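To experiment with the layout parameters, the sketch below calls pdfminer.six directly and tries a few values of `char_margin`, checking each time whether the expected phrase comes out intact. It assumes pdfminer.six is installed. Choosing `char_margin` as the knob is my guess, not something confirmed in this thread, and the helper names `contains_phrase` and `sweep_char_margin` are made up for illustration. It does not go through unstructured's init_pdfminer.

```python
def contains_phrase(text: str, phrase: str) -> bool:
    """Whitespace-insensitive check that `phrase` appears in `text`."""
    return " ".join(phrase.split()) in " ".join(text.split())


def sweep_char_margin(pdf_path, phrase, margins=(2.0, 4.0, 8.0)):
    """Map each char_margin value to whether `phrase` was extracted intact."""
    # Imported lazily so this sketch can be loaded without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    results = {}
    for margin in margins:
        # pdfminer.six's default char_margin is 2.0; larger values merge
        # characters that sit further apart into the same line.
        text = extract_text(pdf_path, laparams=LAParams(char_margin=margin))
        results[margin] = contains_phrase(text, phrase)
    return results


# Example usage (needs the attached file next to the script):
# print(sweep_char_margin(
#     "two_col.pdf",
#     "evaluating a potential business relationship (the",
# ))
```

A larger margin may also start merging text across the gap between the two columns, so any value that helps on this document would need checking against other layouts.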