SuffolkLITLab / FormFyxer

A tool for learning about and pre-processing forms
MIT License

Actually grab all of the text in Figures #92

Closed BryceStevenWilley closed 1 year ago

BryceStevenWilley commented 1 year ago

Several forms (e.g. the Affidavit of Indigency) have the entire contents of the PDF inside a PDF Figure for some reason. Because pdfminer doesn't run layout analysis inside Figures by default, it doesn't get any text from these forms.

This fixes that by using pdfminer's all_texts layout parameter to look inside Figures, and by recursively unnesting the boxes in those Figures to get just the horizontal text lines.

Will merge after #91 is merged.