SuffolkLITLab / FormFyxer

A tool for learning about and pre-processing forms
MIT License

Can we use heuristics to guess a form's title? #76

Closed (nonprofittechy closed 1 year ago)

nonprofittechy commented 2 years ago

It would be nice to have an API to get the title of a form even if it's not present in the metadata around a form on the page.

BryceStevenWilley commented 2 years ago

I don't think there's any consistency in form titles. Any heuristic we make will be better than nothing, I guess, but it still ends up being a 50-50 shot between the title and something else on every form.

A small sample of 2 states that do have consistency across all of their statewide forms:

MA forms have titles that are clearly the first thing on the page. (Screenshot from 2022-12-01 16-18-49)

IL forms, however, have centered titles that sit on the 2nd or 3rd line of text down. (Screenshot from 2022-12-01 16-16-57)

Without being able to compare against other forms from the jurisdiction, I'm not sure what we could do. I guess a filter on all text lines that have the state name in them might help, and if they all have the state name in them, then we just guess the shortest one.
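A rough sketch of that state-name filter, under one reading of the idea: lines that mention the state are treated as court or agency headers and skipped, and the shortest line is used only when every candidate mentions the state. The function name guess_title_by_state, its parameters, and the five-line window are all hypothetical, not part of FormFyxer.

```python
from typing import List, Optional


def guess_title_by_state(
    lines: List[str], state_names: List[str], max_lines: int = 5
) -> Optional[str]:
    """Guess a form title from the first few non-empty lines of its text."""
    # Assumption: the title sits near the top, so only look at a few lines.
    candidates = [ln.strip() for ln in lines if ln.strip()][:max_lines]
    if not candidates:
        return None

    def mentions_state(line: str) -> bool:
        lowered = line.lower()
        return any(name.lower() in lowered for name in state_names)

    # Assumption: lines naming the state are headers rather than titles.
    without_state = [ln for ln in candidates if not mentions_state(ln)]
    if without_state:
        return without_state[0]

    # Every candidate names the state: fall back to the shortest one.
    return min(candidates, key=len)
```

This would still pick the wrong line whenever a header that doesn't name the state comes before the title, so it probably doesn't get us past the 50-50 problem on its own.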

colarusso commented 1 year ago

We had a simple "use first line" heuristic for when no name is provided. However, I just added a function, guess_form_name(), which uses GPT-3 to propose a name based on the full text of the form. I haven't done extensive testing, but it did okay on the one random form I tested, and I think it passed the better-than-nothing test. So I went ahead and integrated it with the parse_form() function. See https://github.com/SuffolkLITLab/FormFyxer/commit/416d51f2547153af95be315ff2d49f51dada6bee
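For anyone skimming, the "use first line" fallback amounts to something like the sketch below. This is only an illustration, not the actual FormFyxer code: fallback_title and its parameters are hypothetical names, and the GPT-3 call inside guess_form_name() is not shown.

```python
from typing import Optional


def fallback_title(text: str, provided_name: Optional[str] = None) -> Optional[str]:
    """Return the provided name, or else the first non-empty line of the text."""
    # Prefer an explicitly provided name when there is one.
    if provided_name and provided_name.strip():
        return provided_name.strip()

    # Otherwise fall back to the first non-empty line of the form's text.
    for line in text.splitlines():
        if line.strip():
            return line.strip()

    # Nothing usable in the text at all.
    return None
```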