Closed by nonprofittechy 1 year ago
I don't think there's any consistency in form titles. Any heuristic we make will be better than nothing, I guess, but it still ends up being a 50-50 shot between the title and something else on every form.
A small sample of 2 states that do have consistency in all of the statewide forms:
MA forms have titles that are clearly the first thing on the page.
IL forms, however, have centered titles that sit on the 2nd or 3rd line of text.
Without being able to compare against other forms from the jurisdiction, I'm not sure what we could do. I guess a filter on all text lines that have the state name in them might help, and if they all have the state name in them, then just guess the shortest one.
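A rough sketch of that state-name filter, in case it helps the discussion. It assumes one reading of the idea: lines that mention the state name are treated as court or agency headers and skipped, the first remaining line is taken as the title, and if every line mentions the state, the shortest one is the guess. The function name guess_title_from_lines and its parameters are hypothetical and not part of FormFyxer.

```python
from typing import List, Optional


def guess_title_from_lines(lines: List[str], state_name: str) -> Optional[str]:
    """Hypothetical helper: guess a form title from its first few text lines.

    Assumption: lines containing the state name are headers, not titles.
    """
    # Drop blank lines and surrounding whitespace.
    candidates = [ln.strip() for ln in lines if ln.strip()]
    if not candidates:
        return None

    state = state_name.lower()
    # Keep only the lines that do NOT mention the state name.
    without_state = [ln for ln in candidates if state not in ln.lower()]
    if without_state:
        # The first non-header line is the best guess for the title.
        return without_state[0]

    # Every line mentions the state: fall back to the shortest one.
    return min(candidates, key=len)


if __name__ == "__main__":
    sample = ["IN THE CIRCUIT COURT OF ILLINOIS", "Request for Name Change", "Case No."]
    print(guess_title_from_lines(sample, "Illinois"))
```

This would still miss forms where a non-title header line lacks the state name, so it is only a slight improvement over "use the first line".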
We had a simple "use the first line if no name is provided" heuristic. However, I just added a function, guess_form_name(), which uses GPT-3 to propose a name based on the full text of the form. I haven't done extensive testing, but it did okay on the one random form I tested, so I think it passes the better-than-nothing test. I went ahead and integrated it with the parse_form() function. See https://github.com/SuffolkLITLab/FormFyxer/commit/416d51f2547153af95be315ff2d49f51dada6bee
It would be nice to have an API to get the title of a form even if it's not present in the metadata around a form on the page.