SuffolkLITLab / FormFyxer

A tool for learning about and pre-processing forms
MIT License

Add suggestions to remove complex and gendered terms from text of forms #101

Closed. nonprofittechy closed this 1 year ago

nonprofittechy commented 1 year ago

Fix #97 Fix #98

I added a generalized function, substitute_phrases, which accepts a table of bad: good phrases and a sentence, and replaces the "bad" terms with the "good" ones. It includes small implementations for plainlanguage.gov and for a list of gendered terms that I found. Both lists required a fair amount of editing because our tool doesn't have any judgment or context. This lays the groundwork for more plain-language glossary substitutions with bigger word lists, but it seems to be performing pretty well already.

This also provides a hook for a version of highlight_text in https://github.com/SuffolkLITLab/docassemble-PDFStats that actually works. I also adjusted the passive voice detection function to return its results in a format compatible with highlight_text.

One thing to keep in mind when adding new terms is that they need to be ordered in the YAML file from longest to shortest.
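To illustrate why the order matters, here is a minimal sketch. It is not the actual substitute_phrases implementation; the helper name naive_substitute and the two-entry term table are hypothetical and exist only for this example. It assumes terms are applied one at a time in the order listed, so a short term that is a prefix of a longer one can consume part of the longer phrase first.

```python
import re


def naive_substitute(text, terms):
    """Apply each bad -> good replacement in the order the table lists them.

    Hypothetical helper for illustration only; not FormFyxer's code.
    """
    for bad, good in terms.items():
        # \b keeps "prior" from matching inside a longer word.
        pattern = rf"\b{re.escape(bad)}\b"
        text = re.sub(pattern, good, text, flags=re.IGNORECASE)
    return text


# Hypothetical term tables: same entries, different order.
shortest_first = {"prior": "before", "prior to": "before"}
longest_first = {"prior to": "before", "prior": "before"}

print(naive_substitute("prior to filing", shortest_first))  # before to filing
print(naive_substitute("prior to filing", longest_first))   # before filing
```

With the shorter term first, "prior" is replaced before "prior to" gets a chance to match, which leaves a stray "to" behind.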

BryceStevenWilley commented 1 year ago

> One thing to keep in mind when adding new terms is that they need to be ordered in the YAML file from longest to shortest.

~Is that the case? I get what you're going for, but I don't think that's how the function is implemented.~ The function has other sorting problems; see https://github.com/SuffolkLITLab/FormFyxer/pull/101#discussion_r1164655505. ~If it is necessary that everything be order~ Since it is necessary for everything to be in length order, IMO we should sort in the code when we read it in; there shouldn't be a chance for something to go wrong at runtime because we made a mistake in the data listing. Additionally, it is confusing for one list to be sorted but not the other, as plocket noted.

nonprofittechy commented 1 year ago

I can reorder both input files by length; it's just a pain to do in the editor. A short Python script would do it, I guess.

I'm not sure we want the startup cost of reordering every time we load the file if it's already in the correct order. I can see that it would make maintenance easier, but it will slow down every page visit.

BryceStevenWilley commented 1 year ago

The Python to sort the file is just reversed(sorted(things_to_sort, key=len)). It takes ~15 microseconds to sort the list, which is fine. Opening and reading the file slows things down much more than sorting it.
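A minimal sketch of sorting at load time, under some assumptions: the helper name sort_longest_first and the sample terms are hypothetical, and the table is assumed to be a plain dict of bad: good phrases after the YAML is parsed. It uses sorted(..., reverse=True) rather than reversed(sorted(...)). Both give longest-first order, but reversed returns an iterator and also flips the relative order of equal-length entries, while reverse=True keeps ties in their original order.

```python
import timeit


def sort_longest_first(terms):
    """Return a new dict whose keys run from longest to shortest.

    Hypothetical helper; assumes `terms` maps bad phrases to good ones.
    """
    return dict(sorted(terms.items(), key=lambda item: len(item[0]), reverse=True))


# Hypothetical sample data standing in for the parsed YAML file.
terms = {"prior": "before", "in the event that": "if", "prior to": "before"}
ordered = sort_longest_first(terms)
print(list(ordered))  # ['in the event that', 'prior to', 'prior']

# Rough timing of the sort alone; the figure depends on list size and machine.
runs = 1000
elapsed = timeit.timeit(lambda: sort_longest_first(terms), number=runs)
print(f"{elapsed / runs * 1e6:.2f} microseconds per sort")
```

Sorting once when the file is read means a mis-ordered entry in the data file cannot change the substitution results.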