SuffolkLITLab / FormFyxer

A tool for learning about and pre-processing forms
MIT License

Feature to analyze vocabulary? #60

Closed: nonprofittechy closed this 2 years ago

nonprofittechy commented 2 years ago

We do a readability analysis, but would it also be helpful to break that down into a vocabulary analysis, like the kind done here: http://www4.caes.hku.hk/vocabulary/profile.htm? That tool checks whether each word is in the top 1,000 words, the top 2,000 words, or the academic word lists.

I didn't quickly find a relevant Python module that does this. However, one question that you might be able to answer, @BryceStevenWilley, is whether this is totally redundant with Dale-Chall. It seems to me like it answers something a little more granular, though.

BryceStevenWilley commented 2 years ago

I'd say the bulk is redundant with Dale-Chall. I hadn't heard of the Academic/University word lists, but they seem to mostly be the next most complicated tier.

Caroline had mentioned her interest in making something equivalent to REALM (full list) for the legal field. This isn't exactly the same thing, but a starting point for that would be to gather a list of legal terms, both Latin and their English equivalents. That would be a great list to compare against.

Checking against any of these lists would be pretty simple Python to write, doing either a slow "scan the whole list to see if the word is in it" or a slightly better "check a trie built from the list". Deciding on the lists is the bulk of the thinking and work IMO.
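
As a rough illustration of the kind of check described above, here is a minimal sketch of a tiered vocabulary profile. Everything in it is an assumption for illustration: the `TOP_1000`, `TOP_2000`, and `ACADEMIC` sets are tiny hypothetical stand-ins for real word lists, and `vocabulary_profile` is not an existing FormFyxer function. It uses Python's built-in `set`, which gives hashed membership tests, rather than a linear scan or a trie.

```python
import re

# Hypothetical placeholder lists; real ones would be loaded from files
# once the actual word lists have been chosen.
TOP_1000 = {"the", "you", "must", "court", "with", "a", "to"}
TOP_2000 = {"file", "sign"}
ACADEMIC = {"respond", "document"}

# Tiers are checked in order, so a word is counted in the first list it hits.
TIERS = [("top_1000", TOP_1000), ("top_2000", TOP_2000), ("academic", ACADEMIC)]


def vocabulary_profile(text):
    """Count how many words in `text` fall into each tier, or into no list."""
    counts = {name: 0 for name, _ in TIERS}
    counts["off_list"] = 0
    # Very rough tokenizer: lowercase runs of letters and apostrophes.
    for word in re.findall(r"[a-z']+", text.lower()):
        for name, word_list in TIERS:
            if word in word_list:  # hashed set lookup
                counts[name] += 1
                break
        else:
            # No tier matched this word.
            counts["off_list"] += 1
    return counts


profile = vocabulary_profile("You must file the affidavit with the court.")
```

With these placeholder lists, "affidavit" is the only word that lands in `off_list`, which is the sort of term a legal word list would be meant to catch. A real version would probably also need to handle inflected forms (for example "files" or "signed"), which this sketch ignores.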