New reverse-existence implementation

ferdnyc commented 6 months ago

This PR formally submits the implementation of proselint.tools.reverse_existence_check which I describe in https://github.com/amperser/proselint/issues/1334#issuecomment-2100920036. It still builds on the great work @vqle23 did in #1351 (in fact, it builds on the branch from that PR, rebased to current main and with my changes added as followup commits.)

A few things have changed from what I described in that issue comment:

While the utility function in proselint.tools is still named reverse_existence_check(), I renamed the actual checkers to restricted.top1000 and restricted.elementary, as I felt that was a less unwieldy title/category name that doesn't sacrifice any accuracy/descriptiveness.
I tweaked the word regex slightly, to avoid capturing hyphens or apostrophes on either extreme of the word. (So, my arbitrarily-chosen 3-character minimum length turned out to be prescient, as the regex ended up looking like this: re.compile(r"\w[\w'-]+\w")
Both versions of the allowed_word helper function (case-sensitive and not) are now defined statically in the tools.py file. The proper one is selected and bound with a functools.partial() at the start of each reverse_existence_check() call.

(...And I just noticed, I left some type annotations on the arguments to the helpers, that I'll probably have to take out to appease older Pythons. Bother.)

Fixes: #1334

ferdnyc commented 6 months ago

@Nytelife26

Oh, one thing I didn't address from the previous PR is your comment about the size of the word lists in the code.

I suppose we could move them to a JSON file, and json.load() the list in when the check runs. Possibly even a compressed JSON file.

To retain compatibility with zipfile packaging and the like, it'd probably be advisable to use importlib.resources to access the data (and make the backport a dependency for older Pythons) rather than relying on there being a filesystem to access. But that's not that big a deal, it's a solved problem at this point.

ferdnyc commented 6 months ago

I just sketched out a quick attempt at offloading the top1000 word list into a JSON file. Uncompressed, top1000.json would be 8.9k, and reduce the size of top1000.py from 14k to 850 bytes. (That's including the code to load the word list from the file.)

ferdnyc commented 6 months ago

elementary.json would be 7.5k, and reduce elementary.py from 12k to around the same 850 bytes.

Nytelife26 commented 6 months ago

i appreciate your continued efforts to go the extra mile. since it's really just a list, wouldn't a csv or similar be more efficient than json?

Nytelife26 commented 6 months ago

it'd probably be advisable to use importlib.resources to access the data (and make the backport a dependency for older Pythons)

since importlib.resources was added in Python 3.7, and the oldest supported version of Python is 3.8, we don't need to worry about backports. i would be comfortable with using importlib if you think it'd be more compatible for packaging.

ferdnyc commented 6 months ago

@Nytelife26 Unfortunately, the modern interface to importlib.resources (importlib.resources.file()) didn't come along until Python 3.9. But there's the importlib-resources backport package that can pull it in for Python 3.8.

It's just a matter of adding a dependency (I can never remember the syntax..):

importlib-resources = "^6.0"; python_version < "3.9"
# Wait, actually I think it's...
importlib-resources = "^6.0; python_version < 3.9"

ferdnyc commented 6 months ago

Hmm, turning the list into a .csv file with zero structure whatsoever makes top1000.csv 5.9k, as opposed to the JSON file's 8.9k.

That's for a file that looks like this, which I hate (but I don't have to read it, Python does):

a,able,about,above,accept,across,act,actually,add,admit,afraid,after,afternoon,again,against,age,ago,agree,ah,ahead,air,all,allow,almost,alone,along,already,alright,also,although,always,am,amaze,an,an

The JSON file looks like this, technically I could squeeze it tighter by telling it to lose the spaces between each item:

["a", "able", "about", "above", "accept", "across", "act", "actually", "add", "admit", "afraid", "after", "afternoon", "again", "against", "age", "ago", "agree", "ah", "ahead", "air", "all", "allow",

Going with CSV also means dealing with csv.py and the csv.reader(), which as standard library modules go is abominable. json.load() is simplicity itself. But, it's just 2-4 lines of code either way, so NBD. And the file is a lot smaller.

ferdnyc commented 6 months ago

It's just a matter of adding a dependency (I can never remember the syntax..):
importlib-resources = "^6.0"; python_version < "3.9"
# Wait, actually I think it's...
importlib-resources = "^6.0; python_version < 3.9"

I was doubly wrong, with poetry it's apparently:

importlib-resources = { version = "^6.0", python = "<3.9" }

Nytelife26 commented 6 months ago

That's for a file that looks like this, which I hate (but I don't have to read it, Python does)

using newlines as field delimiters instead of commas would improve readability and make diffs easier, without changing file size, if that helps.

you could argue with the validity of this, because it's comma separated values not newline separated values, but i would argue it makes sense to imagine them as records of one "words" field.

ferdnyc commented 6 months ago

Hmm, I suppose that'd make csv.reader() happier too, right now it reads the whole row in as a single list, which it then returns inside a one-item list (because it's a list of rows and there's only one row). That way the result would just be the list.

Nytelife26 commented 6 months ago

sort of. it still produces a list of lists (on my system at least), which should be trivial to flatten, but it's worth noting.

ferdnyc commented 6 months ago

sort of. it still produces a list of lists (on my system at least), which should be trivial to flatten, but it's worth noting.

Yeah, I ended up having to do this nonsense which is... whatever. I hate csv.py with a passion.

with files(proselint).joinpath(_CSV_PATH).open('r') as data:
    reader = csv.reader(data)
    wordlist = list()
    for row in reader:
        wordlist.extend(row)

EDIT: (You should've seen me trying to write the files, I kept ending up with output that looked like this...)

"a"
"a","b","l","e"
"a","b","o","u","t"

...And then cursing. A lot.

ferdnyc commented 6 months ago

@Nytelife26 I just realized I have no idea if those CSV files will actually be packaged into the source and/or binary distributions. I'll defer to your expertise on that one if that's OK. I tried to run poetry locally, it tried to get me to authenticate for something, then crashed my ssh-agent completely. So now I'm stuck typing my passphrase for every git transaction until I fix its mess. :roll_eyes:

Nytelife26 commented 6 months ago

it would be nice if python had a flatten function, like pretty much every other language i use, that's for sure. my latest commits should ease your pain somewhat.

also, having just tested the build system locally, and reviewing the configuration, all files in proselint/ including the new CSVs are included in the build.

Nytelife26 commented 6 months ago

i'm ready to merge this if you have no further comments, suggestions or changes to make.

ferdnyc commented 6 months ago

@Nytelife26 All looks good to me, thanks for all your help with this!

amperser / proselint

New reverse-existence implementation #1370