@Nytelife26 Oh, one thing I didn't address from the previous PR is your comment about the size of the word lists in the code.

I suppose we could move them to a JSON file, and `json.load()` the list in when the check runs. Possibly even a compressed JSON file.

To retain compatibility with zipfile packaging and the like, it'd probably be advisable to use `importlib.resources` to access the data (and make the backport a dependency for older Pythons) rather than relying on there being a filesystem to access. But that's not that big a deal; it's a solved problem at this point.
I just sketched out a quick attempt at offloading the top1000 word list into a JSON file. Uncompressed, `top1000.json` would be 8.9k, and reduce the size of `top1000.py` from 14k to 850 bytes. (That's including the code to load the word list from the file.) `elementary.json` would be 7.5k, and reduce `elementary.py` from 12k to around the same 850 bytes.
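(A minimal sketch of what those ~850 bytes of loader code could look like, assuming the JSON file ships as package data; the `_JSON_PATH` value here is hypothetical:)

```python
import json
from importlib.resources import files  # needs the importlib-resources backport on <3.9

import proselint

# Hypothetical resource path; the real module would define its own.
_JSON_PATH = "checks/restricted/top1000.json"

with files(proselint).joinpath(_JSON_PATH).open("r", encoding="utf-8") as data:
    wordlist = json.load(data)  # a JSON array deserializes directly to a Python list
```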
i appreciate your continued efforts to go the extra mile. since it's really just a list, wouldn't a csv or similar be more efficient than json?
> it'd probably be advisable to use `importlib.resources` to access the data (and make the backport a dependency for older Pythons)

since `importlib.resources` was added in Python 3.7, and the oldest supported version of Python is 3.8, we don't need to worry about backports. i would be comfortable with using `importlib` if you think it'd be more compatible for packaging.
@Nytelife26 Unfortunately, the modern interface to `importlib.resources` (`importlib.resources.files()`) didn't come along until Python 3.9. But there's the `importlib-resources` backport package that can pull it in for Python 3.8.

It's just a matter of adding a dependency (I can never remember the syntax..):

```
importlib-resources = "^6.0"; python_version < "3.9"
# Wait, actually I think it's...
importlib-resources = "^6.0; python_version < 3.9"
```
Hmm, turning the list into a `.csv` file with zero structure whatsoever makes `top1000.csv` 5.9k, as opposed to the JSON file's 8.9k.

That's for a file that looks like this, which I hate (but I don't have to read it, Python does):

```
a,able,about,above,accept,across,act,actually,add,admit,afraid,after,afternoon,again,against,age,ago,agree,ah,ahead,air,all,allow,almost,alone,along,already,alright,also,although,always,am,amaze,an,an
```

The JSON file looks like this; technically I could squeeze it tighter by telling it to lose the spaces between each item:

```json
["a", "able", "about", "above", "accept", "across", "act", "actually", "add", "admit", "afraid", "after", "afternoon", "again", "against", "age", "ago", "agree", "ah", "ahead", "air", "all", "allow",
```
Going with CSV also means dealing with `csv.py` and `csv.reader()`, which as standard library modules go is abominable. `json.load()` is simplicity itself. But it's just 2-4 lines of code either way, so NBD. And the CSV file is a lot smaller.
> It's just a matter of adding a dependency (I can never remember the syntax..):
>
> ```
> importlib-resources = "^6.0"; python_version < "3.9"
> # Wait, actually I think it's...
> importlib-resources = "^6.0; python_version < 3.9"
> ```

I was doubly wrong; with poetry it's apparently:

```toml
importlib-resources = { version = "^6.0", python = "<3.9" }
```
> That's for a file that looks like this, which I hate (but I don't have to read it, Python does)

using newlines as field delimiters instead of commas would improve readability and make diffs easier, without changing file size, if that helps.

you could argue with the validity of this, because it's comma separated values, not newline separated values, but i would argue it makes sense to imagine them as records of one "words" field.
Hmm, I suppose that'd make `csv.reader()` happier too; right now it reads the whole row in as a single list, which it then returns inside a one-item list (because it's a list of rows and there's only one row). That way the result would just be the list.
sort of. it still produces a list of lists (on my system at least), which should be trivial to flatten, but it's worth noting.
> sort of. it still produces a list of lists (on my system at least), which should be trivial to flatten, but it's worth noting.
Yeah, I ended up having to do this nonsense, which is... whatever. I hate `csv.py` with a passion.

```python
import csv
from importlib.resources import files

import proselint

# _CSV_PATH is defined elsewhere in the module.
with files(proselint).joinpath(_CSV_PATH).open('r') as data:
    reader = csv.reader(data)
    # Flatten the reader's list-of-rows into a single word list.
    wordlist = []
    for row in reader:
        wordlist.extend(row)
```
EDIT: (You should've seen me trying to write the files, I kept ending up with output that looked like this...)

```
"a"
"a","b","l","e"
"a","b","o","u","t"
```

...And then cursing. A lot.
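(That character-per-field output is the classic symptom of handing bare strings to `csv.writer.writerows()`, which iterates each string as if it were a row of one-character fields. A hedged reconstruction of the mistake and a fix, with `QUOTE_ALL` assumed to match the quoted output shown:)

```python
import csv

words = ["a", "able", "about"]

with open("top1000.csv", "w", newline="") as out:
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    # Bug: each string is itself iterable, so every character becomes a field:
    # writer.writerows(words)          # -> "a" / "a","b","l","e" / "a","b","o","u","t"
    # Fix: wrap each word in a one-field row (the newline-delimited layout)...
    writer.writerows([word] for word in words)
    # ...or emit everything as a single comma-separated row:
    # writer.writerow(words)
```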
@Nytelife26 I just realized I have no idea if those CSV files will actually be packaged into the source and/or binary distributions. I'll defer to your expertise on that one, if that's OK. I tried to run poetry locally; it tried to get me to authenticate for something, then crashed my ssh-agent completely. So now I'm stuck typing my passphrase for every git transaction until I fix its mess. :roll_eyes:
it would be nice if python had a flatten function, like pretty much every other language i use, that's for sure. my latest commits should ease your pain somewhat.
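(For what it's worth, `itertools.chain.from_iterable()` is the closest thing the standard library has to a flatten; a minimal sketch of it collapsing `csv.reader()`'s one-field rows, assuming the newline-delimited layout discussed above:)

```python
import csv
from itertools import chain

with open("top1000.csv", newline="") as data:
    # chain.from_iterable() acts as a flatten over the reader's rows.
    wordlist = list(chain.from_iterable(csv.reader(data)))
```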
also, having just tested the build system locally and reviewed the configuration, all files in `proselint/`, including the new CSVs, are included in the build.
i'm ready to merge this if you have no further comments, suggestions or changes to make.
@Nytelife26 All looks good to me, thanks for all your help with this!
This PR formally submits the implementation of `proselint.tools.reverse_existence_check`, which I describe in https://github.com/amperser/proselint/issues/1334#issuecomment-2100920036. It still builds on the great work @vqle23 did in #1351 (in fact, it builds on the branch from that PR, rebased to current `main` and with my changes added as followup commits).

A few things have changed from what I described in that issue comment:

- While the function in `proselint.tools` is still named `reverse_existence_check()`, I renamed the actual checkers to `restricted.top1000` and `restricted.elementary`, as I felt that was a less unwieldy title/category name that doesn't sacrifice any accuracy/descriptiveness.
- The regex `re.compile(r"\w[\w'-]+\w")` and both versions of the `allowed_word` helper function (case-sensitive and not) are now defined statically in the `tools.py` file. The proper one is selected and bound with a `functools.partial()` at the start of each `reverse_existence_check()` call.

(...And I just noticed I left some type annotations on the arguments to the helpers that I'll probably have to take out to appease older Pythons. Bother.)
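(For readers unfamiliar with the pattern, a rough sketch of that `functools.partial()` binding follows. It is an illustration only, not the actual proselint implementation; apart from `reverse_existence_check`, the `allowed_word` idea, and the regex, every name here is invented:)

```python
import functools
import re

_WORD_RE = re.compile(r"\w[\w'-]+\w")

# Two hypothetical helpers, one case-insensitive and one case-sensitive.
def _allowed_word(permitted: set, word: str) -> bool:
    return word.lower() in permitted

def _allowed_word_cased(permitted: set, word: str) -> bool:
    return word in permitted

def reverse_existence_check(text: str, permitted: set, ignore_case: bool = True) -> list:
    """Flag every word in `text` absent from `permitted` (simplified)."""
    # Select the proper helper once, and bind the word list to it.
    helper = _allowed_word if ignore_case else _allowed_word_cased
    allowed = functools.partial(helper, permitted)
    return [m.group() for m in _WORD_RE.finditer(text) if not allowed(m.group())]
```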
Fixes: #1334