amperser / proselint

A linter for prose.
http://proselint.com
BSD 3-Clause "New" or "Revised" License
4.31k stars 177 forks source link

Update er_or.py - remove redundant 'promoter'. #1372

Open davidhicks980 opened 1 month ago

davidhicks980 commented 1 month ago

Please reject this PR if this is not an issue.

Nytelife26 commented 1 month ago

@ferdnyc it seems evident we may not have caught them all. i'll have to review more comprehensively

ferdnyc commented 1 month ago

@Nytelife26 Mmm, my scan was admittedly less-than-comprehensive, and would've easily been thrown off by (for example) differences in indentation or formatting. (It used sort and uniq in the shell to detect dupes.)

ferdnyc commented 1 month ago

What I might suggest is to forge ahead with the reformatting of all the checkers into my proposed form:

TERM_LIST = [
  # ...
]

def some_checker(...):
    # ...

At which point, dupes can easily be scanned for by importing each list from its module.

Probably about the same amount of work as scanning the lists in their current form without breaking them out of the functions.

orgua commented 1 month ago

@ferdnyc keep in mind that there is a massive overhaul coming (#1361 and #1371). File structure and check-design are changing significantly.

ferdnyc commented 1 month ago

@orgua

keep in mind that there is a massive overhaul coming

That's a good point; looking at the changes in #1361, I don't see anything that's incompatible with extracting the lists as importable entities, but it also doesn't do that, and I'm still a strong proponent of taking that step. The benefits of being able to from proselint.checks.foo.bar import BAR_TERMS and manipulate the list of terms as a discrete object can't be overstated. Easy deduplication, easy statistical analysis, etc, etc.

Making that change outside of 1361 would certainly be incredibly disruptive, and make it far harder to merge that PR. Doing it as a post-1361 change would largely require waiting for it to be merged, first. (Assuming one wants to preserve one's sanity by not coding against a moving target like an open PR.)

By the same token, even merging this PR slightly disrupts #1361; if the intent is to get that in, it might be easier to suspend all other changes and focus on getting just the big'un over the finish line. IMHAndUnsolicitedO.

orgua commented 1 month ago

yes, disruptive. I agree. Unfortunately @Nytelife26 is still merging smaller PRs before that large overhaul and thus making rebasing an operation that is easily susceptible to human error and hard to verify.

Arguments against the "global" list are:

The testsuite is already improved and checks for some duplication, but not entries between different checks. Some doubles might be wanted though. When you delete them in one test and disable the other via configuration you would miss warning. The other approach would be to do it JIT for performance reasons. But also that might be more expensive than just rechecking these doubled entries.

Note: You can access list-variables that are inside functions in imported modules if you just want to analyze things.

But beside that it would be beneficial to further overhaul the check-files - add more machine readable metadata like: enabled, preview, language, ... This was proposed by me in #1362 and would get rid of the hardcoded config-file (code duplication). Additional to that change it would be possible to also add a checktype-metadata and just have that list that you mentioned. At least it would be possible for the most common existence check.

Nytelife26 commented 1 month ago

@ferdnyc keep in mind that there is a massive overhaul coming (#1361 and #1371). File structure and check-design are changing significantly.

if you plan on sticking around to continue working on it with me, i can close #1371 and we can pick up where we left off. it's up to you whether or not you wish to continue on this, or whether you'd prefer to leave the rest to me. in any case, i greatly appreciate the efforts you've put into it. you've been massively helpful in setting out a clear, definitive way for proselint to have a future.

Nytelife26 commented 1 month ago

yes, disruptive. I agree. Unfortunately @Nytelife26 is still merging smaller PRs before that large overhaul and thus making rebasing an operation that is easily susceptible to human error and hard to verify.

i have learned my lesson after the first time. i plan on resolving the created conflicts with a simple merge instead of a rebase, to make it easier, instead of working with a huge number of changes over hundreds of files. alternatively, i can set the rebase mode to override everything else with the PR changes, but we'll see.

i agree with you and @ferdnyc. i will ultimately suspend all further developments until the refactor is complete.

ferdnyc commented 1 month ago

Oh, man, I had a very long reply typed up here that naturally got lost in a Chrome crash. But maybe that's for the best, you all didn't really need to be subjected to all of that and this conversation should really be in an Issue or Discussion anyway. But I will just say:

@orgua

Arguments against the "global" list are:

  • it was slightly slower in my tests
  • some list are generated dynamically, so this wouldn't be a universal approach (but that's fixable by using the function-output)

On the speed issue, I can't believe there would be any significant difference, and my own tests don't bear that out. It's a reference to a list, which is immediately being passed to another function. Whether that reference points to a function-scope symbol or a module-scope symbol can't possibly make any appreciable difference.

The only scenario I can come up with, where module-level variables might be appreciably slower, is when you're running with a lot of checks disabled... and even then, only if proselint is still importing all of the check modules (even the disabled ones). But in that case I'd say that the right fix is to make the import code smarter and avoid importing anything it doesn't need.

Generated lists are possibly the most important to hoist out of the function, I'd say, because why should they be regenerated every time it's called? The resulting lists being generated are static in nearly all cases, so it makes sense to store them somewhere for reuse.

In fact, for some of the generated lists, where the actual generation is just a matter of convenience/laziness, if speed is really a concern then those checks could be optimized by simply running the generator function in a terminal and dumping the output into the code as a static list.

(I'm talking about things like in security/password.py...): https://github.com/amperser/proselint/blob/94a1eea991ffc171a28115523876add890f78225/proselint/checks/security/password.py#L23-L31

...I don't think the runtime of those kinds of operations is significant enough to be worth worrying about (not even the significantly bigger list generated in misc/waxing.py misc/waxed.py), but the option is there.

Note: You can access list-variables that are inside functions in imported modules if you just want to analyze things.

How? If there's a way other than by digging around in its __code__ object (or having the inspect module do it for you), I don't know it. And __code__ spelunking is tricky enough to do by hand, even harder to do programmatically. The symbols end up in weird and not-always-predictable places.

Plus, that still doesn't get you access to the generated lists, because the function has to be executed for those variables to exist.

orgua commented 1 month ago

some minimal feedback:

i think we are on the same page. to grow this project needs better software quality

Nytelife26 commented 1 month ago

@orgua

  • These generated lists should be best assembled but during compile time, like constexpr. Python hasn't something even close to it. Porting the core functionality to rust was my far stretched goal ...

i was considering this for a while. even typescript would be an improvement if you ask me. the major issue there is it's a large project and a rust rewrite would be a proportional undertaking (and import dynamics work a lot differently).

while you're here, would you mind answering my previous comment?

if you plan on sticking around to continue working on it with me, i can close #1371 and we can pick up where we left off. it's up to you whether or not you wish to continue on this, or whether you'd prefer to leave the rest to me.

it would be helpful to know if it's worth proceeding with my own efforts or if i should wait for you to have more time so we can continue collaborating properly.

orgua commented 1 month ago

proselint is large? i wouldn't even call it medium :) There are some base algos and the rest is gluecode and definitions of checks.

@Nytelife26 do you have access to your last mail address? there should be two mails waiting. i might have time - but lets discuss that elsewhere.

Nytelife26 commented 1 month ago

@Nytelife26 do you have access to your last mail address? there should be two mails waiting.

my apologies, i tend not to check my university email much outside of term dates. i have replied, so hopefully progress awaits.

Nytelife26 commented 1 month ago

@orgua just a ping to make sure you're aware we've set a time. let me know if you can make it.