amperser / proselint

A linter for prose.
http://proselint.com
BSD 3-Clause "New" or "Revised" License
4.31k stars 177 forks source link

checks: Remove duplicate items from lists #1369

Closed ferdnyc closed 1 month ago

ferdnyc commented 2 months ago

While reading through one of the proselint/checks/*/misc.py files, I happened to notice a duplicate item on its list of matching terms.

So I ran a quick script to look for duplicate items in all of the misc.py files, spotted a couple more that way, and removed those as well.

Redundancies NOT fixed

Both cliches/misc.py and redundancy/misc.py contain several more duplicates, by virtue of the same term appearing on multiple different lists of cliches/redundancies — those, I didn't touch. This is only about duplicates on the same list.

(Plus, the irony of having redundant terms in the redundancies check is inherently entertaining. Enough to justify leaving those alone anyway.)

ferdnyc commented 2 months ago

@Nytelife26

i will indeed go through and remove duplicates in other lists as part of this.

Well, my thinking in leaving them was: Where the lists come from different external sources (as is the case in all of those left-over duplicates), how do you choose which one keeps the term, and which one doesn't? (And then either way, one list is no longer in sync with its source.)

ferdnyc commented 2 months ago

@Nytelife26

But if you want the lists of remaining, cross-list duplicates, they are:

cliches/misc.py:

        "between a rock and a hard place",
        "conspicuous by its absence",
        "crystal clear",
        "easier said than done",
        "gilding the lily",
        "horns of a dilemma",
        "last but not least",
        "moment of truth",
        "par for the course",
        "thick as thieves",

redundancy/misc.py:

        ["blend",             ["blend together"]],
        ["Challah",           ["Challah bread"]],
        ["collaborate",       ["collaborate together"]],
        ["combine",           ["combine together"]],
        ["cooperation",       ["mutual cooperation"]],
        ["fundamentals",      ["basic fundamentals"]],
        ["gift",              ["free gift"]],
        ["innovation",        ["new innovation"]],
        ["merge",             ["merge together"]],
        ["mix",               ["mix together"]],
        ["mutual respect",    ["mutual respect for each other"]],
        ["plans",             ["future plans"]],
        ["potable water",     ["potable drinking water"]],
        ["refer",             ["refer back"]],
        ["retreat",           ["retreat back"]],
        ["surrounded",        ["surrounded on all sides"]],
        ["veteran",           ["former veteran"]],
        ["visible",           ["visible to the eye"]],
Nytelife26 commented 2 months ago

how do you choose which one keeps the term, and which one doesn't?

i am currently waiting to accrue the funds to obtain the sources myself (my university library does not have them unfortunately), which would make things easier, but you're right.

i was considering aggregating the lists so they run by type of check rather than by source, and using comments to keep track, but that might get messy.

i'll come up with something.

ferdnyc commented 2 months ago

@Nytelife26

i was considering aggregating the lists so they run by type of check rather than by source, and using comments to keep track, but that might get messy.

That's certainly an option. I mentioned in #1334 that in testing my implementation of reverse-existence, I'd factored the lists of terms out of the check functions so that I could access them externally, specifically in the REPL when I was testing/debugging/benchmarking interactively.

That's useful enough that I'd suggest it as an across-the-board change for all of the checking functions — instead of holding their lists internally, move them out to module-level variables which can be accessed from other code.

So, where now the "template" is:

def some_check_fn(args):
    terms = [
      "foo",
      "bar",
      "baz",
    ]
    return some_tool_function(text, terms, blah, blah)

My suggestion would be to go with:


SOME_CHECK_TERMS = [
  "foo",
  "bar",
  "baz",
]

def some_check_fn(args):
    terms = SOME_CHECK_TERMS
    return some_tool_function(text, terms, blah, blah)

The major advantage there is that you can use from proselint.checks.checker.misc import SOME_CHECK_TERMS to get at that list of terms from anywhere in the codebase — from any codebase, really, even outside of proselint.

But it would also permit combining multiple lists together1 and de-duplicating them, e.g.:

FIRST_SOURCE_TERMS = [...]
SECOND_SOURCE_TERMS = [...]

def some_check_fn():
    terms = list(set(FIRST_SOURCE_TERMS + SECOND_SOURCE_TERMS))

   return some_tool_function(text, terms, blah, blah)

Do that, it doesn't matter if the source lists contain redundancies, they'll be dropped in building the final terms list.

Notes

  1. (I like to think I used "combine... together" here only because it was on the duplicate-redundancies list I'd just posted, and seeing it there meant it was on my mind as I was writing this. Because the alternative is that I write tedious, redundant prose.)
Nytelife26 commented 1 month ago

i like that solution, actually. it could simplify testing, too. i think the goal should eventually be to refactor checks to merely return a list of regex-enabled sequences along with their options, and run the existence check once so it can search the source for everything in as few passes as possible, like what you did with reverse_existence_check in #1370.