refactor: clean up checks

Nytelife26 commented 3 years ago

It has come to my attention that a lot of checks within proselint are dubious at best, or misguided. For instance:

"Christiana" being considered archaic - it is the name of a place, and of a riot, too.
The filter that quite literally just checks for text matching "the n-word" - what use is telling people not to use a way to refer to it, if we aren't telling people the harm done by the word itself?
The hundreds of phrases and words in the cursing.nfl check - some of them are just numbers, and others have many variations included, almost like a poorly designed censoring system, instead of using regex.
The categorization of various corporate types as different - why have airlinese, corporate speak, etc. in different categories? They should logically be different subcategories under corporate jargon.
The same for LGBTQ and sexism - why not put both, and more, under a module for discriminatory / exclusionary language?

Et cetera. I feel it may be necessary to do a refactor of these checks and categorizations with a formal review to make maintainability easier in future and also to maintain a better linguistic ecosystem.
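
To make the regex point concrete, here is a minimal sketch of how one compiled pattern could cover several inflections that a flat word list has to spell out one by one. This is not proselint's implementation: the names `STEMS`, `build_pattern`, and `find_terms` are hypothetical, and the stems are harmless placeholders.

```python
import re

# Hypothetical placeholder stems; a real check would supply its own list.
STEMS = ["darn", "heck"]

def build_pattern(stems):
    """Compile one case-insensitive pattern for each stem plus common suffixes."""
    alternation = "|".join(re.escape(s) for s in stems)
    # \b keeps matches on whole words; the optional group covers inflections.
    return re.compile(rf"\b(?:{alternation})(?:s|ed|ing)?\b", re.IGNORECASE)

def find_terms(text, pattern):
    """Return (start, end, matched_text) for every hit in text."""
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]

PATTERN = build_pattern(STEMS)
hits = find_terms("Darned heckler says heck", PATTERN)
```

The word boundaries mean "heckler" is not flagged even though it contains a stem, which is the kind of false alarm a naive substring list tends to produce.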

suchow commented 3 years ago

Good ideas, and I share many of your concerns. However, there are several distinct issues that you've bundled here in #1155 and they should be broken down into smaller issues that can be discussed and completed independently. Here are some possible standalone issues:

Refactoring the checks so that they are organized either by the source of the advice (e.g., David Foster Wallace) OR by the domain of the advice (e.g., hyperbole). In general we've been good about this, organizing checks by the domain of the advice, but there may be some vestiges of an earlier organizational scheme.
Determining the right categorization scheme for checks and groups of related checks (e.g., should airlinese be a subcategory of corporate jargon, should sexism and ageism be subcategories of a discriminatory-language module?)
Improving the archaism check to distinguish archaic vs. modern senses of the same character string.
Improving or perhaps deleting the nword.py check.
Improving or perhaps deleting the cursing.nfl check.
Crafting a principled approach to determining what makes a check dubious or misguided and applying that approach consistently across all of proselint, both retrospectively and going forward, perhaps defining it in a policy document.
Making sure that all the messages are informative.
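
As a starting point for the categorization question, a rough sketch of how a proposed scheme could be written down and inspected before any files move. The check names on the left are assumed for illustration, and the dotted names on the right are hypothetical proposals, not an agreed layout.

```python
from collections import defaultdict

# Left: check names assumed for illustration. Right: hypothetical proposed homes.
PROPOSED = {
    "airlinese.misc": "jargon.corporate.airlinese",
    "corporate_speak.misc": "jargon.corporate.general",
    "sexism.misc": "discriminatory.sexism",
    "lgbtq.terms": "discriminatory.lgbtq",
}

def group_by_top_level(mapping):
    """Group old check names under the first component of their proposed name."""
    groups = defaultdict(list)
    for old, new in sorted(mapping.items()):
        groups[new.split(".", 1)[0]].append(old)
    return dict(groups)

groups = group_by_top_level(PROPOSED)
```

Keeping the mapping as plain data would let each standalone issue argue about one row at a time.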

suchow commented 3 years ago

Also, note that cursing.nfl defaults to off, probably for exactly the reason that it produces far too many false alarms in its current state to be useful:

https://github.com/amperser/proselint/blob/372ebf0253ddbf3c404e2f44bf1519fd6510b6ce/proselint/.proselintrc#L13
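
For anyone unfamiliar with how that default is expressed, a minimal sketch follows. It assumes the config is JSON with a "checks" object mapping check names to booleans. The excerpt is illustrative rather than copied from the linked file, and `disabled_checks` is a hypothetical helper.

```python
import json

# Illustrative excerpt only; assumes a JSON config with a "checks" mapping.
EXAMPLE_RC = """
{
  "checks": {
    "archaism.misc": true,
    "cursing.nfl": false
  }
}
"""

def disabled_checks(rc_text):
    """Return the sorted names of checks switched off in the config text."""
    checks = json.loads(rc_text).get("checks", {})
    return sorted(name for name, enabled in checks.items() if not enabled)

off = disabled_checks(EXAMPLE_RC)
```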

As a more general point, we've been wary of any checks that attempt to categorically ban words. The only time that's seemed like a good idea so far is for needless variants, where the determination has already been made for us that the word is not needed.

Nytelife26 commented 3 years ago

they should be broken down into smaller issues that can be discussed and completed independently.

Strong suggestion, actually. Ultimately I just wanted to put this down as an RFC to get people's thoughts prior to doing any real work.

but there may be some vestiges of an earlier organizational scheme.

I believe so, as that's what we saw with the split between dfw.uncomparables (which didn't exist) and uncomparables.misc. I'll check through them if a cleanup like this does occur.

Determining the right categorization scheme for checks and groups of related checks

That would definitely be the right thing to do going forward, I think. I foresee it making maintenance quite a lot easier, and it will help people understand the general scope of these checks better.

Improving the archaism check to distinguish archaic vs. modern senses of the same character string.

That likely falls under the same problem we discussed relating to flag-based parsing, honestly.
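
One possible shape for this, sketched under assumptions: each entry carries a hypothetical `case_sensitive` flag, so a lowercase entry can be made to ignore a capitalized proper noun. The `Entry` type, the entries, and `flag_archaisms` are invented for illustration, and this would not handle sentence-initial capitals or senses that differ without a case difference.

```python
import re
from typing import NamedTuple

class Entry(NamedTuple):
    word: str
    case_sensitive: bool  # hypothetical per-entry flag

# Hypothetical entries; the flag choices are for illustration only.
ENTRIES = [
    Entry("anent", case_sensitive=False),
    Entry("christiana", case_sensitive=True),  # leave the capitalized place name alone
]

def compile_entry(entry):
    """Build a whole-word pattern, case-insensitive unless the entry opts out."""
    flags = 0 if entry.case_sensitive else re.IGNORECASE
    return re.compile(rf"\b{re.escape(entry.word)}\b", flags)

def flag_archaisms(text, entries=ENTRIES):
    """Return matched strings, ordered by entry and then by position."""
    return [m.group(0) for e in entries for m in compile_entry(e).finditer(text)]

found = flag_archaisms("Anent the riot in Christiana, little was said.")
```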

Improving or perhaps deleting [cursing.nword and cursing.nfl].

I would suggest improving them rather than deleting altogether. Principally speaking, many of these things are genuinely words that should be avoided in most contexts, and if we can tighten the error margin much more and make them more definitive, they may very well be suitable for our usage.

Crafting a principled approach to determining what makes a check dubious or misguided and applying that approach consistently across all of proselint, both retrospectively and going forward, perhaps defining it in a policy document.

I would be more than happy to do this. Ultimately it would be good to concretely define and lay out our process for making these decisions and the criteria required for linguistic constructs. For part of this, we could use something similar to my language suitability evaluation framework.

Making sure that all the messages are informative.

That would be quite an easy fix, too. Perhaps one best placed in the same restructure as a categorization evaluation.

cursing.nfl defaults to off

I wasn't aware of that, actually - thanks for the tip. I'll be sure to consider .proselintrc and our ability to set defaults in future.

we've been wary of any checks that attempt to categorically ban words

That's for the best; things like that can get authoritative or out of hand quite quickly. It's nice to see these things taken as seriously as they should be. It'll be easier for us to make those decisions once a framework is in place.

suchow commented 3 years ago

@Nytelife26 Thanks for the response, we're on the same page on every point :)
