codespell-project / codespell

check code for common misspellings
GNU General Public License v2.0

"to and fro" is correct #410

Open EdwardBetts opened 6 years ago

EdwardBetts commented 6 years ago

codespell suggests replacing "fro" with "for". Can we have an exception for the phrase "to and fro"?

https://en.wiktionary.org/wiki/to_and_fro

larsoner commented 6 years ago

Sure, feel free to add it

EdwardBetts commented 6 years ago

Thanks @larsoner. I couldn't find an existing mechanism to specify multi-word exceptions like this. Does it exist and I've missed it, or would it need to be added to handle this case?

larsoner commented 6 years ago

Usually people just add the word to the list like:

fro->for, fro

Right @luzpaz ?

larsoner commented 6 years ago

(I don't think there is a multi-word way in particular, I think it's okay to just use this single-word one in the meantime)

luzpaz commented 6 years ago

~Actually, you're missing a comma at the end:
fro->for, fro,~

Sorry, did you mean something like: https://github.com/codespell-project/codespell/blob/master/codespell_lib/data/dictionary.txt#L655

fro->for, fro is correct if it's in the context of 'to and fro'

something like that?

peternewman commented 6 years ago

So I've just sort of hit another of these, preform->perform. Well actually in this case it was a typo, but preform is a valid word too. But there are others like Cristal.

cristal->crystal, cristal,

Essentially we currently have some entries where the misspelling is also listed as a valid correction. There are two trains of thought here I guess, firstly it's nonsense and we should remove the entry, or more considered, that cristal is most likely a misspelling, but in some (rare) circumstances you may really mean it (see also fro, until we support multi-word corrections #255).

What's the general feeling? I've not used interactive mode, but there a "did you really mean?", and in manual/automatic the option to skip/ignore all potential ones seems like it might be sensible. Essentially treat likely typos differently from definite typos.

larsoner commented 6 years ago

I've not used interactive mode, but there a "did you really mean?",

That is more or less what these entries seem to do. I agree we could add a parameter to be more suggestive (default) or less suggestive (if the "error" is in the list of corrections, don't prompt or report)

peternewman commented 6 years ago

I guess one of the things I've always liked about Codespell is the fact the dictionary is curated, rather than a list of valid words, and hence doesn't normally trip up on valid but obscure/technical words. It sort of feels to me that it goes against that ethos when words are added to the dictionary which are valid (although admittedly mostly rarely used). Even if the other variant (i.e. the "typo") is listed, but even more so when it's not.

larsoner commented 5 years ago

We probably need a new argument for this, that (I agree) should disable these by default.

--strict LEVEL, where for now 0 means include all and 1 (default) means exclude such self corrections? That leaves the option open for other sorts of strictness types later.

larsoner commented 5 years ago

If so, any volunteer to implement this?

peternewman commented 5 years ago

Would a bitmask make more sense, in case someone doesn't want one of the future strict levels? I'll have to pass on implementing it for now.

It probably also needs a dictionary test that we don't just have wit->wit, i.e. there is at least one alternative replacement.
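A minimal sketch of such a dictionary check, assuming entries look like the ones quoted in this thread (`typo->fix1, fix2,` with an optional trailing comma). The helper names `parse_entry` and `lacks_alternative` are hypothetical and are not part of codespell; real dictionary lines may carry extra text that this simple parser does not handle.

```python
def parse_entry(line):
    """Split a 'typo->fix1, fix2,' line into (typo, [fixes]).

    Assumes the simple format quoted in this thread; empty items
    produced by a trailing comma are dropped.
    """
    typo, _, rhs = line.strip().partition("->")
    fixes = [part.strip() for part in rhs.split(",") if part.strip()]
    return typo.strip(), fixes


def lacks_alternative(line):
    """Return True if the entry offers no replacement other than the typo itself."""
    typo, fixes = parse_entry(line)
    return all(fix == typo for fix in fixes)


if __name__ == "__main__":
    for entry in ["wit->wit", "fro->for, fro", "cristal->crystal, cristal,"]:
        status = "BAD (no alternative)" if lacks_alternative(entry) else "ok"
        print(f"{entry!r}: {status}")
```

A test over the whole dictionary could then assert that `lacks_alternative` is false for every line.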

larsoner commented 5 years ago

Yes the idea is to use binary values so it would be a bitmask (it's just a trivial, future compatible one for now)
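To illustrate the bitmask idea, here is a rough sketch. The `--strict` option is only a proposal in this thread, so the constant `STRICT_SKIP_SELF_CORRECTIONS` and the function `should_report` are hypothetical names, not existing codespell code. It follows the proposed meaning: 0 reports everything, and bit 1 (the proposed default) suppresses entries whose "typo" is also listed as a valid correction.

```python
# Hypothetical bit for the proposed --strict option; further bits could be
# added later for other kinds of strictness.
STRICT_SKIP_SELF_CORRECTIONS = 0b1


def should_report(typo, fixes, strict=STRICT_SKIP_SELF_CORRECTIONS):
    """Decide whether a dictionary hit should be reported.

    With the self-correction bit set, an entry such as 'fro->for, fro'
    is treated as a likely-valid word and is not reported.
    """
    if strict & STRICT_SKIP_SELF_CORRECTIONS and typo in fixes:
        return False
    return True


if __name__ == "__main__":
    print(should_report("fro", ["for", "fro"], strict=0))  # include all
    print(should_report("fro", ["for", "fro"]))            # proposed default
    print(should_report("teh", ["the"]))                   # definite typo
```

Because each behavior is a separate bit, a user could later turn individual checks on or off by combining values with `|`.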

peternewman commented 4 years ago

Not that this covers multi-word examples, but the other rare stuff can go in https://github.com/codespell-project/codespell/blob/master/codespell_lib/data/dictionary_rare.txt

yarikoptic commented 1 year ago

I worked around this limitation, as observed in https://framagit.org/medoc92/recoll/-/merge_requests/23#note_1999939, via

ignore-regex = \bto and fro\b

I think it would be valuable to collect/support presence of such phrases (I can't recall ATM any other but remember hitting them) which should be whitelisted although individual words (fro) should be considered a typo.

DimitriPapadopoulos commented 1 year ago

A good idea indeed, however:

  1. Currently codespell splits text into words before processing them. It does not support n-grams.
  2. The dictionary of typos does not support spaces in possible typos.

Item 1 looks like the most complex to address – but then I haven't done my homework. Nowadays, I am not certain it is useful to start supporting n-grams without using deep learning to process them.

yarikoptic commented 1 year ago

couldn't it be just pretty much "pre-feed ignore-regex with all the phrases surrounded with \b"?

DimitriPapadopoulos commented 1 year ago

Do you mean you would add an invisible backspace to your text, just to please codespell?

yarikoptic commented 1 year ago

no, I mean that codespell could just pre-craft regex for all the phrases. \b in Python re module is a word boundary:

\b
Matches the empty string, but only at the beginning or end of a word.

DimitriPapadopoulos commented 1 year ago

So codespell would apply a limited set of regexes for very common such expressions, prior to splitting the text into words, removing the matched words from further checks. I suspect this would have a perceptible impact on performance, but I can't tell how maintainers would react to it.
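The approach discussed above could look roughly like the following sketch. It is not how codespell works internally: the phrase list, the one-entry typo table, and the helpers `build_phrase_regex` and `find_typos` are all assumptions made for illustration. Allowlisted phrases are wrapped in `\b` and blanked out before the text is split into words, so "fro" inside "to and fro" is never checked on its own.

```python
import re

PHRASES = ["to and fro"]   # hypothetical allowlist of multi-word phrases
TYPOS = {"fro": "for"}     # tiny stand-in for the real dictionary


def build_phrase_regex(phrases):
    """Combine all phrases into one regex, each bounded by \\b."""
    alternatives = "|".join(re.escape(phrase) for phrase in phrases)
    return re.compile(r"\b(?:%s)\b" % alternatives)


def find_typos(text, phrase_regex, typos):
    """Return words flagged as typos, ignoring allowlisted phrases."""
    # Replace each matched phrase with spaces of equal length so that
    # character offsets in the remaining text stay unchanged.
    masked = phrase_regex.sub(lambda m: " " * len(m.group()), text)
    words = re.findall(r"[A-Za-z']+", masked)
    return [word for word in words if word.lower() in typos]


if __name__ == "__main__":
    regex = build_phrase_regex(PHRASES)
    print(find_typos("It swung to and fro all day.", regex, TYPOS))
    print(find_typos("This is fro you.", regex, TYPOS))
```

Compiling a single alternation once, rather than one regex per phrase, is one way to limit the performance cost raised above, although the actual impact would need measuring.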