Open EdwardBetts opened 6 years ago
Sure, feel free to add it
Thanks @larsoner. I couldn't find an existing mechanism to specify multi-word exceptions like this. Does it exist and I've missed it, or would it need to be added to handle this case?
Usually people just add the word to the list like:
fro->for, fro
Right @luzpaz ?
(I don't think there is a multi-word way in particular, I think it's okay to just use this single-word one in the meantime)
~Actually, you're missing a comma at the end:
fro->for, fro,
~
Sorry, did you mean something like:
https://github.com/codespell-project/codespell/blob/master/codespell_lib/data/dictionary.txt#L655
fro->for, fro is correct if it's in the context of 'to and fro'
Something like that?
So I've just sort of hit another of these, preform->perform. Well actually in this case it was a typo, but preform is a valid word too. But there are others like Cristal.
cristal->crystal, cristal,
Essentially we currently have some entries where the misspelling is also listed as a valid correction. There are two trains of thought here, I guess: first, that it's nonsense and we should remove the entry; or, more considered, that cristal is most likely a misspelling, but in some (rare) circumstances you may really mean it (see also fro, until we support multi-word corrections, #255).
What's the general feeling? I've not used interactive mode, but a "did you really mean?" prompt there, and in manual/automatic mode an option to skip/ignore all potential ones, seems like it might be sensible. Essentially, treat likely typos differently from definite typos.
I've not used interactive mode, but there a "did you really mean?",
That is more or less what these entries seem to do. I agree we could add a parameter to be more suggestive (default) or less suggestive (if the "error" is in the list of corrections, don't prompt or report)
I guess one of the things I've always liked about Codespell is the fact that the dictionary is a curated list of misspellings, rather than a list of valid words, and hence doesn't normally trip up on valid but obscure/technical words. It sort of feels to me that adding valid (although admittedly mostly rarely used) words to the dictionary goes against that ethos, even when the other variant (i.e. the "typo") is listed as a correction, and even more so when it's not.
We probably need a new argument for this, that (I agree) should disable these by default: --strict LEVEL, where for now 0 means include all and 1 (default) means exclude such self corrections? That leaves the option open for other sorts of strictness types later.
If so, any volunteer to implement this?
Would a bitmask make more sense, in case someone doesn't want one of the future strict levels? I'll have to pass on implementing it for now.
It probably also needs a dictionary test that we don't just have wit->wit, i.e. there is at least one alternative replacement.
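A rough sketch of what such a dictionary test could look like, in case it helps the discussion. It assumes the simple `typo->fix, fix,` line shape quoted in this thread; `parse_entry` and `only_self_corrections` are hypothetical helper names, not existing codespell functions.

```python
def parse_entry(line):
    """Split a 'typo->fix1, fix2,' line into (typo, [fixes])."""
    typo, _, rhs = line.strip().partition("->")
    fixes = [part.strip() for part in rhs.split(",")]
    # Drop empty fragments left behind by a trailing comma.
    return typo.strip(), [fix for fix in fixes if fix]


def only_self_corrections(lines):
    """Return typos that offer no replacement other than the typo itself."""
    bad = []
    for line in lines:
        typo, fixes = parse_entry(line)
        # An empty replacement list is flagged too, since all([]) is True.
        if all(fix == typo for fix in fixes):
            bad.append(typo)
    return bad


sample = ["fro->for, fro,", "cristal->crystal, cristal,", "wit->wit"]
print(only_self_corrections(sample))  # ['wit']
```

A real test would read the shipped dictionary files and assert that this list is empty.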
Yes, the idea is to use binary values, so it would be a bitmask (it's just a trivial, future-compatible one for now).
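To make the bitmask idea concrete, here is a minimal sketch. The `--strict` option is only a proposal in this thread; the `Strict` enum, its member names, and `should_report` are made up for illustration. The bit meaning follows the proposal above: 0 includes everything, 1 excludes self corrections.

```python
from enum import IntFlag


class Strict(IntFlag):
    """Hypothetical strictness bits; only the first one is discussed here."""
    NONE = 0
    SKIP_SELF_CORRECTIONS = 1  # hide entries such as cristal->crystal, cristal,
    # Future levels would take 2, 4, 8, ... so they can be combined freely.


def should_report(typo, fixes, level):
    """Decide whether a dictionary hit is reported under the given level."""
    is_self_correction = typo in fixes
    if is_self_correction and (level & Strict.SKIP_SELF_CORRECTIONS):
        return False
    return True


print(should_report("cristal", ["crystal", "cristal"], Strict.SKIP_SELF_CORRECTIONS))  # False
print(should_report("cristal", ["crystal", "cristal"], Strict.NONE))  # True
```

Because `IntFlag` members behave like integers, a plain `--strict 1` value parsed as `int` would work with the same check.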
Not that this covers multi-word examples, but the other rare stuff can go in https://github.com/codespell-project/codespell/blob/master/codespell_lib/data/dictionary_rare.txt
I am going to work around this limitation, as observed in https://framagit.org/medoc92/recoll/-/merge_requests/23#note_1999939, via
ignore-regex = \bto and fro\b
I think it would be valuable to collect/support presence of such phrases (I can't recall any others ATM, but I remember hitting them) which should be whitelisted although the individual words (fro) should be considered a typo.
A good idea indeed, however:
Item 1 looks like the most complex to address – but then I haven't done my homework. Nowadays, I am not certain it is useful to start supporting n-grams without using deep learning to process them.
couldn't it be just pretty much "pre-feed ignore-regex with all the phrases surrounded with \b"?
Do you mean you would add an invisible backspace to your text, just to please codespell?
no, I mean that codespell could just pre-craft regex for all the phrases. \b in Python re module is a word boundary:
\b
Matches the empty string, but only at the beginning or end of a word.
So codespell would apply a limited set of regexes for very common such expressions, prior to splitting the text into words, removing the matched words from further checks. I suspect this would have a perceptible impact on performance, but I can't tell how maintainers would react to it.
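As a sketch of that pre-filter, assuming a small allow-list of phrases: everything here (`ALLOWED_PHRASES`, `TYPOS`, `find_typos`) is a hypothetical stand-in and not how codespell is actually structured. Only "to and fro" comes from this thread.

```python
import re

# Hypothetical allow-list; only "to and fro" comes from the thread.
ALLOWED_PHRASES = ["to and fro"]

# One alternation, each phrase wrapped in \b so it matches whole words only.
PHRASE_RE = re.compile(
    "|".join(r"\b" + re.escape(p) + r"\b" for p in ALLOWED_PHRASES),
    re.IGNORECASE,
)

# Tiny stand-in for the misspelling dictionary.
TYPOS = {"fro": ["for", "fro"]}


def find_typos(text):
    """Blank out allowed phrases, then check the remaining words."""
    # Replace each match with spaces of equal length so offsets stay stable.
    masked = PHRASE_RE.sub(lambda m: " " * len(m.group()), text)
    return [w for w in re.findall(r"[a-z']+", masked.lower()) if w in TYPOS]


print(find_typos("He paced to and fro."))  # []
print(find_typos("This is fro you."))      # ['fro']
```

Compiling the phrases into a single pattern keeps this to one extra pass over each line, although the performance cost on large trees would still need measuring.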
codespell suggests replacing "fro" with "for". Can we have an exception for the phrase "to and fro"?
https://en.wiktionary.org/wiki/to_and_fro