codespell-project / codespell

check code for common misspellings
GNU General Public License v2.0

"Goverment" not recognized in all relevant lines #3380

Closed buhtz closed 1 month ago

buhtz commented 6 months ago

Lines 2 and 3 below both contain the typo Goverment (missing n). The problem is that codespell only reports it in line 2:

def test_markup_without_prefix_suffix(self):
    sut = textruns.cut_into_runs('+Goverment+')
    self.assertEqual(sut[0][0], 'Goverment')

$ codespell foobar.py
foobar.py:2: Goverment ==> Government

Is there a good reason for this behavior, or is it a bug? I am using codespell 2.2.6 from PyPI.

DimitriPapadopoulos commented 6 months ago

It's always the same question: the definition of a "word". The heart of codespell is the regex that splits text into words. Its default value is defined here: https://github.com/codespell-project/codespell/blob/3ae34b514b562bb4a235ffce32dc05cb60e87008/codespell_lib/_codespell.py#L47 You could change it to r"[\w\-]+" instead, but then codespell would no longer catch other typos such as ahven't.
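A minimal sketch of that trade-off, using Python's `re` module directly rather than codespell itself. The first pattern is only an assumed stand-in that keeps apostrophes inside words; the exact default is whatever the linked line defines. The second pattern is the alternative suggested above.

```python
import re

# Assumption: a stand-in pattern that treats the apostrophe as a word character.
# It is NOT guaranteed to match codespell's real default, which is at the linked line.
WITH_APOSTROPHE = re.compile(r"[\w\-']+")

# The alternative pattern suggested in the comment above.
WITHOUT_APOSTROPHE = re.compile(r"[\w\-]+")

text = "I ahven't seen it"

# With the apostrophe in the character class, "ahven't" stays one word,
# so it can be looked up as a whole.
with_tokens = WITH_APOSTROPHE.findall(text)

# Without it, the apostrophe splits the word into two fragments,
# and the typo "ahven't" is never seen as a single word.
without_tokens = WITHOUT_APOSTROPHE.findall(text)

print(with_tokens)
print(without_tokens)
```

Whichever pattern is chosen, some class of typo ends up hidden, which is the point being made here.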

Lots of people have tried to improve it, but the conclusion is always that a perfect regex that fits all use cases does not exist. Parsers specific to the programming language at hand would need to be added to the equation, or spellcheckers based on deep learning...

buhtz commented 6 months ago

I don't understand that regex, although I tried it at regex101. But here is my guess:

sut = textruns.cut_into_runs('+Goverment+')

In that line, codespell finds (because of the regex) the word Goverment and looks it up in the dictionary?

And in this line

self.assertEqual(sut[0][0], 'Goverment')

The word is 'Goverment' (including the '). Such a word does not exist in the dictionary, and that is why it does not trigger an error?
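That guess can be checked with a rough sketch outside codespell. Everything here is an assumption made for illustration: `word_regex` is a stand-in pattern that treats the apostrophe as a word character (the real default is at the link above), `misspellings` is a hypothetical one-entry dictionary, and `find_typos` is a made-up helper, not part of codespell.

```python
import re

# Assumption: stand-in for the default word regex, with ' as a word character.
word_regex = re.compile(r"[\w\-']+")

# Hypothetical one-entry dictionary; lowercase keys are an assumption of this sketch.
misspellings = {"goverment": "Government"}

lines = [
    "sut = textruns.cut_into_runs('+Goverment+')",
    "self.assertEqual(sut[0][0], 'Goverment')",
]


def find_typos(line):
    """Return the tokens of `line` that appear in the hypothetical dictionary."""
    return [w for w in word_regex.findall(line) if w.lower() in misspellings]


for line in lines:
    # Show how the line is split into words, then which words are flagged.
    print(word_regex.findall(line), find_typos(line))
```

In the first line the `+` characters are not part of the pattern, so they cut `Goverment` free of the quotes and it is looked up on its own. In the second line the quotes sit directly against the word, so the token is `'Goverment'` with both quotes attached, and that string is not a dictionary key.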