Can't ignore an unknown acronym only when it's written in upper case

vadz commented 4 years ago

I have a lot of occurrences of the word SEH in my code and I'd like codespell to leave it alone, but this doesn't seem to work, using neither -I nor -L option, e.g.:

% echo SEH | codespell -L SEH -
1: SEH
        SEH ==> SHE

The problem seems to be the case, i.e. if I use -L seh, it does recognize "SEH", but I'd prefer not to have to do it, as "seh" is much more likely to be a misspelling of "she" than anything else -- "SEH" itself is always written in upper case.

I'm not sure if it's the same as #1128, as it happens for camel case identifier there, while my problem doesn't seem to be related to camel case (or an identifier, for that matter, as the problematic acronym only occurs inside comments), sorry in advance if it's the same bug and so an already known problem.

P.S. This is using 1.17.1 from Debian unstable.

peternewman commented 4 years ago

Hi @vadz ,

Currently codespell is case-insensitive, so as per the readme: https://github.com/codespell-project/codespell/blob/master/README.rst

The -I flag can be used for a list of certain words to allow that are in the codespell dictionaries. The format of the file is one word per line. Invoke using: codespell -I path/to/file.txt to execute codespell referencing said list of allowed words. Important note: The list passed to -I is case-sensitive based on how it is listed in the codespell dictionaries.

So you need to use -L seh.

Our dictionary checking needs to be case-insensitive because for example a C define is normally written in upper case so we'd want to catch SPELING as being an error without the dictionary needing to have both speling AND SPELING in it. I guess we could theoretically make the ignores match the case of the word in the code, but if you're using a term which codespell thinks is a typo you might have it as a define in capitals, a variable and possible also in camel case, so I think it would get rather complicated. I'll leave this open though, as it is different to the camel case request and another possible feature request.

Holzhaus commented 4 years ago

I have the same problem. I want to exclude InOut, but not inout (because that's possibly a typo and supposed to say input). Is there some way?

Holzhaus commented 4 years ago

How about filtering matches after looking them up in the dictionary?

peternewman commented 4 years ago

I have the same problem. I want to exclude InOut, but not inout (because that's possibly a typo and supposed to say input). Is there some way?

So you want to do it based on the case of the typo?

Currently -x FILE, --exclude-file FILE would be your only option.

How about filtering matches after looking them up in the dictionary?

Do you mean based on the correction? I don't really follow.

Holzhaus commented 4 years ago

I have the same problem. I want to exclude InOut, but not inout (because that's possibly a typo and supposed to say input). Is there some way?

So you want to do it based on the case of the typo?

Currently -x FILE, --exclude-file FILE would be your only option.

That would ignore the whole line and I'd also have to specify every line that this word occurs in, right?

How about filtering matches after looking them up in the dictionary?

Do you mean based on the correction? I don't really follow.

My code has something like char* pInOut;
I run codespell --ignore-regex "\\Wp(?=[A-Z]\\w*\\W)" to strip out the p prefix for pointers
codespell warns InOut => input, in out. This is where we could filter using a case-sensitive whiltelist (containing InOut in this case)

Using -L InOut does not work and -L inout is not feasible because I only want to ignore words with that that exact case.

Right now I worked around it by writing a small wrapper that reads a .codespellignore file and creates an ignore-regex pattern based on that, but that's probably not really performant and seems "hacky" to me.

peternewman commented 4 years ago

That would ignore the whole line and I'd also have to specify every line that this word occurs in, right?

Yes, not ideal but quite easy with grep. I do it quite a lot: https://github.com/OpenLightingProject/ola/blob/master/.codespellignorelines

It's not perfect, but for a lot of edge cases, like uint/unit, even case sensitivity wouldn't help.

Actually for this specific use case, the ignore regex is probably perfect: https://github.com/codespell-project/codespell/blob/master/codespell_lib/_codespell.py#L277-L283

Doh, just realised you're doing that already! You can vanish pInOut before you even start matching things. Nice trick with stripping the p's off the front by the way!

Care to share your wrapper @Holzhaus as it might help others.

@vadz I think the above would work for you too?

3. codespell warns `InOut => input, in out`. This is where we could filter using a case-sensitive whiltelist (containing `InOut` in this case)

Understood. So you're talking about filtering based on the case of the MATCH rather than the dictionary entry. You probably either want two whitelists, or some way to flag whether each entry should be case sensitive or not. It's certainly possible!

Holzhaus commented 4 years ago

Care to share your wrapper @Holzhaus as it might help others.

https://github.com/mixxxdj/mixxx/pull/3066

The wrapper also parses the git diff and filters out all results that are not coming from added lines (because our legacy codebase is quite large, and fixing other peoples typos in merge commits is annoying).

vadz commented 4 years ago

@peternewman It looks like it ought to work, but I couldn't test it because of this:

% pip install git+https://github.com/codespell-project/codespell@293f12152f7a6cef6e87cd6c78d8aee69db7af28
Collecting git+https://github.com/codespell-project/codespell@293f12152f7a6cef6e87cd6c78d8aee69db7af28
  Cloning https://github.com/codespell-project/codespell (to revision 293f12152f7a6cef6e87cd6c78d8aee69db7af28) to /tmp/pip-req-build-8HAgKz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-8HAgKz/setup.py", line 10, in <module>
        from codespell_lib import __version__
      File "codespell_lib/__init__.py", line 1, in <module>
        from ._codespell import main, _script_main, VERSION as __version__  # noqa
      File "codespell_lib/_codespell.py", line 24, in <module>
        import configparser
    ImportError: No module named configparser

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-req-build-8HAgKz/

I didn't look into it in details, so it's perfectly possible that I'm doing something wrong, but I couldn't actually test your suggestion...

larsoner commented 4 years ago

This is a builtin module going back to 2.7:

https://docs.python.org/2.7/library/configparser.html

Are you on 2.6? It's unsupported

Holzhaus commented 4 years ago

You need at least Python 3 I think. In python 2.7 the module has another name (ConfigParser instead of configparser) and Python 2 has reached end-of-life months ago anyway.

vadz commented 4 years ago

Oops, sorry, pip indeed uses 2.7.16 by default on my Debian 10 (Buster) system, I keep forgetting that I need to use pip3 explicitly. Its output was rather weird anyhow:

Collecting git+https://github.com/codespell-project/codespell@293f12152f7a6cef6e87cd6c78d8aee69db7af28
  Cloning https://github.com/codespell-project/codespell (to revision 293f12152f7a6cef6e87cd6c78d8aee69db7af28) to /tmp/pip-req-build-5la8ce8k
Building wheels for collected packages: codespell
  Running setup.py bdist_wheel for codespell ... error
  Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-req-build-5la8ce8k/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-wkwq0mmi --python-tag cp37:
  usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: -c --help [cmd1 cmd2 ...]
     or: -c --help-commands
     or: -c cmd --help

  error: invalid command 'bdist_wheel'

  ----------------------------------------
  Failed building wheel for codespell
  Running setup.py clean for codespell
Failed to build codespell
Installing collected packages: codespell
  Running setup.py install for codespell ... done
Successfully installed codespell-2.0.dev0

with the "Failed" line in ominous red, but it seems to have been installed successfully and I can confirm that adding --ignore-regex '\WSEH\W' to the command line (and removing the suppression lines) works for me, thanks.

Holzhaus commented 4 years ago

You need to install wheel before installation:

pip3 install wheel

peternewman commented 4 years ago

Oops, sorry, pip indeed uses 2.7.16 by default on my Debian 10 (Buster) system, I keep forgetting that I need to use pip3 explicitly.

I think this should do the trick for you: sudo update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 1

peterjc commented 3 years ago

This works OK to ignore three letter upper case words (i.e. many acronyms):

codespell --ignore-regex '\W[A-Z][A-Z][A-Z]\W'

Or being more clever, 2 or 3 letter acronyms:

codespell --ignore-regex '\W([A-Z]{2,3})\W'

However, this fails with consecutive words to be ignored via the regex, see #2049 for a potential solution.

LilithHafner commented 1 year ago

I think this is a pretty reasonable workflow:

run codespell
see a false positive of the form SEH ==> SHE
add SEH to the ignore list
run codespell
see a false positive of the form SEH ==> SHE
be sad

I'd like steps 5 and 6 to be different. In order of preference:

SEH in the ignore list matches SEH in code
SEH in the ignore list does nothing, but a warning/error is produced that there is a useless entry in the ignore list (which says why it's useless)
SEH in the ignore list does nothing and there is no warning/error at all (current situation)

DimitriPapadopoulos commented 1 year ago

Pull requests are welcome, but make sure you address components such as https://github.com/codespell-project/codespell/issues/1578#issuecomment-654183259.

vEnhance commented 10 months ago

Pull requests are welcome, but make sure you address components such as #1578 (comment).

Just to be clear, would you consider a pull request where:

when a word is written in all lowercase in the ignore list, the behavior remains the same (so seh would match seh, Seh, SEH, etc.), but
if a word is written with a specific casing in the ignore list, then it matches only that specific case, (so SEH matches only SEH), rather than the current behavior of doing nothing with no warning.

I believe this is also the suggestion in https://github.com/codespell-project/codespell/issues/1578#issuecomment-1763478663

DimitriPapadopoulos commented 10 months ago

I am not the maintainer, but I think that would be a good, long-awaited change. I am only worried about compatibility. I don't have a clear idea (yet?) of what could possibly go wrong in existing codebases when we release such a change. Comments on that?

Also, what about words in the dictionary themselves that are not all lowercase? Such as Feburary? Should the case have some special meaning? I am referring to the typo, of course, not to the suggested fixes which should keep the uppercase as required.

grep '^[A-Z]'

```console $ grep '^[A-Z]' codespell_lib/data/dictionary*.txt codespell_lib/data/dictionary_code.txt:MSDOS->MS-DOS codespell_lib/data/dictionary.txt:ACI->ACPI codespell_lib/data/dictionary.txt:ACII->ASCII codespell_lib/data/dictionary.txt:Amercia->America codespell_lib/data/dictionary.txt:Ardiuno->Arduino codespell_lib/data/dictionary.txt:Australien->Australian codespell_lib/data/dictionary.txt:Austrlaian->Australian codespell_lib/data/dictionary.txt:Berkley->Berkeley codespell_lib/data/dictionary.txt:Bernouilli->Bernoulli codespell_lib/data/dictionary.txt:Blitzkreig->Blitzkrieg codespell_lib/data/dictionary.txt:Bonnano->Bonanno codespell_lib/data/dictionary.txt:Brasillian->Brazilian codespell_lib/data/dictionary.txt:Britian->Britain codespell_lib/data/dictionary.txt:Brittish->British codespell_lib/data/dictionary.txt:Buddah->Buddha codespell_lib/data/dictionary.txt:Buddist->Buddhist codespell_lib/data/dictionary.txt:Caesarian->Caesarean codespell_lib/data/dictionary.txt:Cambrige->Cambridge codespell_lib/data/dictionary.txt:Capetown->Cape Town codespell_lib/data/dictionary.txt:Carmalite->Carmelite codespell_lib/data/dictionary.txt:Carnagie->Carnegie codespell_lib/data/dictionary.txt:Carnagie-Mellon->Carnegie-Mellon codespell_lib/data/dictionary.txt:Carnigie->Carnegie codespell_lib/data/dictionary.txt:Carnigie-Mellon->Carnegie-Mellon codespell_lib/data/dictionary.txt:Carribbean->Caribbean codespell_lib/data/dictionary.txt:Carribean->Caribbean codespell_lib/data/dictionary.txt:Carthagian->Carthaginian codespell_lib/data/dictionary.txt:Cataline->Catiline, Catalina, codespell_lib/data/dictionary.txt:Ceasar->Caesar codespell_lib/data/dictionary.txt:Celcius->Celsius codespell_lib/data/dictionary.txt:Cgywin->Cygwin codespell_lib/data/dictionary.txt:Champange->Champagne codespell_lib/data/dictionary.txt:Cincinatti->Cincinnati codespell_lib/data/dictionary.txt:Cincinnatti->Cincinnati codespell_lib/data/dictionary.txt:Conneticut->Connecticut codespell_lib/data/dictionary.txt:Cyrllic->Cyrillic codespell_lib/data/dictionary.txt:Dardenelles->Dardanelles codespell_lib/data/dictionary.txt:DCHP->DHCP codespell_lib/data/dictionary.txt:Decemer->December codespell_lib/data/dictionary.txt:Dravadian->Dravidian codespell_lib/data/dictionary.txt:EBCIDC->EBCDIC codespell_lib/data/dictionary.txt:EDCDIC->EBCDIC codespell_lib/data/dictionary.txt:Enlish->English, enlist, codespell_lib/data/dictionary.txt:Europian->European codespell_lib/data/dictionary.txt:Europians->Europeans codespell_lib/data/dictionary.txt:Eurpean->European codespell_lib/data/dictionary.txt:Eurpoean->European codespell_lib/data/dictionary.txt:Fahrenheight->Fahrenheit codespell_lib/data/dictionary.txt:Farenheight->Fahrenheit codespell_lib/data/dictionary.txt:Farenheit->Fahrenheit codespell_lib/data/dictionary.txt:Febuary->February codespell_lib/data/dictionary.txt:Feburary->February codespell_lib/data/dictionary.txt:Flemmish->Flemish codespell_lib/data/dictionary.txt:Formalhaut->Fomalhaut codespell_lib/data/dictionary.txt:Foundland->Newfoundland codespell_lib/data/dictionary.txt:Fransiscan->Franciscan codespell_lib/data/dictionary.txt:Fransiscans->Franciscans codespell_lib/data/dictionary.txt:FTBS->FTBFS codespell_lib/data/dictionary.txt:Galations->Galatians codespell_lib/data/dictionary.txt:Gameboy->Game Boy codespell_lib/data/dictionary.txt:Ghandi->Gandhi codespell_lib/data/dictionary.txt:Godounov->Godunov codespell_lib/data/dictionary.txt:Gothenberg->Gothenburg codespell_lib/data/dictionary.txt:Gottleib->Gottlieb codespell_lib/data/dictionary.txt:Guaduloupe->Guadalupe, Guadeloupe, codespell_lib/data/dictionary.txt:Guadulupe->Guadalupe, Guadeloupe, codespell_lib/data/dictionary.txt:Guatamala->Guatemala codespell_lib/data/dictionary.txt:Guatamalan->Guatemalan codespell_lib/data/dictionary.txt:Guilia->Giulia codespell_lib/data/dictionary.txt:Guilio->Giulio codespell_lib/data/dictionary.txt:Guiness->Guinness codespell_lib/data/dictionary.txt:Guiseppe->Giuseppe codespell_lib/data/dictionary.txt:Habsbourg->Habsburg codespell_lib/data/dictionary.txt:Hallowean->Hallowe'en, Halloween, codespell_lib/data/dictionary.txt:Hatian->Haitian codespell_lib/data/dictionary.txt:Heidelburg->Heidelberg codespell_lib/data/dictionary.txt:Hyperldger->Hyperledger codespell_lib/data/dictionary.txt:I'sd->I'd codespell_lib/data/dictionary.txt:Ihaca->Ithaca codespell_lib/data/dictionary.txt:ISNB->ISBN codespell_lib/data/dictionary.txt:Israelies->Israelis codespell_lib/data/dictionary.txt:Istambul->Istanbul codespell_lib/data/dictionary.txt:Japanes->Japanese codespell_lib/data/dictionary.txt:Johanine->Johannine codespell_lib/data/dictionary.txt:Jospeh->Joseph codespell_lib/data/dictionary.txt:Laotion->Laotian, lotion, codespell_lib/data/dictionary.txt:Laotions->Laotians, lotions, codespell_lib/data/dictionary.txt:Linix->Linux codespell_lib/data/dictionary.txt:Lybia->Libya codespell_lib/data/dictionary.txt:Malcom->Malcolm codespell_lib/data/dictionary.txt:Massachusettes->Massachusetts codespell_lib/data/dictionary.txt:Massachussets->Massachusetts codespell_lib/data/dictionary.txt:Massachussetts->Massachusetts codespell_lib/data/dictionary.txt:Mediteranean->Mediterranean codespell_lib/data/dictionary.txt:Michagan->Michigan codespell_lib/data/dictionary.txt:Micorsoft->Microsoft codespell_lib/data/dictionary.txt:Micosoft->Microsoft codespell_lib/data/dictionary.txt:Microfost->Microsoft codespell_lib/data/dictionary.txt:Microsfoft->Microsoft codespell_lib/data/dictionary.txt:Microsft->Microsoft codespell_lib/data/dictionary.txt:Microsof->Microsoft codespell_lib/data/dictionary.txt:Microsofot->Microsoft codespell_lib/data/dictionary.txt:Micrsft->Microsoft codespell_lib/data/dictionary.txt:Micrsoft->Microsoft codespell_lib/data/dictionary.txt:MingGW->MinGW codespell_lib/data/dictionary.txt:Miscrosoft->Microsoft codespell_lib/data/dictionary.txt:Misouri->Missouri codespell_lib/data/dictionary.txt:Missisipi->Mississippi codespell_lib/data/dictionary.txt:Missisippi->Mississippi codespell_lib/data/dictionary.txt:Mississipi->Mississippi codespell_lib/data/dictionary.txt:Mocrosoft->Microsoft codespell_lib/data/dictionary.txt:Monserrat->Montserrat codespell_lib/data/dictionary.txt:Montnana->Montana codespell_lib/data/dictionary.txt:Mythraic->Mithraic codespell_lib/data/dictionary.txt:Naploeon->Napoleon codespell_lib/data/dictionary.txt:Napolean->Napoleon codespell_lib/data/dictionary.txt:Napoleonian->Napoleonic codespell_lib/data/dictionary.txt:Nazereth->Nazareth codespell_lib/data/dictionary.txt:Newyorker->New Yorker codespell_lib/data/dictionary.txt:Novemer->November codespell_lib/data/dictionary.txt:Novermber->November codespell_lib/data/dictionary.txt:Nullabour->Nullarbor codespell_lib/data/dictionary.txt:Nuremburg->Nuremberg codespell_lib/data/dictionary.txt:Octobor->October codespell_lib/data/dictionary.txt:Palistian->Palestinian codespell_lib/data/dictionary.txt:Palistinian->Palestinian codespell_lib/data/dictionary.txt:Palistinians->Palestinians codespell_lib/data/dictionary.txt:Papanicalou->Papanicolaou codespell_lib/data/dictionary.txt:Peloponnes->Peloponnese, Peloponnesus, codespell_lib/data/dictionary.txt:Pennyslvania->Pennsylvania codespell_lib/data/dictionary.txt:Philipines->Philippines codespell_lib/data/dictionary.txt:Phillipine->Philippine codespell_lib/data/dictionary.txt:Phillippines->Philippines codespell_lib/data/dictionary.txt:Phonecian->Phoenecian codespell_lib/data/dictionary.txt:POSIX-complient->POSIX-compliant codespell_lib/data/dictionary.txt:Postdam->Potsdam codespell_lib/data/dictionary.txt:Premonasterians->Premonstratensians codespell_lib/data/dictionary.txt:Pucini->Puccini codespell_lib/data/dictionary.txt:Puertorrican->Puerto Rican codespell_lib/data/dictionary.txt:Puertorricans->Puerto Ricans codespell_lib/data/dictionary.txt:Queenland->Queensland codespell_lib/data/dictionary.txt:Rockerfeller->Rockefeller codespell_lib/data/dictionary.txt:Russion->Russian codespell_lib/data/dictionary.txt:Sacremento->Sacramento codespell_lib/data/dictionary.txt:Sagitarius->Sagittarius codespell_lib/data/dictionary.txt:Sanhedrim->Sanhedrin codespell_lib/data/dictionary.txt:Septemer->September codespell_lib/data/dictionary.txt:Sionist->Zionist codespell_lib/data/dictionary.txt:Sionists->Zionists codespell_lib/data/dictionary.txt:Sixtin->Sistine, Sixteen, codespell_lib/data/dictionary.txt:Skagerak->Skagerrak codespell_lib/data/dictionary.txt:Soalris->Solaris codespell_lib/data/dictionary.txt:Thse->these, this, codespell_lib/data/dictionary.txt:Tolkein->Tolkien codespell_lib/data/dictionary.txt:Toosday->Tuesday codespell_lib/data/dictionary.txt:Tosday->Tuesday, today, codespell_lib/data/dictionary.txt:Tuscon->Tucson codespell_lib/data/dictionary.txt:Twosday->Tuesday codespell_lib/data/dictionary.txt:Ukranian->Ukrainian codespell_lib/data/dictionary.txt:Unicde->Unicode codespell_lib/data/dictionary.txt:UnitesStates->UnitedStates codespell_lib/data/dictionary.txt:UTF8ness->UTF-8-ness codespell_lib/data/dictionary.txt:Voight->Voigt codespell_lib/data/dictionary.txt:Vulacn->Vulcan codespell_lib/data/dictionary.txt:Vulakn->Vulkan codespell_lib/data/dictionary.txt:Yementite->Yemenite, Yemeni, ```

@larsoner Would you agree to something like https://github.com/codespell-project/codespell/issues/1578#issuecomment-1874658159?

larsoner commented 10 months ago

Sure, and if we're really worried about backward compat we can make the behavior opt in with --case-sensitive or something

vEnhance commented 10 months ago

Interestingly, it seems like --uri-ignore-words-list has a completely different behavior from the "normal" words list:

$ codespell --version
2.2.6

$ cat hi.txt
https://en.wikipedia.org/wiki/acsii
https://en.wikipedia.org/wiki/ACSII

$ codespell hi.txt
hi.txt:1: acsii ==> ascii
hi.txt:2: ACSII ==> ASCII

$ codespell hi.txt --uri-ignore-words-list acsii
hi.txt:2: ACSII ==> ASCII

$ codespell hi.txt --uri-ignore-words-list ACSII
hi.txt:1: acsii ==> ascii

$ codespell hi.txt --uri-ignore-words-list ACSII,acsii

$ codespell hi.txt -L acsii

$ codespell hi.txt -L ACSII
hi.txt:1: acsii ==> ascii
hi.txt:2: ACSII ==> ASCII

Meanwhile, the documentation under codespell --help doesn't seem to hint at this difference:

  -I FILE, --ignore-words FILE
                        file that contains words that will be ignored by codespell.
                        File must contain 1 word per line. Words are case sensitive
                        based on how they are written in the dictionary file
  -L WORDS, --ignore-words-list WORDS
                        comma separated list of words to be ignored by codespell.
                        Words are case sensitive based on how they are written in
                        the dictionary file
  --uri-ignore-words-list WORDS
                        comma separated list of words to be ignored by codespell in
                        URIs and emails only. Words are case sensitive based on how
                        they are written in the dictionary file. If set to "*", all
                        misspelling in URIs and emails will be ignored.

I'm going to focus on the "normal" ignore list for this issue/PR and leave the behavior for URI unchanged I guess, but maybe the help should be reworded too?

DimitriPapadopoulos commented 10 months ago

For now, let's focus on improving the behaviour of codespell. Then we can improve the documentation.

It seems impossible to provide decent, comprehensible documentation with the current behaviour. See discussion in #3210 and specifically https://github.com/codespell-project/codespell/issues/3210#issuecomment-1812044718.

codespell-project / codespell

Can't ignore an unknown acronym only when it's written in upper case #1578