Open vadz opened 4 years ago
Hi @vadz ,
Currently codespell is case-insensitive, so as per the readme: https://github.com/codespell-project/codespell/blob/master/README.rst
The -I flag can be used for a list of certain words to allow that are in the codespell dictionaries. The format of the file is one word per line. Invoke using: codespell -I path/to/file.txt to execute codespell referencing said list of allowed words. Important note: The list passed to -I is case-sensitive based on how it is listed in the codespell dictionaries.
So you need to use -L seh
.
Our dictionary checking needs to be case-insensitive because for example a C define is normally written in upper case so we'd want to catch SPELING
as being an error without the dictionary needing to have both speling AND SPELING in it. I guess we could theoretically make the ignores match the case of the word in the code, but if you're using a term which codespell thinks is a typo you might have it as a define in capitals, a variable and possible also in camel case, so I think it would get rather complicated. I'll leave this open though, as it is different to the camel case request and another possible feature request.
I have the same problem. I want to exclude InOut
, but not inout
(because that's possibly a typo and supposed to say input
). Is there some way?
How about filtering matches after looking them up in the dictionary?
I have the same problem. I want to exclude
InOut
, but notinout
(because that's possibly a typo and supposed to sayinput
). Is there some way?
So you want to do it based on the case of the typo?
Currently -x FILE, --exclude-file FILE
would be your only option.
How about filtering matches after looking them up in the dictionary?
Do you mean based on the correction? I don't really follow.
I have the same problem. I want to exclude
InOut
, but notinout
(because that's possibly a typo and supposed to sayinput
). Is there some way?So you want to do it based on the case of the typo?
Currently
-x FILE, --exclude-file FILE
would be your only option.
That would ignore the whole line and I'd also have to specify every line that this word occurs in, right?
How about filtering matches after looking them up in the dictionary?
Do you mean based on the correction? I don't really follow.
char* pInOut;
codespell --ignore-regex "\\Wp(?=[A-Z]\\w*\\W)"
to strip out the p
prefix for pointersInOut => input, in out
. This is where we could filter using a case-sensitive whiltelist (containing InOut
in this case)Using -L InOut
does not work and -L inout
is not feasible because I only want to ignore words with that that exact case.
Right now I worked around it by writing a small wrapper that reads a .codespellignore
file and creates an ignore-regex pattern based on that, but that's probably not really performant and seems "hacky" to me.
That would ignore the whole line and I'd also have to specify every line that this word occurs in, right?
Yes, not ideal but quite easy with grep. I do it quite a lot: https://github.com/OpenLightingProject/ola/blob/master/.codespellignorelines
It's not perfect, but for a lot of edge cases, like uint/unit, even case sensitivity wouldn't help.
Actually for this specific use case, the ignore regex is probably perfect: https://github.com/codespell-project/codespell/blob/master/codespell_lib/_codespell.py#L277-L283
Doh, just realised you're doing that already! You can vanish pInOut before you even start matching things. Nice trick with stripping the p's off the front by the way!
Care to share your wrapper @Holzhaus as it might help others.
@vadz I think the above would work for you too?
3. codespell warns `InOut => input, in out`. This is where we could filter using a case-sensitive whiltelist (containing `InOut` in this case)
Understood. So you're talking about filtering based on the case of the MATCH rather than the dictionary entry. You probably either want two whitelists, or some way to flag whether each entry should be case sensitive or not. It's certainly possible!
Care to share your wrapper @Holzhaus as it might help others.
https://github.com/mixxxdj/mixxx/pull/3066
The wrapper also parses the git diff and filters out all results that are not coming from added lines (because our legacy codebase is quite large, and fixing other peoples typos in merge commits is annoying).
@peternewman It looks like it ought to work, but I couldn't test it because of this:
% pip install git+https://github.com/codespell-project/codespell@293f12152f7a6cef6e87cd6c78d8aee69db7af28
Collecting git+https://github.com/codespell-project/codespell@293f12152f7a6cef6e87cd6c78d8aee69db7af28
Cloning https://github.com/codespell-project/codespell (to revision 293f12152f7a6cef6e87cd6c78d8aee69db7af28) to /tmp/pip-req-build-8HAgKz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-req-build-8HAgKz/setup.py", line 10, in <module>
from codespell_lib import __version__
File "codespell_lib/__init__.py", line 1, in <module>
from ._codespell import main, _script_main, VERSION as __version__ # noqa
File "codespell_lib/_codespell.py", line 24, in <module>
import configparser
ImportError: No module named configparser
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-req-build-8HAgKz/
I didn't look into it in details, so it's perfectly possible that I'm doing something wrong, but I couldn't actually test your suggestion...
This is a builtin module going back to 2.7:
https://docs.python.org/2.7/library/configparser.html
Are you on 2.6? It's unsupported
You need at least Python 3 I think. In python 2.7 the module has another name (ConfigParser
instead of configparser
) and Python 2 has reached end-of-life months ago anyway.
Oops, sorry, pip indeed uses 2.7.16 by default on my Debian 10 (Buster) system, I keep forgetting that I need to use pip3 explicitly. Its output was rather weird anyhow:
Collecting git+https://github.com/codespell-project/codespell@293f12152f7a6cef6e87cd6c78d8aee69db7af28
Cloning https://github.com/codespell-project/codespell (to revision 293f12152f7a6cef6e87cd6c78d8aee69db7af28) to /tmp/pip-req-build-5la8ce8k
Building wheels for collected packages: codespell
Running setup.py bdist_wheel for codespell ... error
Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-req-build-5la8ce8k/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-wkwq0mmi --python-tag cp37:
usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
or: -c --help [cmd1 cmd2 ...]
or: -c --help-commands
or: -c cmd --help
error: invalid command 'bdist_wheel'
----------------------------------------
Failed building wheel for codespell
Running setup.py clean for codespell
Failed to build codespell
Installing collected packages: codespell
Running setup.py install for codespell ... done
Successfully installed codespell-2.0.dev0
with the "Failed" line in ominous red, but it seems to have been installed successfully and I can confirm that adding --ignore-regex '\WSEH\W'
to the command line (and removing the suppression lines) works for me, thanks.
You need to install wheel before installation:
pip3 install wheel
Oops, sorry, pip indeed uses 2.7.16 by default on my Debian 10 (Buster) system, I keep forgetting that I need to use pip3 explicitly.
I think this should do the trick for you:
sudo update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 1
This works OK to ignore three letter upper case words (i.e. many acronyms):
codespell --ignore-regex '\W[A-Z][A-Z][A-Z]\W'
Or being more clever, 2 or 3 letter acronyms:
codespell --ignore-regex '\W([A-Z]{2,3})\W'
However, this fails with consecutive words to be ignored via the regex, see #2049 for a potential solution.
I think this is a pretty reasonable workflow:
codespell
SEH ==> SHE
SEH
to the ignore listcodespell
SEH ==> SHE
I'd like steps 5 and 6 to be different. In order of preference:
SEH
in the ignore list matches SEH
in codeSEH
in the ignore list does nothing, but a warning/error is produced that there is a useless entry in the ignore list (which says why it's useless)SEH
in the ignore list does nothing and there is no warning/error at all (current situation)Pull requests are welcome, but make sure you address components such as https://github.com/codespell-project/codespell/issues/1578#issuecomment-654183259.
Pull requests are welcome, but make sure you address components such as #1578 (comment).
Just to be clear, would you consider a pull request where:
seh
would match seh
, Seh
, SEH
, etc.), butSEH
matches only SEH
), rather than the current behavior of doing nothing with no warning.I believe this is also the suggestion in https://github.com/codespell-project/codespell/issues/1578#issuecomment-1763478663
I am not the maintainer, but I think that would be a good, long-awaited change. I am only worried about compatibility. I don't have a clear idea (yet?) of what could possibly go wrong in existing codebases when we release such a change. Comments on that?
Also, what about words in the dictionary themselves that are not all lowercase? Such as Feburary
? Should the case have some special meaning? I am referring to the typo, of course, not to the suggested fixes which should keep the uppercase as required.
@larsoner Would you agree to something like https://github.com/codespell-project/codespell/issues/1578#issuecomment-1874658159?
Sure, and if we're really worried about backward compat we can make the behavior opt in with --case-sensitive
or something
Interestingly, it seems like --uri-ignore-words-list
has a completely different behavior from the "normal" words list:
$ codespell --version
2.2.6
$ cat hi.txt
https://en.wikipedia.org/wiki/acsii
https://en.wikipedia.org/wiki/ACSII
$ codespell hi.txt
hi.txt:1: acsii ==> ascii
hi.txt:2: ACSII ==> ASCII
$ codespell hi.txt --uri-ignore-words-list acsii
hi.txt:2: ACSII ==> ASCII
$ codespell hi.txt --uri-ignore-words-list ACSII
hi.txt:1: acsii ==> ascii
$ codespell hi.txt --uri-ignore-words-list ACSII,acsii
$ codespell hi.txt -L acsii
$ codespell hi.txt -L ACSII
hi.txt:1: acsii ==> ascii
hi.txt:2: ACSII ==> ASCII
Meanwhile, the documentation under codespell --help
doesn't seem to hint at this difference:
-I FILE, --ignore-words FILE
file that contains words that will be ignored by codespell.
File must contain 1 word per line. Words are case sensitive
based on how they are written in the dictionary file
-L WORDS, --ignore-words-list WORDS
comma separated list of words to be ignored by codespell.
Words are case sensitive based on how they are written in
the dictionary file
--uri-ignore-words-list WORDS
comma separated list of words to be ignored by codespell in
URIs and emails only. Words are case sensitive based on how
they are written in the dictionary file. If set to "*", all
misspelling in URIs and emails will be ignored.
I'm going to focus on the "normal" ignore list for this issue/PR and leave the behavior for URI unchanged I guess, but maybe the help should be reworded too?
For now, let's focus on improving the behaviour of codespell. Then we can improve the documentation.
It seems impossible to provide decent, comprehensible documentation with the current behaviour. See discussion in #3210 and specifically https://github.com/codespell-project/codespell/issues/3210#issuecomment-1812044718.
I have a lot of occurrences of the word SEH in my code and I'd like codespell to leave it alone, but this doesn't seem to work, using neither
-I
nor-L
option, e.g.:The problem seems to be the case, i.e. if I use
-L seh
, it does recognize "SEH", but I'd prefer not to have to do it, as "seh" is much more likely to be a misspelling of "she" than anything else -- "SEH" itself is always written in upper case.I'm not sure if it's the same as #1128, as it happens for camel case identifier there, while my problem doesn't seem to be related to camel case (or an identifier, for that matter, as the problematic acronym only occurs inside comments), sorry in advance if it's the same bug and so an already known problem.
P.S. This is using 1.17.1 from Debian unstable.