codespell-project / codespell

check code for common misspellings
GNU General Public License v2.0

Request: checking within snake_case by default #2730

Open jamesbraza opened 1 year ago

jamesbraza commented 1 year ago
bad_spellling = "bad"  # Not detected in codespell==2.2.2

Can codespell's default regex support splitting on the underscores in snake_case names and detecting misspellings within the individual parts?



DimitriPapadopoulos commented 1 year ago

The underscore (_) is part of \w. From https://docs.python.org/3/library/re.html#regular-expression-syntax:

\w

For Unicode (str) patterns: Matches Unicode word characters; this includes alphanumeric characters (as defined by str.isalnum()) as well as the underscore (_). If the ASCII flag is used, only [a-zA-Z0-9_] is matched.

For 8-bit (bytes) patterns: Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, matches characters considered alphanumeric in the current locale and the underscore.

Is there an easy way to get \w except _ in the non-ASCII case? It would help checking snake_case.

https://github.com/codespell-project/codespell/blob/ec0a5b9e4d4167965751920368d55c3615cb9b20/codespell_lib/_codespell.py#L31
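One commonly used idiom for "\w except _" that already works with Unicode patterns is the negated class [^\W_], which matches any character that is neither a non-word character nor an underscore. A minimal sketch follows; the sample strings and constant names are made up for illustration, and this is not codespell's actual word regex:

```python
import re

# \w+ keeps the underscore inside the token.
WORD_WITH_UNDERSCORE = re.compile(r"\w+")
# [^\W_]+ means "word characters, minus the underscore"; it stays Unicode-aware.
WORD_WITHOUT_UNDERSCORE = re.compile(r"[^\W_]+")

line = 'bad_spellling = "bad"'
whole = WORD_WITH_UNDERSCORE.findall(line)      # ['bad_spellling', 'bad']
parts = WORD_WITHOUT_UNDERSCORE.findall(line)   # ['bad', 'spellling', 'bad']

# Non-ASCII letters are still matched (escapes used to keep them precomposed).
unicode_parts = WORD_WITHOUT_UNDERSCORE.findall("caf\u00e9_na\u00efve")
```

Note that digits are still part of [^\W_], so a name such as uint8_t would split into uint8 and t.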

Unicode regexes with set operations might help, but they are not available in Python yet. From https://docs.python.org/3/library/re.html#regular-expression-syntax:

  • Support of nested sets and set operations as in Unicode Technical Standard #18 might be added in the future. This would change the syntax, so to facilitate this change a FutureWarning will be raised in ambiguous cases for the time being. That includes sets starting with a literal '[' or containing literal character sequences '--', '&&', '~~', and '||'. To avoid a warning escape them with a backslash.

This is what I have found so far, but I haven't been able to apply it to this use case yet:

DimitriPapadopoulos commented 1 year ago

A drawback of such a change is that we wouldn't be able to fix some (but not all) of the misspellings that contain an underscore, at least not by default:

clock_getttime->clock_gettime
phy_interace->phy_interface
unint8_t->uint8_t
__attribyte__->__attribute__
__cpluspus->__cplusplus
__cpusplus->__cplusplus

Unless of course, you add new misspellings such as cpluspus.

Gabrielcarvfer commented 1 year ago

I've been using the following for camel case, hyphen case and snake case.

(?<![a-z])[a-z'`]+|[A-Z][a-z'`]*|[a-z]+'[a-z]*|[a-z]+(?=[_-])|[a-z]+(?=[A-Z])|\d+
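To see how the regex above tokenizes identifiers, here is a small sketch using Python's re.findall. The sample identifiers and the name SUB_WORD are chosen only for illustration:

```python
import re

# The pattern quoted above, unchanged.
SUB_WORD = re.compile(
    r"(?<![a-z])[a-z'`]+|[A-Z][a-z'`]*|[a-z]+'[a-z]*"
    r"|[a-z]+(?=[_-])|[a-z]+(?=[A-Z])|\d+"
)

snake = SUB_WORD.findall("bad_spellling")   # ['bad', 'spellling']
camel = SUB_WORD.findall("camelCaseWord")   # ['camel', 'Case', 'Word']
kebab = SUB_WORD.findall("kebab-case")      # ['kebab', 'case']
mixed = SUB_WORD.findall("uint8_t")         # ['uint', '8', 't']
```

Underscores and hyphens are never part of a match, so whole-identifier dictionary entries such as clock_getttime cannot be looked up from these tokens.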

It does miss the cases where full words should be checked, but sub-word typos seem to be the common case. Adding a second pass that checks only full words would be a nice way to catch typos in documentation.
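One way to combine full-word and sub-word checking is sketched below. It is only an illustration under stated assumptions: MISSPELLINGS is a hypothetical toy dictionary, find_typos is a made-up helper rather than a codespell API, and the order (whole token first, then its parts) is one possible choice among several:

```python
import re

# Hypothetical toy dictionary; real dictionaries are far larger.
MISSPELLINGS = {
    "clock_getttime": "clock_gettime",
    "spellling": "spelling",
}

WHOLE_TOKEN = re.compile(r"\w+")    # keeps underscores inside a token
SUB_WORD = re.compile(r"[^\W_]+")   # splits tokens on underscores


def find_typos(text):
    """Return (typo, fix) pairs, checking whole tokens before their parts."""
    found = []
    for token in WHOLE_TOKEN.findall(text):
        if token in MISSPELLINGS:
            # A whole-token entry wins, so its parts are not checked again.
            found.append((token, MISSPELLINGS[token]))
            continue
        for part in SUB_WORD.findall(token):
            if part in MISSPELLINGS:
                found.append((part, MISSPELLINGS[part]))
    return found


result = find_typos("clock_getttime(bad_spellling)")
```

Here result contains one whole-token hit (clock_getttime) and one sub-word hit (spellling), so dictionary entries containing an underscore keep working while snake_case parts are also checked.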

yarikoptic commented 3 months ago

FWIW, I searched my way into this issue after seeing typos (the tool) find typos in snake_case words in

Disabling CamelCase and ACRONYM checks by default might also be wise, but they would likely need to be configurable.