Open jamesbraza opened 1 year ago
The underscore (_) is part of \w. From https://docs.python.org/3/library/re.html#regular-expression-syntax:
\w
For Unicode (str) patterns: Matches Unicode word characters; this includes alphanumeric characters (as defined by str.isalnum()) as well as the underscore (_). If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
For 8-bit (bytes) patterns: Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, matches characters considered alphanumeric in the current locale and the underscore.
Is there an easy way to get \w except _ in the non-ASCII case? It would help checking snake_case.
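One possible answer, as a minimal sketch rather than anything codespell does today: a negated set containing \W and _ matches "not (non-word or underscore)", which is \w without the underscore and still works for Unicode str patterns. The helper name snake_parts is made up for illustration.

```python
import re

# [^\W_] is a double negative: any character that is neither a non-word
# character nor an underscore, i.e. \w minus "_" (Unicode-aware for str).
WORD_PART = re.compile(r"[^\W_]+")


def snake_parts(identifier):
    """Return the runs of word characters between underscores."""
    return WORD_PART.findall(identifier)


print(snake_parts("phy_interace"))    # ['phy', 'interace']
print(snake_parts("größe_überlauf"))  # ['größe', 'überlauf']
print(snake_parts("__cpluspus"))      # ['cpluspus']
```

Note that digits are still part of each run, so unint8_t splits into unint8 and t.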
Unicode regexes with set operations might help, but they are not available in Python yet. From https://docs.python.org/3/library/re.html#regular-expression-syntax:
- Support of nested sets and set operations as in Unicode Technical Standard #18 might be added in the future. This would change the syntax, so to facilitate this change a FutureWarning will be raised in ambiguous cases for the time being. That includes sets starting with a literal '[' or containing literal character sequences '--', '&&', '~~', and '||'. To avoid a warning escape them with a backslash.
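To make the quoted note concrete, here is a small sketch that checks whether compiling a pattern emits a FutureWarning. The helper name emits_future_warning is hypothetical. It assumes a CPython version whose re module issues the "possible nested set" warning for a set that starts with a literal '['.

```python
import re
import warnings


def emits_future_warning(pattern):
    """Return True if compiling `pattern` triggers a FutureWarning."""
    re.purge()  # clear the compile cache so the pattern is parsed again
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        re.compile(pattern)
    return any(issubclass(w.category, FutureWarning) for w in caught)


# A set starting with a literal '[' is ambiguous under the future syntax.
print(emits_future_warning("[[a]"))
# Escaping the bracket with a backslash avoids the ambiguity.
print(emits_future_warning(r"[\[a]"))
```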
This is what I have found so far, but I haven't been able to apply it to this use case yet:
A drawback of such a change is that we wouldn't be able to fix some (but not all) of the misspellings that contain an underscore, at least not by default:
clock_getttime->clock_gettime
phy_interace->phy_interface
unint8_t->uint8_t
__attribyte__->__attribute__
__cpluspus->__cplusplus
__cpusplus->__cplusplus
Unless of course, you add new misspellings such as cpluspus.
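The trade-off can be sketched with two toy lookup tables. Both dictionaries and both helper names below are hypothetical and only reuse the examples listed above. They are not codespell's data or code.

```python
# Hypothetical whole-identifier entries, copied from the examples above.
WHOLE_WORD_FIXES = {
    "clock_getttime": "clock_gettime",
    "unint8_t": "uint8_t",
    "__cpluspus": "__cplusplus",
}

# Hypothetical per-particle entries that would have to be added instead.
PART_FIXES = {
    "getttime": "gettime",
    "cpluspus": "cplusplus",
}


def fix_whole(token, fixes):
    """Look the identifier up as a single word (underscore is part of it)."""
    return fixes.get(token, token)


def fix_parts(token, fixes):
    """Split on underscores and look up each particle separately."""
    return "_".join(fixes.get(part, part) for part in token.split("_"))


print(fix_whole("unint8_t", WHOLE_WORD_FIXES))    # uint8_t
print(fix_parts("unint8_t", WHOLE_WORD_FIXES))    # unint8_t (no longer fixed)
print(fix_parts("__cpluspus", PART_FIXES))        # __cplusplus
print(fix_parts("clock_getttime", PART_FIXES))    # clock_gettime
```

Once the identifier is split, the whole-word entries never match, so each affected misspelling needs a particle-level entry such as cpluspus to stay fixable.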
I've been using the following for camel case, hyphen case and snake case.
(?<![a-z])[a-z'`]+|[A-Z][a-z'`]*|[a-z]+'[a-z]*|[a-z]+(?=[_-])|[a-z]+(?=[A-Z])|\d+
It indeed misses the cases where full words should be considered/checked, but sub-word typos seem to be the common case. Adding a second pass to check just full words would be nice to check for type errors in documentation.
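For reference, this is roughly how the regex above behaves when applied with Python's re.findall. The pattern is copied unchanged. The function name split_words and the sample identifiers are only illustrative.

```python
import re

# The pattern from the comment above, unchanged.
SUBWORD = re.compile(
    r"(?<![a-z])[a-z'`]+|[A-Z][a-z'`]*|[a-z]+'[a-z]*"
    r"|[a-z]+(?=[_-])|[a-z]+(?=[A-Z])|\d+"
)


def split_words(identifier):
    """Return the sub-words the pattern finds inside an identifier."""
    return SUBWORD.findall(identifier)


print(split_words("phy_interace"))   # ['phy', 'interace']
print(split_words("camelCaseWord"))  # ['camel', 'Case', 'Word']
print(split_words("hyphen-case"))    # ['hyphen', 'case']
print(split_words("unint8_t"))       # ['unint', '8', 't']
```

The last line shows the limitation mentioned above: unint8_t is never seen as a whole word, so a dictionary entry for the full identifier cannot match.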
FWIW, searched myself into this issue having seen typos finding typos in snake_case words in
Disabling the CamelCased and ACRONYM checks by default might also be wise, but that would likely need to be configurable.
Can codespell's default regex(s) support splitting along snake case's underscore and determining misspellings within particles?
Related
--regex to detect misspellings within snake case, and it led to this draft PR.