Closed Marcono1234 closed 1 year ago
Relevant portion:
The following Predefined Character classes and POSIX character classes are in conformance with the recommendation of Annex C: Compatibility Properties of Unicode Technical Standard #18: Unicode Regular Expressions, when UNICODE_CHARACTER_CLASS flag is specified.
\d | A digit: \p{IsDigit} |
---|---|
Relevant portion: ...
Though as mentioned, this affects more than just \d:

- \d: "... [0-9]"
- \D: "... [^0-9]"
- \s: "... [\r\n\t\f\v ]"
- \S: "... [^\r\n\t\f\v ]"
- \w: "... [a-zA-Z0-9_]"
- \W: "... [^a-zA-Z0-9_]"

This list might be incomplete, please verify this yourself as well.
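To check which classes are affected, a small probe like the one below may help. It is only a sketch: the class name `UnicodeClassDemo`, the helper `matches`, and the three probe characters are chosen for illustration and are not taken from regex101 or the `Pattern` documentation. It assumes that regex101's "Unicode matching" option corresponds to `Pattern.UNICODE_CHARACTER_CLASS`.

```java
import java.util.regex.Pattern;

public class UnicodeClassDemo {
    // Hypothetical helper: true if the whole input matches the regex under the given flags.
    public static boolean matches(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Each row: a character class and one non-ASCII probe character (illustrative choices).
        String[][] probes = {
            {"\\d", "\u0661"}, // ARABIC-INDIC DIGIT ONE
            {"\\D", "\u0661"},
            {"\\s", "\u00A0"}, // NO-BREAK SPACE
            {"\\S", "\u00A0"},
            {"\\w", "\u00E9"}, // LATIN SMALL LETTER E WITH ACUTE
            {"\\W", "\u00E9"},
        };
        for (String[] p : probes) {
            boolean byDefault = matches(p[0], p[1], 0);
            boolean unicode = matches(p[0], p[1], Pattern.UNICODE_CHARACTER_CLASS);
            System.out.printf("%s vs U+%04X: default=%b, unicode=%b%n",
                    p[0], (int) p[1].charAt(0), byDefault, unicode);
        }
    }
}
```

For every row the two modes should disagree, which suggests the ASCII-only wording in the "Explanation" tab is inaccurate for all six classes when the flag is on.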
Thank you, I will fix this in the next major release.
Thanks a lot!
Just to clarify, is it intended (maybe due to technical limitations) that the description for Unicode mode differs from the Pattern documentation?
Class | Pattern | regex101 |
---|---|---|
\d | A digit: \p{IsDigit} | a digit zero through nine in any script except ideographic scripts (equivalent to \p{Nd}) |
\s | A whitespace character: \p{IsWhite_Space} | any kind of invisible character (equivalent to [\p{Z}\h\v]) |
... | ... | ... |
Feature
For Java, when the "Unicode matching" flag is enabled, some of the character classes behave differently; see the java.util.regex.Pattern documentation, table below the paragraph "The following Predefined Character classes...".

It might be useful if Regex101 could adjust the description of the character classes in the "Explanation" tab depending on whether "Unicode matching" is enabled (or disabled). For example, when "Unicode matching" is enabled, ١ (U+0661) matches the pattern \d, but the "Explanation" claims "equivalent to [0-9]".

Note that this affects more than just \d; see the table in the Pattern documentation mentioned above.
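The U+0661 case can be reproduced outside regex101 with a few lines of Java. This is a minimal sketch, assuming regex101's "Unicode matching" option maps to `Pattern.UNICODE_CHARACTER_CLASS` (or the equivalent embedded flag `(?U)`); the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class ArabicDigitDemo {
    // ARABIC-INDIC DIGIT ONE, the character from the example above.
    public static final String ARABIC_INDIC_ONE = "\u0661";

    // \d without any flags: ASCII digits only.
    public static boolean defaultMode(String s) {
        return Pattern.compile("\\d").matcher(s).matches();
    }

    // \d with the UNICODE_CHARACTER_CLASS flag passed to compile().
    public static boolean flagMode(String s) {
        return Pattern.compile("\\d", Pattern.UNICODE_CHARACTER_CLASS).matcher(s).matches();
    }

    // Same flag, enabled through the embedded flag expression (?U).
    public static boolean embeddedFlagMode(String s) {
        return Pattern.compile("(?U)\\d").matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(defaultMode(ARABIC_INDIC_ONE));      // false
        System.out.println(flagMode(ARABIC_INDIC_ONE));         // true
        System.out.println(embeddedFlagMode(ARABIC_INDIC_ONE)); // true
    }
}
```

With the flag on, "equivalent to [0-9]" no longer describes what the engine does, which is the mismatch this request is about.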