Java: Update explanation when "Unicode matching" is enabled (or disabled)

firasdib / Regex101

This repository is currently only used for issue tracking for www.regex101.com


Java: Update explanation when "Unicode matching" is enabled (or disabled) #2022

Closed Marcono1234 closed 1 year ago

Marcono1234 commented 1 year ago

Feature

For Java, when the "Unicode matching" flag is enabled, some of the character classes behave differently; see the java.util.regex.Pattern documentation, specifically the table below the paragraph "The following Predefined Character classes...".

It might be useful if Regex101 could adjust the description of the character classes in the "Explanation" tab depending on whether "Unicode matching" is enabled (or disabled). For example, when "Unicode matching" is enabled, ١ (U+0661, ARABIC-INDIC DIGIT ONE) matches the pattern \d, but the "Explanation" still claims "equivalent to [0-9]" (screenshot of the match omitted).

Note that this affects more than just \d, see the table in the Pattern documentation mentioned above.
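The behavior described above can be reproduced outside Regex101 with a few lines of Java. This is a minimal sketch, assuming that Regex101's "Unicode matching" option corresponds to `Pattern.UNICODE_CHARACTER_CLASS`; the class name `UnicodeDigitDemo` and the helper `matchesDigit` are made up for illustration.

```java
import java.util.regex.Pattern;

public class UnicodeDigitDemo {
    // Hypothetical helper: true if the whole input matches \d in the given mode.
    static boolean matchesDigit(String input, boolean unicodeMatching) {
        int flags = unicodeMatching ? Pattern.UNICODE_CHARACTER_CLASS : 0;
        return Pattern.compile("\\d", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String arabicIndicOne = "\u0661"; // ARABIC-INDIC DIGIT ONE

        // Default mode: \d only covers ASCII 0-9.
        System.out.println(matchesDigit(arabicIndicOne, false)); // false

        // Unicode mode: \d behaves like \p{IsDigit}.
        System.out.println(matchesDigit(arabicIndicOne, true));  // true

        // ASCII digits match in both modes.
        System.out.println(matchesDigit("7", false)); // true
        System.out.println(matchesDigit("7", true));  // true
    }
}
```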

working-name commented 1 year ago

Relevant portion:

The following Predefined Character classes and POSIX character classes are in conformance with the recommendation of Annex C: Compatibility Properties of Unicode Technical Standard #18: Unicode Regular Expressions, when UNICODE_CHARACTER_CLASS flag is specified.

\d	A digit: \p{IsDigit}
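The quoted passage also covers the POSIX character classes, not only the predefined ones. The following sketch illustrates that with `\p{Alpha}` and `\p{Digit}`, using the embedded flag `(?U)`, which the `Pattern` documentation lists as the inline form of `UNICODE_CHARACTER_CLASS`. The class name `PosixUnicodeDemo` and its helper are hypothetical.

```java
import java.util.regex.Pattern;

public class PosixUnicodeDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";         // LATIN SMALL LETTER E WITH ACUTE
        String arabicIndicOne = "\u0661"; // ARABIC-INDIC DIGIT ONE

        // Default mode: POSIX classes are ASCII-only.
        System.out.println(matches("\\p{Alpha}", eAcute));         // false
        System.out.println(matches("\\p{Digit}", arabicIndicOne)); // false

        // (?U) switches them to their Unicode definitions.
        System.out.println(matches("(?U)\\p{Alpha}", eAcute));         // true
        System.out.println(matches("(?U)\\p{Digit}", arabicIndicOne)); // true
    }
}
```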

Marcono1234 commented 1 year ago

Relevant portion: ...

Though as mentioned, this affects more than just \d:

\d: "... [0-9]"
\D: "... [^0-9]"
\s: "... [\r\n\t\f\v ]"
\S: "... [^\r\n\t\f\v ]"
\w: "... [a-zA-Z0-9_]"
\W: "... [^a-zA-Z0-9_]"

This list might be incomplete; please verify it yourself as well.
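To check each of the listed classes, a small comparison along these lines may help. It is only a sketch: the sample characters (U+0661, U+00E9, U+00A0) are picked for illustration, and the class name `UnicodeClassComparison` and its helper are hypothetical.

```java
import java.util.regex.Pattern;

public class UnicodeClassComparison {
    // Hypothetical helper: match the whole input with or without Unicode mode.
    static boolean matches(String regex, String input, boolean unicodeMatching) {
        int flags = unicodeMatching ? Pattern.UNICODE_CHARACTER_CLASS : 0;
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Each row: a character class, its negation, and a non-ASCII sample character.
        String[][] cases = {
            {"\\d", "\\D", "\u0661"}, // ARABIC-INDIC DIGIT ONE
            {"\\w", "\\W", "\u00E9"}, // LATIN SMALL LETTER E WITH ACUTE
            {"\\s", "\\S", "\u00A0"}, // NO-BREAK SPACE
        };
        for (String[] c : cases) {
            String sample = c[2];
            System.out.printf("U+%04X  %s: default=%b unicode=%b  %s: default=%b unicode=%b%n",
                    sample.codePointAt(0),
                    c[0], matches(c[0], sample, false), matches(c[0], sample, true),
                    c[1], matches(c[1], sample, false), matches(c[1], sample, true));
        }
    }
}
```

For each sample, the positive class should only match in Unicode mode and the negated class only in default mode, so an explanation that describes `\w` as `[a-zA-Z0-9_]` is accurate only while the flag is off.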

firasdib commented 1 year ago

Thank you, I will fix this in the next major release.

Marcono1234 commented 1 year ago

Thanks a lot!

Just to clarify, is it intended (maybe due to technical limitations) that the description for Unicode mode differs from the Pattern documentation?

| Class | `Pattern` | regex101 |
| --- | --- | --- |
| `\d` | A digit: `\p{IsDigit}` | a digit zero through nine in any script except ideographic scripts (equivalent to `\p{Nd}`) |
| `\s` | A whitespace character: `\p{IsWhite_Space}` | any kind of invisible character (equivalent to `[\p{Z}\h\v]`) |
| ... | ... | ... |
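One way to see whether two descriptions in the table actually describe the same set of characters is to compare the patterns over every code point on the Java runtime in question. The sketch below does that; the class name `ClassDiff` and the method `differences` are hypothetical, and the result may vary with the Unicode version of the JDK, so no expected output is given.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClassDiff {
    // Returns every code point that exactly one of the two patterns matches.
    static List<Integer> differences(Pattern a, Pattern b) {
        List<Integer> result = new ArrayList<>();
        Matcher ma = a.matcher("");
        Matcher mb = b.matcher("");
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            String s = new String(Character.toChars(cp));
            if (ma.reset(s).matches() != mb.reset(s).matches()) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int unicode = Pattern.UNICODE_CHARACTER_CLASS;

        // \s in Unicode mode versus the class quoted from the regex101 explanation.
        Pattern javaWhitespace = Pattern.compile("\\s", unicode);
        Pattern regex101Whitespace = Pattern.compile("[\\p{Z}\\h\\v]", unicode);

        // Prints one line per code point on which the two patterns disagree.
        for (int cp : differences(javaWhitespace, regex101Whitespace)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

The same helper can be pointed at `\d` in Unicode mode and `\p{Nd}` to compare the first row of the table.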