for-GET / jesse

jesse (JSon Schema Erlang) is an implementation of a JSON Schema validator for Erlang.
https://github.com/for-get/jesse
Apache License 2.0
126 stars 64 forks source link

"pattern" and "patternProperties" do not support unicode properties #104

Closed nalundgaard closed 2 years ago

nalundgaard commented 3 years ago

I have noticed that when putting regex patterns in pattern and patternProperties, \w does not match unicode letters (like ū, , п) as expected. You have to use [\p{L}\p{N}_] instead. It looks like this is because the patterns are not compiled/executed with the ucp in re:compile/2:

ucp

Specifies that Unicode character properties are to be used when resolving \B, \b, \D, \d, \S, \s, \W and \w. Without this flag, only ISO Latin-1 properties are used. Using Unicode properties hurts performance, but is semantically correct when working with Unicode characters beyond the ISO Latin-1 range.

Later it explains:

By default, in unicode mode, characters with values > 255, that is, all characters outside the ISO Latin-1 character set, never match \d, \s, or \w, and always match \D, \S, and \W. These sequences retain their original meanings from before UTF support was available, mainly for efficiency reasons. However, if option ucp is set, the behavior is changed so that Unicode properties are used to determine character types, as follows:

\d Any character that \p{Nd} matches (decimal digit) \s Any character that\p{Z} or \h or \v \w Any character that matches \p{L} or \p{N} matches, plus underscore

Is this intended to not use this behavior? It seems unexpected to me (having this behavior off by default is not generally expected). Obviously I have a workaround just using the pattern I mentioned ([\p{L}\p{N}_]), so not a big deal if this isn't the behavior you all want.

andreineculau commented 3 years ago

Nicholas, you are right across the board. Thanks for the detailed comments.

Regarding intention - I believe the initial author didn't think of it as a must at that time. If memory serves me well I enabled unicode matching, but didn't go for the ucp option. There's a question around performance and I guess in an extended setup with many schemata, every millisecond adds up. I remember vaguely that in my setup I was experiencing a big hit at the time. Possibly/probably it's all gone by now.

One idea might be to add an option re_options to jesse to pass on re:compile/2 options (actually re:run/2 just to be pedantic, since that's what gets called from jesse) and default to unicode and ucp. This way there are less surprises, and those who are impacted by performance loss, can configure jesse to do less and be faster.

Would you be up for filing a PR? 🙏

nalundgaard commented 3 years ago

Yeah, sure! I'll try to have a look at the requisite change (and figuring out how to add a test for it) when I have some free time. Thanks for the explanation, and totally agree about the performance cost issues if you don't need it, the re module is quite... slow.