Closed lddubeau closed 7 years ago
:+1: Good catch!
Now that I revisit this, I also see a related issue: the code should catch regexes which include the USP when it's part of a wider range, e.g. [Z-\uF000]
which includes [\uD800-\uDFFF]
.
Hmmmm....
And you popped a whole can of worms...
Another issue is this:
The generated regex sets can include producing such beauties as the set expression \\u000e-\\u001e0-\u011f
which SHOULD have been encoded as the JavaScript regex set expression \\u000e-\\u{001e}0-\u011f
since '\\u001e' + '0'
makes \\u001e0
which is then parsed as possibly-an-extended-plane Unicode Character \u{001e0}
(note the five hex digits here) rather than two characters ['\u001e', '0']
.
Oh boy...
Scratch that last message of mine: yes, there are additional issues, but the one mentioned doesn't work like that. \\u001e0
is correctly parsed as \\u001e + '0'
all the time.
There is an issue with parsing \\u{12345}
Unicode codepoints though: we don't support extended planes in regexes, not really, but then the code still uses the wrong regexes to detect these.
To be continued...
By the way @lddubeau : do note that you seem to use Supplementary Plane Unicode though, due to this bit of regex: \u10000-\u10FFFF
, which then should be written as \u{10000}-\u{10FFFF}
(see also: https://rainsoft.io/what-every-javascript-developer-should-know-about-unicode/ )
However, this is currently a moot point as JISON doesn't support Astral Plane Unicode Codepoints (a.k.a. Supplementary Plane Characters, i.e. anything above U+FFFF). I'm looking into supporting ES2015 regex /u flag though, but that's future noise.
For now I'll first create a new patch release to mark the fixing of all the other bugs your issue report helped uncover! :+1:
WARNING: the error message has temporarily been disabled; it doesn't mean this will now work as expected for Astral Plane chars!
Thanks for pointing the bad syntax in \u10000-\u10FFFF
. And yes, I am using the plane because of this range. I did not check the terminology carefully and was thrown off by the error being raised because of 0xd800
, which I don't refer to in my regular expression. I've fixed the title.
By the way, The \u{10000}-\u{10FFFF}
range is not my choice. It is part of the XML specification.
And thanks for the link about JavaScript and Unicode. In the early 2000s I wrote extensive code dealing with the issues of combining characters (and that was in Java, not JavaScript) but I could get away with ignoring the astral plane back then so I've not had to pay much attention to the issues it causes.
FYI: this will take a while to fix; I've got little spare time ATM and this needs some internal rework to work correctly for the entire Unicode range.
I have the following regex in my grammar file:
And I get the following error due to the regex above:
I know the regex is the source of the error because if I remove it, then everything is fine. Specifically, the problem is \u0020-\uD7FF.
Looking at the code in
regexp-lexer.js
, I've deduced that the problem occurs whenjison-gho
computes an inverted character set when it tries optimizing the regular expression. When it computes the inverted set, one of the range boundaries is0xD7FF + 1
, and the error is triggered.I can understand complaining about a user-written regular expression that goes into the supplementary plane, but here we're talking about a regular expression that is computed behind the scenes. Should there even be an error raised on the inverted set which is computed internally?