GerHobbelt / jison

bison / YACC / LEX in JavaScript (LALR(1), SLR(1), etc. lexer/parser generator)
https://gerhobbelt.github.io/jison/
MIT License
118 stars 20 forks source link

Supplementary plane error due to a character I'm not actually using explicitly in my regex. #6

Closed lddubeau closed 7 years ago

lddubeau commented 7 years ago

I have the following regex in my grammar file:

[\t\n\r\u0020-\uD7FF\uE000\uFFFD\u10000-\u10FFFF]

And I get the following error due to the regex above:

Error: You have Unicode Supplementary Plane content in a regex set: JavaScript has severe problems with Supplementary Plane content, particularly in regexes, so you are kindly required to get rid of this stuff. Sorry! (Offending UCS-2 code which triggered this: 0xd800)

I know the regex is the source of the error because if I remove it, then everything is fine. Specifically, the problem is \u0020-\uD7FF.

Looking at the code in regexp-lexer.js, I've deduced that the problem occurs when jison-gho computes an inverted character set when it tries optimizing the regular expression. When it computes the inverted set, one of the range boundaries is 0xD7FF + 1, and the error is triggered.

I can understand complaining about a user-written regular expression that goes into the supplementary plane, but here we're talking about a regular expression that is computed behind the scenes. Should there even be an error raised on the inverted set which is computed internally?

GerHobbelt commented 7 years ago

:+1: Good catch!

Now that I revisit this, I also see a related issue: the code should catch regexes which include the USP when it's part of a wider range, e.g. [Z-\uF000] which includes [\uD800-\uDFFF].

Hmmmm....

GerHobbelt commented 7 years ago

And you popped a whole can of worms...

Another issue is this: The generated regex sets can include producing such beauties as the set expression \\u000e-\\u001e0-\u011f which SHOULD have been encoded as the JavaScript regex set expression \\u000e-\\u{001e}0-\u011f since '\\u001e' + '0' makes \\u001e0 which is then parsed as possibly-an-extended-plane Unicode Character \u{001e0} (note the five hex digits here) rather than two characters ['\u001e', '0'].

Oh boy...

GerHobbelt commented 7 years ago

Scratch that last message of mine: yes, there are additional issues, but the one mentioned doesn't work like that. \\u001e0 is correctly parsed as \\u001e + '0' all the time. There is an issue with parsing \\u{12345} Unicode codepoints though: we don't support extended planes in regexes, not really, but then the code still uses the wrong regexes to detect these.

To be continued...

GerHobbelt commented 7 years ago

By the way @lddubeau : do note that you seem to use Supplementary Plane Unicode though, due to this bit of regex: \u10000-\u10FFFF, which then should be written as \u{10000}-\u{10FFFF} (see also: https://rainsoft.io/what-every-javascript-developer-should-know-about-unicode/ )

However, this is currently a moot point as JISON doesn't support Astral Plane Unicode Codepoints (a.k.a. Supplementary Plane Characters, i.e. anything above U+FFFF). I'm looking into supporting ES2015 regex /u flag though, but that's future noise.

For now I'll first create a new patch release to mark the fixing of all the other bugs your issue report helped uncover! :+1:

GerHobbelt commented 7 years ago

WARNING: the error message has temporarily been disabled; it doesn't mean this will now work as expected for Astral Plane chars!

lddubeau commented 7 years ago

Thanks for pointing the bad syntax in \u10000-\u10FFFF. And yes, I am using the plane because of this range. I did not check the terminology carefully and was thrown off by the error being raised because of 0xd800, which I don't refer to in my regular expression. I've fixed the title.

By the way, The \u{10000}-\u{10FFFF} range is not my choice. It is part of the XML specification.

And thanks for the link about JavaScript and Unicode. In the early 2000s I wrote extensive code dealing with the issues of combining characters (and that was in Java, not JavaScript) but I could get away with ignoring the astral plane back then so I've not had to pay much attention to the issues it causes.

GerHobbelt commented 7 years ago

FYI: this will take a while to fix; I've got little spare time ATM and this needs some internal rework to work correctly for the entire Unicode range.