Closed jbclements closed 7 years ago
I believe that Sam just fixed this so that we can handle up to \uFFFF. I don't think we can handle anything beyond that, so I will leave this in as a feature request.
Really? The 4.0 release seems to handle this without signalling that error....
In a perfect world, you'd tell me that in the release version, this would just make things silently really really slow, even in the parsing pass :).
Hmm... well, have you tried reading in character values greater than FFFF? My guess is it simply didn't give you a warning last time.
Yes, that's the obvious next test.... okay, lemme try it.
Yep, right you are:
line 1:0 token recognition error at: '?'
line 1:4 token recognition error at: '?'
line 1:8 token recognition error at: '?'
line 1:12 token recognition error at: '?'
line 1:16 token recognition error at: '?'
Ok, this sounds like a feature request to allow 32-bit characters then.
The escape sequence \U
should trigger a compile error, but currently the ANTLR tool contains a line reading:
// An illegal escape seqeunce
//
// TODO: Issue error message
The escape sequence \uffff
is supported as of #267.
The JLS7 defines the use of Unicode escapes in topic 3.3. There it says, "Representing supplementary characters requires two consecutive Unicode escapes."
So shouldn't input code use two Unicode escapes -- as well as no uppercase U, as Sam pointed out? That would make it look something like \uD8CD\uDC34. Actually, you have to represent it as \uH\uL, choosing H and L so that (using base-16 arithmetic) 10000 + (H − D800) × 400 + (L − DC00) = your code point. So \uD801\uDC01 would represent the character U+010401 -- according to http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters
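The surrogate-pair formula above can be checked against the JDK, whose Character.toCodePoint implements the same arithmetic. A minimal sketch (the class name SurrogateMath is mine, not from this thread):

```java
// Checks the surrogate-pair formula quoted above:
//   codePoint = 0x10000 + (H - 0xD800) * 0x400 + (L - 0xDC00)
public class SurrogateMath {
    // Manual implementation of the formula from the comment.
    public static int toCodePoint(char high, char low) {
        return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00);
    }

    public static void main(String[] args) {
        // \uD801\uDC01 should decode to U+10401, matching the JDK's own decoder.
        System.out.println(Integer.toHexString(toCodePoint('\uD801', '\uDC01')));
        System.out.println(Integer.toHexString(Character.toCodePoint('\uD801', '\uDC01')));
        // Both print: 10401
    }
}
```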
I am beginning to flounder a little here, over my head in Unicode. For instance, I don't see how one is to distinguish between a single supplementary character represented as two Unicode escapes and two ordinary characters that just happen to have the corresponding single Unicode escapes. In fact, these two UTF-16 characters seem to be the UTF-16 representation of the single supplementary character. Maybe, somehow, the character ranges of real characters don't overlap, but I don't see it.
The other issue is that even after translating to the two UTF-16 characters, you are going to have to represent a range across these two-character representations of supplementary characters, e.g., '\uD812\uDC34' .. '\uD813\uDC56'
The code for the Java compiler's conversion of Unicode escapes seems to ignore any issues with these supplementary characters. See method convertUnicode() in langtools-ce654f4ecfd8\src\share\classes\com\sun\tools\javac\parser\Scanner.java
So one option would be to support the characters above \uFFFF with the simple method of the Java compiler and then add support for a supplementary-character range notation.
I'm not sure what you're trying to say there.
The following range:
'\uD812\uDC34'..'\uD813\uDC56'
Needs to be written like this in ANTLR:
( '\uD812' '\uDC34'..'\uDFFF'
| '\uD813' '\uDC00'..'\uDC56'
)
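The split above can also be computed mechanically: iterate over the high surrogates of the two endpoints, and clamp the low-surrogate range at the first and last alternative. A hedged sketch using the JDK's Character.highSurrogate/lowSurrogate (class and method names are mine, not ANTLR code):

```java
import java.util.ArrayList;
import java.util.List;

public class RangeSplit {
    // Splits a supplementary code point range into UTF-16 alternatives of the
    // form '\uHHHH' '\uLLLL'..'\uLLLL', mirroring the hand-written split above.
    public static List<String> split(int lo, int hi) {
        char hiSurrLo = Character.highSurrogate(lo);
        char hiSurrHi = Character.highSurrogate(hi);
        List<String> alts = new ArrayList<>();
        for (char h = hiSurrLo; h <= hiSurrHi; h++) {
            // Low surrogates span DC00..DFFF except at the range endpoints.
            char from = (h == hiSurrLo) ? Character.lowSurrogate(lo) : '\uDC00';
            char to   = (h == hiSurrHi) ? Character.lowSurrogate(hi) : '\uDFFF';
            alts.add(String.format("'\\u%04X' '\\u%04X'..'\\u%04X'",
                    (int) h, (int) from, (int) to));
        }
        return alts;
    }

    public static void main(String[] args) {
        // U+14834..U+14C56 is '\uD812\uDC34'..'\uD813\uDC56' in UTF-16.
        split(0x14834, 0x14C56).forEach(System.out::println);
        // Prints:
        // '\uD812' '\uDC34'..'\uDFFF'
        // '\uD813' '\uDC00'..'\uDC56'
    }
}
```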
I think the only work needed is:
1) a way to write character literals and ranges beyond U+FFFF in grammars, and
2) input streams which can take Unicode values as input, to understand values > U+FFFF.
For 1), we could use curly braces à la Swift and Hack:
\u{1FFFF}..\u{10FFFF}
For 2), we just need to use Java's built-in UTF-16 surrogate pair decoding functionality (https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#codePointAt-int- etc.).
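As a quick illustration of that built-in functionality: String.codePointAt transparently combines a stored surrogate pair into a single code point (a standalone sketch, not code from the linked branch):

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // The earth emoji U+1F30D is stored as the surrogate pair
        // \uD83C\uDF0D inside a Java String.
        String s = "\uD83C\uDF0D";
        System.out.println(s.length());                             // 2 UTF-16 code units
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 1f30d
        System.out.println(s.codePointCount(0, s.length()));        // 1 code point
    }
}
```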
Here's an initial prototype (Java-only) adding support for a new \u{10ABCD}
style Unicode literal escape for Unicode values > U+FFFF:
https://github.com/antlr/antlr4/compare/master...bhamiltoncx:unicode?expand=1
To support the new values, I had to change the ATN serializer/deserializer, so I introduced a new UUID.
I included a test, which passes with: cd tool-testsuite && mvn -Dtest=TestUnicodeGrammar test
I'm pretty sure this will break other tests, which I'll focus on next.
If folks like this approach, I can extend it to the other languages.
And here's what it looks like from the command line:
% cat Unicode.g4
grammar Unicode;
r : 'hello' WORLD;
WORLD : ('\u{1F30D}' | '\u{1F30E}' | '\u{1F30F}' );
WS : [ \t\r\n]+ -> skip;
% java -jar antlr4/tool/target/antlr4-4.6.1-SNAPSHOT-complete.jar Unicode.g4
% javac -cp antlr4/runtime/Java/target/antlr4-runtime-4.6.1-SNAPSHOT.jar Unicode*.java
% echo hello 🌍 | java -cp antlr4/tool/target/antlr4-4.6.1-SNAPSHOT-complete.jar:. org.antlr.v4.gui.TestRig Unicode r -tree -encoding UTF-8
(r hello 🌍)
% echo hello 🐱 | java -cp antlr4/tool/target/antlr4-4.6.1-SNAPSHOT-complete.jar:. org.antlr.v4.gui.TestRig Unicode r -tree -encoding UTF-8
line 1:6 token recognition error at: '🐱'
line 2:0 missing WORLD at '<EOF>'
(r hello <missing WORLD>)
Specific questions for folks on this issue: (IntervalSets ... but it seemed awkward to do the same for transition arguments, since they don't necessarily represent Unicode values).

Gang, it seems to me that a lot of code would have to change, because characters are no longer one character within the strings representing token text... I'm extremely leery of supporting U32.
@parrt: I hear you! Thankfully, the huge majority of the existing code simply uses IntervalSet, which happily supports integers larger than 0xFFFF.
(There were a few places which assumed 0xFFFF was the largest character value, but they were the minority).
I'm pretty confident I can make this work, I was just looking for general strategy tips. :)
My end goal is to create a general lexer and parser for Unicode sequences like emoji, which include sequences of Unicode values larger than 0xFFFF:
http://unicode.org/emoji/charts/emoji-zwj-sequences.html
Especially with the popularity of emoji, I think ANTLR is a great candidate to provide a universal library for general scanning and parsing of Unicode sequences.
If you think another parser library would be more appropriate, I'm happy to look elsewhere.
@bhamiltoncx well, I'm sure it could be made to work, but at what cost and complexity to the 99.9999999 percent of the people that will not be doing 32-bit Unicode? Java was a big step over C++ strings in that you could handle 16-bit Unicode without UTF in memory. That was a serious pain in the ass. I'd rather not add all of that pain to my Java implementation. Parsing Unicode sequences is so trivial I'm not sure why you even want a parser generator for that.
@parrt: I wish it was trivial to parse Unicode sequences! To implement a cross-platform Unicode UAX 29 grapheme cluster boundary parser is fairly tricky. Here's part of the standard:
http://unicode.org/reports/tr29/#Table_Combining_Char_Sequences_and_Grapheme_Clusters http://unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules
We've implemented 2 or 3 UAX 29 grapheme cluster boundary parsers so far, and manually porting them to other languages and distributing bug fixes has been extremely time-consuming. I figured a parser generator would be a better strategy than manually cross-porting bug fixes.
If I missed something about the above parsers being trivial, please let me know!
@parrt and I had a nice offline discussion about this.
tl;dr:
So, I went ahead and updated this tree with runtime implementations for Python2, Python3, JavaScript, C#, and Swift, and standardized both serialized sets as well as edge arguments to use 32-bit values instead of 16-bit values:
https://github.com/antlr/antlr4/compare/master...bhamiltoncx:unicode?expand=1
I'll continue adding lots more tests to cover cases like:
Hi Ben! Heh, been thinking about char[] vs. int[]. I think we'll need to keep char[] and make a new stream for int[]. I have many projects that load lots and lots of files and keep them around; e.g., imagine an IDE that must keep files around. I'd rather not double their memory footprint. Maybe ANTLRInputStream32?
Oh, and regarding serialized ATN size: I think I remember Java having some static string max size per class. Some big lexers have already hit it. I'll try to think more about the serialization.
Totally hear you on memory size.
I think the best way to minimize memory is to move away from a "load the entire input file into memory" model and towards a "stream input data as needed and keep a small buffer in memory" model.
That would go a long way towards fixing the issue long-term, assuming clients don't repeatedly parse the same data over and over again.
I'll come up with a proposal for that. Might be able to squeeze it into the existing ANTLRInputStream, or might have to start a new interface (since things like LA(int i) will have to be able to throw IOException).
Happy to make any fixes on the serialized ATN side. We can use pretty much any format we want for ranges and arguments. UTF-8 is pretty compact. :)
@bhamiltoncx You could probably just extend ANTLRInputStream to override LA and consume to combine code unit pairs into a single code point. For example, you can use Character.toCodePoint to combine surrogate pairs if the underlying input is a char[].
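The combining step described above would look roughly like this, written as a standalone function over a char[] rather than an actual ANTLRInputStream subclass (class and method names are mine):

```java
import java.util.ArrayList;
import java.util.List;

public class PairCombiner {
    // Walks a char[] of UTF-16 code units and emits one int per Unicode code
    // point, combining surrogate pairs with Character.toCodePoint.
    public static List<Integer> codePoints(char[] data) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        while (i < data.length) {
            char c = data[i];
            if (Character.isHighSurrogate(c) && i + 1 < data.length
                    && Character.isLowSurrogate(data[i + 1])) {
                result.add(Character.toCodePoint(c, data[i + 1]));
                i += 2;  // consumed a surrogate pair
            } else {
                result.add((int) c);  // BMP character (or unpaired surrogate)
                i += 1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "a<earth emoji>b" decodes to the code points 61, 1f30d, 62.
        for (int cp : codePoints("a\uD83C\uDF0Db".toCharArray())) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
```

A real stream would do this incrementally inside LA/consume rather than materializing a list, but the pairing logic is the same.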
Concerning the "all in memory" issue: I have, for example, a machine learning problem that consults the token texts all the time across multiple files. Keeping stuff in memory should really be the default, rather than simulating virtual memory by swapping streams in and out. It also makes it hard to port the runtime to different languages, etc...
@sharwell: Makes sense. One tricky bit is getting all the indices and offsets (for example, the int passed to LA(int i)) correct if we do that.
Currently, the Python, C#, and Swift runtimes treat all indices and offsets in terms of Unicode code points.
However, the Java and JavaScript runtimes treat indices and offsets in terms of UTF-16 code units.
@bhamiltoncx Take a look at JavaUnicodeInputStream, which implements the Java language's special support for treating \u1234 (6 distinct characters of a file on disk) as a single UTF-16 code unit from the perspective of a Java language lexer. I believe it not only accounts for the special cases surrounding LA(x) for x > 1, but also highly optimizes the essential case of LA(1) that's used by the lexer internals. Your processor would likely be much simpler than this one, since at most it is combining 2 adjacent code units with much less processing involved.
:memo: Also I believe the C# runtime operates on UTF-16 code units the same way as the Java runtime.
Cool.
For the best of all possible worlds, I can put together a proof of concept which uses MappedByteBuffer to allow arbitrary seeking without reading everything in eagerly.
If we require the caller to specify the units (bytes, UTF-16 code units, or Unicode code points) by which they wish to iterate, then clients can decide exactly which use-case they want, and we don't have any memory bloat.
OK, I reverted the changes to ANTLRInputStream and introduced a new CodePointCharStream along with a UTF8CodePointDecoder to optimize CPU and memory usage in the common case of UTF-8 input (so we can go directly from UTF-8 to code points without allocating a UTF-16 buffer in between).
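For reference, the core of going directly from UTF-8 bytes to code points looks roughly like this. This is a simplified sketch, not the actual UTF8CodePointDecoder from the branch; it assumes well-formed input and omits the validation (continuation bytes, overlong forms, truncation) a real decoder needs:

```java
public class Utf8Sketch {
    // Decodes the code point starting at offset in well-formed UTF-8.
    // The lead byte's high bits determine the sequence length (1 to 4 bytes);
    // continuation bytes each contribute their low 6 bits.
    public static int decodeAt(byte[] utf8, int offset) {
        int b0 = utf8[offset] & 0xFF;
        if (b0 < 0x80) return b0;                          // 1 byte: ASCII
        if (b0 < 0xE0) return ((b0 & 0x1F) << 6)           // 2 bytes
                | (utf8[offset + 1] & 0x3F);
        if (b0 < 0xF0) return ((b0 & 0x0F) << 12)          // 3 bytes
                | ((utf8[offset + 1] & 0x3F) << 6)
                | (utf8[offset + 2] & 0x3F);
        return ((b0 & 0x07) << 18)                         // 4 bytes: > U+FFFF
                | ((utf8[offset + 1] & 0x3F) << 12)
                | ((utf8[offset + 2] & 0x3F) << 6)
                | (utf8[offset + 3] & 0x3F);
    }

    public static void main(String[] args) {
        // U+1F30D (earth emoji) encodes as F0 9F 8C 8D in UTF-8.
        byte[] earth = {(byte) 0xF0, (byte) 0x9F, (byte) 0x8C, (byte) 0x8D};
        System.out.println(Integer.toHexString(decodeAt(earth, 0)));  // 1f30d
    }
}
```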
https://github.com/antlr/antlr4/compare/master...bhamiltoncx:unicode?expand=1
This can also be used with MappedByteBuffer, but my initial investigation shows this doesn't really save any CPU.
I also introduced a bunch more tests, but there's lots more to write.
If this looks decent, I'll split it up into a few pull requests.
Here's a quick summary of the state of ANTLR 4 runtime support for Unicode SMP values > U+FFFF when loading input to be parsed.
I found things are a bit inconsistent at the moment whether the runtimes treat input as a series of (16-bit) UTF-16 code units or as a series of (32-bit — actually only up to 21-bit) Unicode code points.
ANTLRInputStream.java: always returns UTF-16 code units
InputStream.js: always returns UTF-16 code units
ANTLRInputStream.cpp: always returns Unicode code points
InputStream.py: always returns Unicode code points
AntlrInputStream.cs: always returns UTF-16 code units
input_stream.go: always returns Unicode code points
ANTLRInputStream.swift: always returns Unicode code points

To keep backwards compatibility, I'll add options to Java, JavaScript, and C# to return Unicode code points, but I think before too long we'll want to make the default consistent across all the runtimes and always return Unicode code points.
If I'm reading this right, ANTLR does not currently support 32-bit characters in the lexer definition. If I want to use them, I need to decode to UTF-16 and write the ranges as follows:
'\uD812' '\uDC34'..'\uDFFF'
Is this correct? If so, are there plans to support entering arbitrary Unicode in lexer definitions?
Sorry if any of this has already been explained, I'm just double-checking to make sure that if I added anything I wouldn't be duplicating someone's work.
@kenzierocks : That's correct.
As a further note, that workaround will likely only function correctly with runtimes based on UTF-16 code units (Java, JavaScript, and C#). Runtimes based on Unicode code points (Python2, Python3, Swift, and C++) are probably not going to do the right thing with UTF-16 surrogate pairs.
I'm in the process of fixing this issue. The PRs I've attached to this issue allow full Unicode code points (either as raw UTF-8 in the grammar definition, or as new \u{10ABCD} style escapes), and provide alternatives to AntlrInputStream and friends which understand how to handle Unicode values > U+FFFF.
Sounds good. I'm only interested in the Java runtime right now, so I can workaround for now.
Huzzah!
The pasted file contains a definition for the unicode XID_START character set. Processing it with the current git master causes this error:
... and, here's the file Bad.g4. It's 430 lines, just in case my copy buffer lost part of it: