antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

Java UnbufferedCharStream and UTF-8 vs UTF-16 #1899

Open BurtHarris opened 7 years ago

BurtHarris commented 7 years ago

I was intrigued by this in documentation re UnbufferedCharStream:

Do not buffer up the entire char stream. It does keep a small buffer for efficiency and also buffers while a mark exists (set by the lookahead prediction in parser). "Unbuffered" here refers to fact that it doesn't buffer all data, not that's it's on demand loading of char. Before 4.7, this class used the default environment encoding to convert bytes to UTF-16, and held the UTF-16 bytes in the buffer as chars. As of 4.7, the class uses UTF-8 by default, and the buffer holds Unicode code points in the buffer as ints.

As I look into it (with a mind to porting to the antlr4ts target), I recognize the docs are wrong. There's nothing to do with UTF-8 encoding in this code (other than the one doc comment!).

Everything this class is doing seems to be associated with UTF-16, and perhaps the clarification that should be made in the docs is that 4.7 attempts to fully implement UTF-16 (including surrogate pairs), whereas the previous version really only dealt with UCS-2 (which lacks surrogate pairs).
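To make the UCS-2 versus UTF-16 distinction concrete, here is a minimal sketch in plain Java. It is not ANTLR's actual code; the class name `SurrogateDemo` and the method `toCodePoints` are hypothetical and only illustrate the idea. A UCS-2 style reader treats every 16-bit `char` as one character, while a full UTF-16 reader combines a high/low surrogate pair into a single code point stored as an `int`.

```java
import java.util.Arrays;

// Hypothetical illustration only; not taken from the ANTLR runtime.
final class SurrogateDemo {
    // Combine UTF-16 code units into Unicode code points (ints).
    static int[] toCodePoints(char[] units) {
        int[] out = new int[units.length];
        int n = 0;
        for (int i = 0; i < units.length; i++) {
            char c = units[i];
            if (Character.isHighSurrogate(c)
                    && i + 1 < units.length
                    && Character.isLowSurrogate(units[i + 1])) {
                // A valid surrogate pair becomes one supplementary code point.
                out[n++] = Character.toCodePoint(c, units[i + 1]);
                i++; // skip the low surrogate we just consumed
            } else {
                // BMP character (or an unpaired surrogate, passed through as-is).
                out[n++] = c;
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which UTF-16 encodes as the pair D83D DE00.
        char[] units = "a\uD83D\uDE00".toCharArray();
        int[] codePoints = toCodePoints(units);
        System.out.println(units.length + " UTF-16 code units");   // prints "3 UTF-16 code units"
        System.out.println(codePoints.length + " code points");    // prints "2 code points"
        System.out.println(Integer.toHexString(codePoints[1]));    // prints "1f600"
    }
}
```

A UCS-2 view of the same input would report three characters, two of which are meaningless on their own. None of this involves UTF-8, which is a byte-level encoding handled (if at all) before the data reaches a `char`-based reader.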

BurtHarris commented 7 years ago

In the end this is probably just a doc bug, unless you really intended to implement UTF-8.