ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
I was intrigued by this in the documentation for UnbufferedCharStream:
Do not buffer up the entire char stream. It does keep a small buffer for efficiency and also buffers while a mark exists (set by the lookahead prediction in parser). "Unbuffered" here refers to fact that it doesn't buffer all data, not that's it's on demand loading of char. Before 4.7, this class used the default environment encoding to convert bytes to UTF-16, and held the UTF-16 bytes in the buffer as chars. As of 4.7, the class uses UTF-8 by default, and the buffer holds Unicode code points in the buffer as ints.
As I look into it (with a mind to porting to the antlr4ts target), I recognize the docs are wrong. This code has nothing to do with UTF-8 encoding (other than the one doc comment!).
Everything this class is doing seems to be associated with UTF-16. Perhaps the clarification that should be made in the docs is that 4.7 is an attempt to fully implement UTF-16 (including surrogate pairs), whereas the previous version really only dealt with UCS-2 (which lacks surrogate pairs).
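To make the UCS-2 vs. UTF-16 distinction concrete, here is a minimal TypeScript sketch of what "fully implementing UTF-16" would mean for a buffer that holds code points as ints: a high surrogate followed by a low surrogate is combined into one code point, and every other code unit is passed through unchanged. The function name `toCodePoints` is hypothetical and is not taken from ANTLR or antlr4ts; this only illustrates the idea, not how the class is actually written.

```typescript
// Hypothetical helper for illustration only; not part of ANTLR or antlr4ts.
// Converts UTF-16 code units into Unicode code points, combining
// surrogate pairs. A UCS-2 style reader would skip the pairing step and
// treat each 16-bit unit as its own character.
function toCodePoints(units: number[]): number[] {
  const out: number[] = [];
  for (let i = 0; i < units.length; i++) {
    const hi = units[i];
    const isHigh = hi >= 0xd800 && hi <= 0xdbff;
    if (isHigh && i + 1 < units.length) {
      const lo = units[i + 1];
      const isLow = lo >= 0xdc00 && lo <= 0xdfff;
      if (isLow) {
        // Standard UTF-16 decoding of a surrogate pair.
        out.push(((hi - 0xd800) << 10) + (lo - 0xdc00) + 0x10000);
        i++; // the low surrogate has been consumed
        continue;
      }
    }
    // BMP character, or an unpaired surrogate passed through as-is.
    out.push(hi);
  }
  return out;
}
```

For example, the two code units 0xD83D 0xDE00 come out as the single code point 0x1F600, while a plain "A" (0x41) is unchanged. How unpaired surrogates should be handled is a design choice; this sketch simply passes them through.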