KvanTTT closed this issue 8 years ago
It's probably easier to just convert the \r to \n before you parse it.
On Wed, Jan 27, 2016 at 2:10 AM, Ivan Kochurkin notifications@github.com wrote:
I have the following grammar:
lexer grammar T;
ID: [a-z]+;
WS: [ \n] -> skip;
And the following input: "a" + newLine + "b" + newLine + "c"
If newLine is \r\n, I got this:
[@0,0:0='a',<1>,1:0]
[@1,3:3='b',<1>,2:0]
[@2,6:6='c',<1>,3:0]
[@3,7:6='<EOF>',<-1>,3:1]
newLine == \n:
[@0,0:0='a',<1>,1:0]
[@1,2:2='b',<1>,2:0]
[@2,4:4='c',<1>,3:0]
[@3,5:4='<EOF>',<-1>,3:1]
newLine == \r:
[@0,0:0='a',<1>,1:0]
[@1,2:2='b',<1>,1:2]
[@2,4:4='c',<1>,1:4]
[@3,5:4='<EOF>',<-1>,1:5]
In the last case the line number for all tokens is 1, which is wrong. I suggest supporting '\r' as a newline character. Other variants are described at Wikipedia https://en.wikipedia.org/wiki/Newline#Representations, but they are too rarely used.
— Reply to this email directly or view it on GitHub https://github.com/antlr/antlr4/issues/1098.
I think it's not a very good idea. In this case \r\n will be treated as two lines, whereas it's only one line. Moreover, it's better to leave the input stream as is.
There is no universally applicable solution for this: you will always find an exception case, and you start to add little bits of overhead trying to cater for everyone. In the C runtime I just looked for a newline and ignored CR, but made it so that you could change the triggering character at runtime. That's about the best you can do generically, I think. Checking for a following \n after a \r just adds overhead, and where do you stop?
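To make the "triggering character" idea concrete, here is a minimal Python sketch of a position tracker whose newline character can be changed. It is not the C runtime's code; `LineTracker` and `positions` are names made up for this illustration. With the default trigger it gives the same line:column values as the report for the letters a, b and c.

```python
class LineTracker:
    """Tracks a 1-based line and 0-based column, bumping the line on one trigger char."""

    def __init__(self, newline_char="\n"):
        self.newline_char = newline_char  # configurable trigger (assumption: a single char)
        self.line = 1
        self.column = 0

    def consume(self, ch):
        # Only the trigger character starts a new line; everything else moves the column.
        if ch == self.newline_char:
            self.line += 1
            self.column = 0
        else:
            self.column += 1


def positions(text, newline_char="\n"):
    """Return the (line, column) at which each character of text starts."""
    tracker = LineTracker(newline_char)
    result = []
    for ch in text:
        result.append((tracker.line, tracker.column))
        tracker.consume(ch)
    return result


text = "a\rb\rc"
letters = [0, 2, 4]  # offsets of 'a', 'b', 'c'
print([positions(text)[i] for i in letters])        # [(1, 0), (1, 2), (1, 4)]
print([positions(text, "\r")[i] for i in letters])  # [(1, 0), (2, 0), (3, 0)]
```

The limitation is the one raised above: with a single trigger character, a \r\n file counted with "\r" as the trigger still works, but a file mixing conventions does not.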
You could add your own code to supply tokens and change the line numbers based on your own tracking, but I would just change the \r to \n on the fly. You could do that with your own input stream for instance.
Jim
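One way to do the "change \r to \n" step without turning \r\n into two line breaks is to rewrite only a \r that is not followed by \n. Below is a minimal Python sketch, independent of any ANTLR runtime; the helper name `normalize_lone_cr` is hypothetical.

```python
import re

# A carriage return that is NOT followed by a line feed, i.e. a lone "\r".
# "\r\n" pairs are left untouched.
_LONE_CR = re.compile(r"\r(?!\n)")


def normalize_lone_cr(text: str) -> str:
    """Replace each lone "\\r" with "\\n", keeping "\\r\\n" pairs intact."""
    return _LONE_CR.sub("\n", text)


for sample in ("a\rb\rc", "a\r\nb\r\nc", "a\nb\nc"):
    fixed = normalize_lone_cr(sample)
    # One character is swapped for one character, so the length never changes.
    assert len(fixed) == len(sample)
    print(repr(sample), "->", repr(fixed))
```

Because the replacement is character for character, token start and stop offsets computed on the normalized text still index the original input. The normalized string would then be handed to whatever character stream the target runtime provides.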
Hi,
Historically, ‘\r’ is not a new line but a carriage return, hence the ‘\r\n’ sequence. So I'm not sure where you take your input from, but you might be better off fixing the input.
Eric
'\r' alone was pre-OSX mac :)
historically, ‘\r’ is not a new line, but rather a carriage return, hence the ‘\r\n’ sequence.
Not completely true.
From Wiki: Commodore 8-bit machines (C64, C128), Acorn BBC, ZX Spectrum, TRS-80, Apple II series, Oberon, the classic Mac OS, MIT Lisp Machine and OS-9 use \r as the newline separator. There are also rarer line separators such as \036 and \025, which I have never seen.
But it looks like many users run into this problem with the \r separator. Also, all text editors and IDEs that I've tested support this line separator (Notepad++, VS Code, IntelliJ, even Windows Notepad). Thus, it's not a very rare line separator.
I suggest supporting it especially since the fix is quite simple.
I am looking into this situation, and wondering if you could use a custom ICharStream implementation that deals appropriately with this (if configured to deal with \r-only files, for instance). But my knowledge of ICharStream is quite limited. Any ideas or suggestions on this?
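This is not an ICharStream implementation. An alternative that leaves the input stream untouched is to recompute line and column from each token's start offset. The Python sketch below treats \n, \r and \r\n each as a single line terminator; `line_starts` and `line_and_column` are hypothetical helper names, and wiring the result back into tokens depends on the runtime in use.

```python
from bisect import bisect_right


def line_starts(text):
    """Return the offsets at which each line begins ("\\r\\n" counts as one terminator)."""
    starts = [0]
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == "\r":
            # Swallow the "\n" of a "\r\n" pair so the pair ends exactly one line.
            if i + 1 < n and text[i + 1] == "\n":
                i += 1
            starts.append(i + 1)
        elif ch == "\n":
            starts.append(i + 1)
        i += 1
    return starts


def line_and_column(starts, offset):
    """Map a character offset to a (1-based line, 0-based column) pair."""
    idx = bisect_right(starts, offset) - 1
    return idx + 1, offset - starts[idx]


for text in ("a\rb\rc", "a\r\nb\r\nc", "a\nb\nc"):
    starts = line_starts(text)
    # Look up the three letters by their offsets in the original text.
    print([line_and_column(starts, text.index(c)) for c in "abc"])
```

For all three inputs the letters come out on lines 1, 2 and 3 at column 0, which is the numbering the issue asks for.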
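I have the following grammar:
lexer grammar T;
ID: [a-z]+;
WS: [ \n] -> skip;
And the following input:
"a" + newLine + "b" + newLine + "c"
newLine == \r\n:
[@0,0:0='a',<1>,1:0]
[@1,3:3='b',<1>,2:0]
[@2,6:6='c',<1>,3:0]
[@3,7:6='<EOF>',<-1>,3:1]
newLine == \n:
[@0,0:0='a',<1>,1:0]
[@1,2:2='b',<1>,2:0]
[@2,4:4='c',<1>,3:0]
[@3,5:4='<EOF>',<-1>,3:1]
newLine == \r:
[@0,0:0='a',<1>,1:0]
[@1,2:2='b',<1>,1:2]
[@2,4:4='c',<1>,1:4]
[@3,5:4='<EOF>',<-1>,1:5]
In the last case the line number for all tokens is 1, which is wrong. I suggest supporting '\r' as a newline character. Other variants are described at Wikipedia, but they are too rarely used. Moreover, the '\r' newline is supported by most modern editors (IDEA, Visual Studio, Notepad++ etc.).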