Incorrect tokens line number for files with '\r' newline.

antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.

http://antlr.org

BSD 3-Clause "New" or "Revised" License

Incorrect tokens line number for files with '\r' newline. #1098

Closed KvanTTT closed 8 years ago

KvanTTT commented 8 years ago

I have the following grammar:

lexer grammar T;
ID: [a-z]+;
WS: [ \n] -> skip;

And the following input: "a" + newLine + "b" + newLine + "c"

If newLine is \r\n, I got this:

[@0,0:0='a',<1>,1:0]
[@1,3:3='b',<1>,2:0]
[@2,6:6='c',<1>,3:0]
[@3,7:6='<EOF>',<-1>,3:1]

newLine == \n:

[@0,0:0='a',<1>,1:0]
[@1,2:2='b',<1>,2:0]
[@2,4:4='c',<1>,3:0]
[@3,5:4='<EOF>',<-1>,3:1]

newLine == \r:

[@0,0:0='a',<1>,1:0]
[@1,2:2='b',<1>,1:2]
[@2,4:4='c',<1>,1:4]
[@3,5:4='<EOF>',<-1>,1:5]

In the last case, the line number for all tokens is 1, which is wrong. I suggest supporting '\r' as a newline character. Other variants are described on Wikipedia, but they are rarely used. Moreover, the '\r' newline is supported by most modern editors (IDEA, Visual Studio, Notepad++, etc.).
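The three outputs above are consistent with a lexer that advances its line counter only on '\n'. The following is a minimal model of that behaviour, not ANTLR's actual code; the function name `lexer_positions` is hypothetical, and the rule "only '\n' starts a new line" is an assumption inferred from the reported token positions.

```python
def lexer_positions(text):
    # Return the (line, column) pair for every character offset in text.
    # Assumption: only '\n' advances the line counter; '\r' is an ordinary character.
    positions = []
    line, col = 1, 0
    for ch in text:
        positions.append((line, col))
        if ch == "\n":
            line += 1
            col = 0
        else:
            col += 1
    return positions


for newline in ("\r\n", "\n", "\r"):
    text = "a" + newline + "b" + newline + "c"
    pos = lexer_positions(text)
    # Keep only the characters that the ID rule would turn into tokens.
    tokens = [(ch, pos[i]) for i, ch in enumerate(text) if ch.isalpha()]
    print(repr(newline), tokens)
```

Under this model the '\r'-only input places 'b' and 'c' at 1:2 and 1:4, the same positions as in the report.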

jimidle commented 8 years ago

It's probably easier to just convert the \r to \n before you parse it.

KvanTTT commented 8 years ago

I think that's not a very good idea. In that case \r\n would be treated as two lines, whereas it's only one. Moreover, it's better to leave the input stream as is.

jimidle commented 8 years ago

There is no universally applicable solution for this - you will always find an exception case and you start to add little bits of overhead trying to cater for everyone. In the C runtime I just looked for a newline and ignored CR, but made it so that you could change the triggering character at runtime. That's about the best you can do generically, I think. Checking for a following \n after a \r all adds overhead, and where do you stop?

You could add your own code to supply tokens and change the line numbers based on your own tracking, but I would just change the \r to \n on the fly. You could do that with your own input stream for instance.

Jim
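One way to convert on the fly without turning \r\n into two line breaks is to replace only a '\r' that is not followed by '\n'. This is a minimal sketch under that assumption; `normalize_lone_cr` and `LONE_CR` are hypothetical names, and the code is independent of any ANTLR runtime.

```python
import re

# A '\r' that is NOT immediately followed by '\n' (negative lookahead).
LONE_CR = re.compile(r"\r(?!\n)")


def normalize_lone_cr(text):
    # Replace each lone '\r' with '\n'; '\r\n' pairs are left untouched.
    # The replacement is one character for one character, so offsets are preserved.
    return LONE_CR.sub("\n", text)


normalized = normalize_lone_cr("a\rb\r\nc")
print(repr(normalized))
```

Because the string length does not change, token start and stop indexes still refer to the same offsets in the original input. The normalized string would then be passed to whichever input stream the runtime in use provides.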

ericvergnaud commented 8 years ago

Hi,

Historically, ‘\r’ is not a new line but rather a carriage return, hence the ‘\r\n’ sequence. I'm not sure where you take your input from, but you might be better off fixing the input.

Eric

parrt commented 8 years ago

'\r' alone was pre-OSX mac :)

KvanTTT commented 2 years ago

historically, ‘\r’ is not a new line, but rather a carriage return, hence the ‘\r\n’ sequence.

Not completely true.

From Wikipedia: Commodore 8-bit machines (C64, C128), Acorn BBC, ZX Spectrum, TRS-80, Apple II series, Oberon, the classic Mac OS, MIT Lisp Machine and OS-9 use \r as the line separator. There are also rarer line separators such as \036 and \025, which I have never seen.

But it looks like many users run into the problem with the \r separator. Also, all the text editors and IDEs that I've tested support this line separator (Notepad++, VS Code, IntelliJ, even Windows Notepad). Thus, it's not a very rare line separator.

I suggest supporting it, especially since the fix is quite simple.
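As an illustration of how small such a change could be, here is a sketch of position tracking that treats \n, \r\n and a lone \r each as a single line break. It is not the ANTLR implementation and not a proposed patch; `track_positions` is a hypothetical name, and the one-character lookahead is the extra cost mentioned earlier in the thread.

```python
def track_positions(text):
    # Return the (line, column) pair for every character offset in text.
    # A line break is '\n', or a '\r' that is not followed by '\n'.
    positions = []
    line, col = 1, 0
    for i, ch in enumerate(text):
        positions.append((line, col))
        # Slicing avoids an IndexError when '\r' is the last character.
        is_lone_cr = ch == "\r" and text[i + 1 : i + 2] != "\n"
        if ch == "\n" or is_lone_cr:
            line += 1
            col = 0
        else:
            col += 1
    return positions


for newline in ("\r\n", "\n", "\r"):
    text = "a" + newline + "b" + newline + "c"
    pos = track_positions(text)
    print(repr(newline), [pos[i] for i, ch in enumerate(text) if ch.isalpha()])
```

With this rule, 'a', 'b' and 'c' land on lines 1, 2 and 3 at column 0 for all three newline conventions.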

cxambs commented 1 year ago

I am looking into this situation and wondering whether a custom ICharStream implementation could handle this appropriately (if configured to deal with \r-only files, for instance). But my knowledge of ICharStream is quite limited. Any ideas or suggestions?