antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

Core dump or infinite loop #4073

Open Korporal opened 1 year ago

Korporal commented 1 year ago

Windows 11.

The code was generated and compiled fine, but the test tool crashed on very simple input, and repeated attempts led to a hang: a CPU-bound loop that loads most, if not all, cores.

Killing the app stops all CPU use, and the machine returns to its normal, almost totally idle state. I can't see any obvious reason why the first attempt crashed while every subsequent attempt hangs. I cannot get it to crash again; it only hangs now.

This was the crude g4:

grammar striterals;

literal: QUOTE TEXT QUOTE ;

QUOTE: '"' ;
TEXT:  [a-z]*;

This was the input test file: (2 lines, last one empty)

"kahskjahska"

Removing that last empty line "fixes" it: the tool runs fine and parses the text.

Note, however, that even with the last empty line removed, placing my edit cursor in the middle of the text and splitting that line also leads to a hang.

I suspect this will be easy to reproduce.

kaby76 commented 1 year ago

Yes, I can reproduce it with the GUI tool.

01/13-22:26:54 ~/crude/Generated-Java
$ java -cp "c:/Users/Kenne/Downloads/antlr4-4.11.1-complete.jar;." org.antlr.v4.gui.TestRig crude literal -gui i1
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at org.antlr.v4.runtime.CommonTokenFactory.create(CommonTokenFactory.java:70)
        at org.antlr.v4.runtime.CommonTokenFactory.create(CommonTokenFactory.java:16)
        at org.antlr.v4.runtime.Lexer.emit(Lexer.java:245)
        at org.antlr.v4.runtime.Lexer.nextToken(Lexer.java:156)
        at org.antlr.v4.runtime.BufferedTokenStream.fetch(BufferedTokenStream.java:169)
        at org.antlr.v4.runtime.BufferedTokenStream.fill(BufferedTokenStream.java:485)
        at org.antlr.v4.gui.TestRig.process(TestRig.java:174)
        at org.antlr.v4.gui.TestRig.process(TestRig.java:166)
        at org.antlr.v4.gui.TestRig.main(TestRig.java:119)

However, the ANTLR tool does warn that something is wrong with the grammar:

01/13-22:21:09 ~/crude
$ trgen -s literal -t Java
CSharp  crude.g4 success 0.0463683
Rendering template file from Java/build.ps1 to Generated-Java/build.ps1
Rendering template file from Java/ErrorListener.java to Generated-Java/ErrorListener.java
Rendering template file from Java/makefile to Generated-Java/makefile
Rendering template file from Java/Test.java to Generated-Java/Test.java
Rendering template file from Java/test.ps1 to Generated-Java/test.ps1
Rendering template file from Java/test.sh to Generated-Java/test.sh
Copying source file from C:/msys64/home/Kenne/crude/crude.g4 to Generated-Java/crude.g4
01/13-22:21:21 ~/crude
$ cd Generated-Java/
01/13-22:21:23 ~/crude/Generated-Java
$ make
java -jar C:/Users/Kenne/.m2/antlr4-4.11.1-complete.jar -encoding utf-8  crude.g4
warning(146): crude.g4:4:0: non-fragment lexer rule TEXT can match the empty string
javac -cp C:/Users/Kenne/.m2/antlr4-4.11.1-complete.jar\;. crudeParser.java
javac -cp C:/Users/Kenne/.m2/antlr4-4.11.1-complete.jar\;. crudeLexer.java
javac -cp C:/Users/Kenne/.m2/antlr4-4.11.1-complete.jar\;. Test.java
01/13-22:21:27 ~/crude/Generated-Java
$ cat crude.g4
grammar crude;
literal: QUOTE TEXT QUOTE ;
QUOTE: '"' ;
TEXT:  [a-z]*;
01/13-22:21:49 ~/crude/Generated-Java

You never want a lexer rule to match the empty string. This should be a fatal error rather than a warning. It is best to run the tool with -Werror so that warnings are treated as errors: java -jar C:/Users/Kenne/.m2/antlr4-4.11.1-complete.jar -encoding utf-8 -Werror crude.g4.
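
To illustrate why the trailing newline matters, here is a minimal sketch of a longest-match lexer loop. It is a toy, not the ANTLR runtime: the class EmptyMatchDemo, its methods, and the maxTokens cap are all made up for this illustration. It assumes the lexer picks the longest match among the rules and stops at end of input. At the newline neither rule consumes a character, but TEXT still matches with length 0, so the loop emits empty tokens without advancing, which would be consistent with the OutOfMemoryError in the stack trace above.

```java
import java.util.ArrayList;
import java.util.List;

// Toy longest-match lexer, NOT the ANTLR runtime. It only mimics the rules
// QUOTE: '"' and TEXT: [a-z]* to show why an empty match makes no progress.
class EmptyMatchDemo {
    // Length of the longest match at pos. TEXT is [a-z]*, so it always
    // matches, possibly with length 0; the result is therefore never negative.
    static int longestMatch(String input, int pos) {
        int quoteLen = input.charAt(pos) == '"' ? 1 : -1;
        int textLen = 0;
        while (pos + textLen < input.length()
                && input.charAt(pos + textLen) >= 'a'
                && input.charAt(pos + textLen) <= 'z') {
            textLen++;
        }
        return Math.max(quoteLen, textLen);
    }

    // Tokenizes input. maxTokens is a hypothetical safety cap so that the
    // demo terminates; without it the loop would spin forever.
    static List<String> tokenize(String input, int maxTokens) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length() && tokens.size() < maxTokens) {
            int len = longestMatch(input, pos);
            tokens.add(input.substring(pos, pos + len)); // "" when len == 0
            pos += len;                                  // len == 0: stuck here
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("\"abc\"", 1000).size());   // 3
        System.out.println(tokenize("\"abc\"\n", 1000).size()); // 1000, the cap
    }
}
```

In this sketch, changing the TEXT loop to require at least one character (the equivalent of [a-z]+ in the grammar) would leave the newline unmatched rather than matched with zero length.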

Korporal commented 1 year ago

@kaby76 - Yes, I agree the grammar was broken. Still, the tool should never crash, no matter what the input is; I assume that's a policy for these tools? So this seems to be a bug: minor, of course, but a bug.

kaby76 commented 1 year ago

It seems this should be a fatal error in the tool. That said, one can sometimes create a zero-length token and emit it from a base-class lexer. The semantics of zero-length tokens, however, are not defined.
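
As a rough picture of what failing fast could look like at run time, here is a hypothetical sketch. It is not ANTLR code and not a proposed patch: ProgressGuardDemo and its methods are invented for illustration, and it assumes zero-length tokens are never wanted, which, as noted, is not always the case. The idea is simply to refuse any match that does not advance the input.

```java
// Hypothetical sketch, not ANTLR code: a toy lexer for QUOTE: '"' and
// TEXT: [a-z]* that throws instead of emitting a zero-length token.
class ProgressGuardDemo {
    // Longest match of either rule at pos; 0 means only the empty TEXT matched.
    static int longestMatch(String input, int pos) {
        if (input.charAt(pos) == '"') {
            return 1;
        }
        int len = 0;
        while (pos + len < input.length()
                && input.charAt(pos + len) >= 'a'
                && input.charAt(pos + len) <= 'z') {
            len++;
        }
        return len;
    }

    // Counts tokens, treating a match that consumes nothing as a fatal error.
    static int countTokens(String input) {
        int count = 0;
        int pos = 0;
        while (pos < input.length()) {
            int len = longestMatch(input, pos);
            if (len == 0) {
                throw new IllegalStateException(
                        "zero-length token at offset " + pos);
            }
            pos += len;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countTokens("\"abc\"")); // 3
        try {
            countTokens("\"abc\"\n");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // zero-length token at offset 5
        }
    }
}
```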