antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License
17.11k stars 3.28k forks

Incorrect Lexer Generation in Dart #3894

Open Dokotela opened 2 years ago

Dokotela commented 2 years ago

I have an issue where I seem to get the incorrect tokens for one of my inputs when I'm using antlr4 in Dart.

```dart
import 'package:antlr4/antlr4.dart';

import 'fhirpathLexer.dart';
import 'fhirpathParser.dart';

void main() {
  final input = InputStream.fromString("Patient.text.div");
  final lexer = fhirpathLexer(input);
  // Uncomment to dump the raw token stream produced by the lexer.
  // print(lexer.allTokens);
  final tokens = CommonTokenStream(lexer);
  final parser = fhirpathParser(tokens);
  parser.buildParseTree = true;
  final tree = parser.expression();
}
```

You will see that it gives the following error message: 

```
line 1:13 mismatched input 'div' expecting {'is', 'as', 'in', 'contains', '$this', '$index', '$total', IDENTIFIER, DELIMITEDIDENTIFIER}
line 1:16 mismatched input '<EOF>' expecting {'+', '-', 'is', 'as', 'in', 'contains', '(', '{', 'true', 'false', '%', '$this', '$index', '$total', DATE, DATETIME, TIME, IDENTIFIER, DELIMITEDIDENTIFIER, STRING, NUMBER}
```

If I uncomment the `lexer.allTokens` line in the function, the output shows where the problem is:

```
[[@-1,0:6='Patient',<58>,1:0],
 [@-1,7:7='.',<1>,1:7],
 [@-1,8:11='text',<58>,1:8],
 [@-1,12:12='.',<1>,1:12],
 [@-1,13:15='div',<8>,1:13]]
```

So instead of interpreting ```'div'``` as an ```IDENTIFIER <58>```, like it does with ```'Patient'``` and ```'text'```, it interprets it as the ```'div'``` token ```<8>``` used in ```#multiplicativeExpression```. Which it can be, but not in this case. I can understand how it would make this error; the first part of the grammar is:

```antlr4
expression
        : term                                                      #termExpression
        | expression '.' invocation                                 #invocationExpression
        | expression '[' expression ']'                             #indexerExpression
        | ('+' | '-') expression                                    #polarityExpression
        | expression ('*' | '/' | 'div' | 'mod') expression         #multiplicativeExpression
```

I have simplified the above grammar as much as I can while still reproducing the bug. Here it is:
```antlr4
grammar fhirpath;

expression
        : term                                                      #termExpression
        | expression '.' invocation                                 #invocationExpression
        | expression ('*' | '/' | 'div' | 'mod') expression         #multiplicativeExpression
        ;

term
        : invocation                                            #invocationTerm
        | literal                                               #literalTerm
        ;

literal
        : STRING                                                #stringLiteral
        ;

invocation                          // Terms that can be used after the function/member invocation '.'
        : identifier                                            #memberInvocation
        ;

identifier
        : IDENTIFIER
        ;

IDENTIFIER
        : ([A-Za-z] | '_')([A-Za-z0-9] | '_')*            // Added _ to support CQL (FHIR could constrain it out)
        ;

STRING
        : '\'' (ESC | .)*? '\''
        ;

fragment ESC
        : '\\' ([`'\\/fnrt] | UNICODE)    // allow \`, \', \\, \/, \f, etc. and \uXXX
        ;

fragment UNICODE
        : 'u' HEX HEX HEX HEX
        ;

fragment HEX
        : [0-9a-fA-F]
        ;
```

Thank you for any help or suggestions.

ericvergnaud commented 2 years ago

Hi,

It seems that you are expecting antlr4 to produce different tokens depending on the context. This is not supported by antlr4: tokens are produced before being submitted to parsing rules.

I strongly suggest that you split your grammar into lexer and parser rules, so that you can control the precedence of token production. And since 'div' is both a token and a valid identifier, you probably need a grammar rule as follows:

`identifier: IDENTIFIER | DIV;`

On 22 Sept 2022, at 05:00, Grey Faulkenberry, MD MPH wrote:

> I have an issue where I seem to get the incorrect tokens for one of my inputs when I'm using antlr4 in Dart.
>
> To start, I have installed ANTLR Parser Generator Version 4.11.1. I'm using the following grammar: http://hl7.org/fhirpath/N1/fhirpath.g4
>
> I generate the files using the following command: `java org.antlr.v4.Tool -Dlanguage=Dart -no-listener -visitor fhirpath.g4`
>
> All of the files seem to generate correctly, and in general I thought they were working well. But then when I use the input String "Patient.text.div" I get an error.

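The suggestion to add `DIV` to the `identifier` rule could look roughly like the following when applied to the simplified grammar from the original post. This is only a sketch under some assumptions: the token name `DIV` comes from the reply, while `MOD` is a hypothetical name added here on the guess that `'mod'` would hit the same problem. It also keeps a single combined grammar with named keyword tokens rather than splitting into separate lexer and parser grammars, which is a simplification of what was suggested.

```antlr4
grammar fhirpath;

expression
        : term                                                  #termExpression
        | expression '.' invocation                             #invocationExpression
        | expression ('*' | '/' | DIV | MOD) expression         #multiplicativeExpression
        ;

term
        : invocation                                            #invocationTerm
        | literal                                               #literalTerm
        ;

literal
        : STRING                                                #stringLiteral
        ;

invocation
        : identifier                                            #memberInvocation
        ;

// Keyword tokens are also accepted wherever an identifier is allowed,
// so the parser (not the lexer) decides what 'div' means in context.
identifier
        : IDENTIFIER
        | DIV
        | MOD           // hypothetical: assumes 'mod' needs the same treatment
        ;

// Named keyword tokens. They are declared before IDENTIFIER so that,
// when both rules match the same text, the keyword rule wins.
DIV     : 'div' ;
MOD     : 'mod' ;

IDENTIFIER
        : ([A-Za-z] | '_')([A-Za-z0-9] | '_')*
        ;

STRING
        : '\'' (ESC | .)*? '\''
        ;

fragment ESC
        : '\\' ([`'\\/fnrt] | UNICODE)
        ;

fragment UNICODE
        : 'u' HEX HEX HEX HEX
        ;

fragment HEX
        : [0-9a-fA-F]
        ;
```

With a grammar along these lines, the lexer would still emit a keyword token for `div` in `Patient.text.div`, but the parser should accept it after the `.` through the `identifier` rule. The expected-token set in the error message (`'is'`, `'as'`, `'in'`, `'contains'`, ...) suggests the full FHIRPath grammar may already handle some keywords this way, so extending that rule there might be enough.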