antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

Lexing Issue in ANTLR4 Grammar for Fortran 2018: Token Misclassification #4640

Open AkhilAkkapelli opened 5 months ago

AkhilAkkapelli commented 5 months ago

I am developing a Fortran 2018 grammar in ANTLR4 using the ISO standard. I am encountering an issue during the lexing phase with some of the lexer rules. Specifically, certain keywords are being misclassified. Below is the minimal grammar demonstrating the problem:

Grammar: FortranTestF18.g4

grammar FortranTestF18;

//LEXER RULES

LINE_COMMENT : '!' .*? '\r'? '\n' -> skip ;

BLOCK_COMMENT: '/*' .*? '*/' -> skip;

WS: [ \t\r\n]+ -> skip;

PROGRAM: 'PROGRAM' | 'Program' | 'program';

END: 'END' | 'End' | 'end';

COMMA: ',';

LPAREN: '(';

RPAREN: ')';

ASTERIK: '*';

NONE: 'NONE' | 'None' | 'none';

IMPLICIT: 'IMPLICIT' | 'Implicit' | 'implicit';

FORMAT: 'FORMAT' | 'Format' | 'format';

PLUS: '+';

// R765 binary-constant -> B ' digit [digit]... ' | B " digit [digit]... "
BINARYCONSTANT: B APOSTROPHE DIGIT+ APOSTROPHE | B QUOTE DIGIT+ QUOTE;

// R766 octal-constant -> O ' digit [digit]... ' | O " digit [digit]... "
OCTALCONSTANT: O APOSTROPHE DIGIT+ APOSTROPHE | O QUOTE DIGIT+ QUOTE;

//R0003 RepChar
APOSTROPHEREPCHAR: APOSTROPHE (~[\u0000-\u001F\u0027])*  APOSTROPHE;

QUOTEREPCHAR: QUOTE (~[\u0000-\u001F\u0022])*  QUOTE;

APOSTROPHE: '\'';

QUOTE: '"';

DOT: '.';

C: 'C';

// R603 name -> letter [alphanumeric-character]...
NAME: LETTER (ALPHANUMERICCHARACTER)*;

// R711 digit-string -> digit [digit]...
DIGITSTRING: DIGIT+; 

MINUS: '-';

B: 'B';

O: 'O';

Z: 'Z';

A: 'A';

F: 'F';

D: 'D';

E: 'E';

I: 'I';

G: 'G';

L: 'L';

DT: 'DT';

EN: 'EN';

ES: 'ES';

EX: 'EX';

T: 'T';

TL: 'TL';

TR: 'TR';

X: 'X';

SS: 'SS';

SP: 'SP';

S: 'S';

BN: 'BN';

BZ: 'BZ';

RU: 'RU';

RD: 'RD';

RZ: 'RZ';

RN: 'RN';

RC: 'RC';

RP: 'RP';

DC: 'DC';

DP: 'DP';

P: 'P';

// R602 UNDERSCORE -> _
UNDERSCORE: '_';

// R601 alphanumeric-character -> letter | digit | underscore
ALPHANUMERICCHARACTER: LETTER | DIGIT | UNDERSCORE;

// R0002 Letter ->
//         A | B | C | D | E | F | G | H | I | J | K | L | M |
//         N | O | P | Q | R | S | T | U | V | W | X | Y | Z
LETTER: 'A'..'Z' | 'a'..'z'; 

// R0001 Digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
DIGIT: '0'..'9';

//PARSER RULES

programName: NAME;

// R1402 program-stmt -> PROGRAM program-name
programStmt: PROGRAM programName;

typeName: NAME;

// R516 keyword -> name
keyword: NAME;

// R863 implicit-stmt -> IMPLICIT implicit-spec-list | IMPLICIT NONE [( [implicit-name-spec-list] )]
implicitStmt:
        IMPLICIT NONE;

// R709 kind-param -> digit-string | scalar-int-constant-name
kindParam: DIGITSTRING;

// R708 int-literal-constant -> digit-string [_ kind-param]
intLiteralConstant: DIGITSTRING (UNDERSCORE kindParam)?;

// R712 sign -> + | -
sign: PLUS | MINUS;

// R707 signed-int-literal-constant -> [sign] int-literal-constant
signedIntLiteralConstant: sign? intLiteralConstant;

// R1306 r -> int-literal-constant
r: intLiteralConstant;

// R1308 w -> int-literal-constant
w: intLiteralConstant;

// R1309 m -> int-literal-constant
m: intLiteralConstant;

// R1310 d -> int-literal-constant
d: intLiteralConstant;

// R1311 e -> int-literal-constant
e: intLiteralConstant;

// R1312 v -> signed-int-literal-constant
v: signedIntLiteralConstant;

vList: v (COMMA v)*;

// R724 char-literal-constant -> [kind-param _] ' [rep-char]... ' | [kind-param _] " [rep-char]... "
charLiteralConstant: 
        (kindParam UNDERSCORE)? APOSTROPHEREPCHAR
    | (kindParam UNDERSCORE)? QUOTEREPCHAR;

// R1307 data-edit-desc ->
//         I w [. m] | B w [. m] | O w [. m] | Z w [. m] | F w . d |
//         E w . d [E e] | EN w . d [E e] | ES w . d [E e] | EX w . d [E e] |
//         G w [. d [E e]] | L w | A [w] | D w . d |
//         DT [char-literal-constant] [( v-list )]
dataEditDesc:
    I w (DOT m)? |
    B w (DOT m)? |
    O w (DOT m)? |
    Z w (DOT m)? |
    F w DOT d |
    E w DOT d ( E e )? |
    EN w DOT d ( E e )? |
    ES w DOT d ( E e )? |
    EX w DOT d ( E e )? |
    G w (DOT d ( E e )?)? |
    L w |
    A w? |
    D w DOT d |
    DT charLiteralConstant? ( LPAREN vList RPAREN )?;

// R1304 format-item ->
//         [r] data-edit-desc | control-edit-desc | char-string-edit-desc | [r] ( format-items )
formatItem: r? dataEditDesc;

// R1303 format-items -> format-item [[,] format-item]...
formatItems: formatItem (COMMA? formatItem)*;

// R1305 unlimited-format-item -> * ( format-items )
unlimitedFormatItem: ASTERIK LPAREN formatItems RPAREN;

// R1302 format-specification ->
//         ( [format-items] ) | ( [format-items ,] unlimited-format-item )
formatSpecification:
    LPAREN formatItems? RPAREN |  LPAREN (formatItems COMMA)? unlimitedFormatItem  RPAREN;

// R1301 format-stmt -> FORMAT format-specification
formatStmt: FORMAT formatSpecification;

//R506 implicit-part-stmt -> implicit-stmt | parameter-stmt | format-stmt | entry-stmt
implicitPartStmt:
      implicitStmt
    | formatStmt;

//R505 implicit-part -> [implicit-part-stmt]... implicit-stmt
implicitPart: (implicitPartStmt)* implicitStmt;

//R504 specification-part -> [use-stmt]... [import-stmt]... [implicit-part]
// [declaration-construct]...
  specificationPart:
    (implicitPart)?;

// R1403 end-program-stmt -> END [PROGRAM [program-name]]
endProgramStmt: END (PROGRAM programName?)?;

// R1401 main-program ->
//         [program-stmt] [specification-part] [execution-part]
//         [internal-subprogram-part] end-program-stmt
///COMMENT: WHY ? after programStmt
  mainProgram:
      programStmt? specificationPart? endProgramStmt;

//R502 program-unit -> main-program | external-subprogram | module | submodule | block-data
programUnit:
    mainProgram;

//R501 program -> program-unit [program-unit]...    
program: programUnit (programUnit)*;      

Test File: FortranTest.f90

FORMAT(I 12)

Commands:

antlr4 FortranTestF18.g4 
javac *.java
grun FortranTestF18 formatStmt -tokens FortranTest.f90 

Grun Output:

[@0,0:5='FORMAT',<FORMAT>,1:0]
[@1,6:6='(',<'('>,1:6]
[@2,7:7='I',<NAME>,1:7]
[@3,9:10='12',<DIGITSTRING>,1:9]
[@4,11:11=')',<')'>,1:11]
[@5,12:11='<EOF>',<EOF>,1:12]
line 1:7 no viable alternative at input '(I'

Here, the token I is recognized as NAME, but I want it to be matched by the lexer rule I: 'I';. However, if I move the lexer rule I above NAME, then identifiers can no longer be named 'I'. How do I solve this problem?
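For context, the grun output is consistent with how an ANTLR4 lexer chooses between rules: it takes the rule that matches the longest input, and when several rules match the same length, the one declared first wins. The Python sketch below only simulates that selection with regular expressions. It is not ANTLR code; `RULES`, `REORDERED`, and `pick_rule` are hypothetical names, and the patterns are simplified stand-ins for three rules of the grammar above.

```python
import re

# Simplified stand-ins for three lexer rules of the grammar above, listed in
# the order they are declared there (NAME comes before I).
RULES = [
    ("FORMAT", r"FORMAT|Format|format"),
    ("NAME", r"[A-Za-z][A-Za-z0-9_]*"),
    ("I", r"I"),
]

# The reordering described in the question: I is moved above NAME.
REORDERED = [RULES[0], RULES[2], RULES[1]]


def pick_rule(text, rules):
    """Return (rule_name, lexeme): longest match wins, earliest rule on ties."""
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Strict '>' keeps the earlier rule when two matches have equal length.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


print(pick_rule("FORMAT(I 12)", RULES))  # FORMAT is declared before NAME
print(pick_rule("I 12)", RULES))         # NAME and I tie; NAME is declared first
print(pick_rule("I 12)", REORDERED))     # I now wins the tie
print(pick_rule("I12)", REORDERED))      # NAME matches more characters
```

With the original order, `I` ties with `NAME` and loses because `NAME` is declared first. With `I` moved up, every standalone `I` becomes the `I` token, so it can no longer serve as an identifier, and an edit descriptor written without a space such as `I12` would still be matched as a single `NAME` because that match is longer. Rule order alone therefore cannot express "I is a keyword only inside a format specification"; that depends on context the lexer does not have in this grammar.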

jimidle commented 5 months ago

You need to take a step back before attempting this. You are trying to construct a grammar directly from a normative spec instead of thinking about how the grammar should be made compatible with the normative spec. Start with this:

You will get nowhere with the lexer and parser you have right now, and it will frustrate you. Take a look at some existing grammars to get a feel for it, and write yourself some small parsers, such as a calculator or equation parser or something else simple. The mistakes you make on the small tasks will guide you in the creation of a larger system such as a Fortran parser.

Beware of XY questions, which is what you have here.