GerHobbelt / jison

bison / YACC / LEX in JavaScript (LALR(1), SLR(1), etc. lexer/parser generator)
https://gerhobbelt.github.io/jison/
MIT License
118 stars 20 forks source link

Problem when detecting multiple token #35

Closed juliandavidmr closed 6 years ago

juliandavidmr commented 6 years ago

I am doing a basic transpiler, it consists in translating certain syntax with already defined grammar. I still have some drawbacks when defining the detection of multiple tokens. That is, using to specify the repetition of a regex token _(I think the `is used to define the repetition of a token)_, for example (seeSENTENCE*`):

FUNCTION
    : DEF ID PAR_OPEN PAR_CLOSE
        SENTENCE*
      END
        { $$ = 'function ' + $2 + '(){' + $5 + '}' };

SENTENCE
    : PRINT
    | VAR_ASSIGN
    |;

Grammar works with the input:

def hello()
    println "dsasd"
end

but it does not work with the entry:

def hello()
    a = 3
    println "dsasd"
end

The error thrown is:

Error: Parse error on line 3:
...llo()    a = 3    println "dsasd"end
---------------------^
Expecting 'EOF', '+', 'OR', 'AND', '=', '<>', '-', '*', '/', '>', '<', '>=', '<=','^', 'PAR_CLOSE', '%', 'END', got 'PRINTLN'

Could you tell me what I'm doing wrong?

See Gist complete code

GerHobbelt commented 6 years ago

Given your example inputs, we're looking at a language which has "significant whitespace" as part of the language definition (such as python), which your lexer and/or parser does not seem to take into account given the error message: I don't see anything in the tokenlist there that hints you're also looking for a 'End Of Statement' token, i.e. a statement terminator.

Then having a look at the precise grammar, it turns out there's at least one ambiguity in there: you accept an EMPTY statement while you also have * ZERO-OR-MORE EBNF operator in there, so I'ld write it like this instead:

FUNCTION
    : DEF ID PAR_OPEN PAR_CLOSE
        SENTENCE*
      END
;

SENTENCE
    : PRINT
    | VAR_ASSIGN
;

Next, and this is probably the culprit, is hinted at by the error message you report, when compared to the grammar: there's no PRINTLN token recognized by this grammar's SENTENCE rule at all? For I wouldn't expect this error when this bit of grammar would be present too:

PRINT
  : PRINTLN STRING
;

(... having a look at the gist now ...)

Okay, there's other/more bits slightly wrong there:

$ node dist/cli-cjs.js examples/issue-35-gho.jison --main
W:\Users\Ger\Projects\sites\library.visyond.gov\80\lib\tooling\jison\dist\cli-cjs.js:26559
            throw err;
            ^

JisonLexerError: Lexical error on line 123:
unsupported parser input: "*"
while lexing in "bnf" state.

  Erroneous area:
120:
121: FUNCTION
122:     : DEF ID PAR_OPEN PAR_CLOSE
123:         SENTENCE*
^^^..................^
124:       END
125:         { $$ = 'function ' + $2 + '(){' + $5 + '}' }
    at Object.parseError (W:\Users\Ger\Projects\sites\library.visyond.gov\80\lib\tooling\jison\dist\cli-cjs.js:21094:15)
    at Object.lexer_parseError [as parseError] (W:\Users\Ger\Projects\sites\library.visyond.gov\80\lib\tooling\jison\dist\cli-cjs.js:22995:44)
    at Object.yyError [as yyerror] (W:\Users\Ger\Projects\sites\library.visyond.gov\80\lib\tooling\jison\dist\cli-cjs.js:23029:19)
    at Object.lexer__performAction [as performAction] (W:\Users\Ger\Projects\sites\library.visyond.gov\80\lib\tooling\jison\dist\cli-cjs.js:24676:13)
    at Object.lexer_test_match [as test_match] (W:\Users\Ger\Projects\sites\library.visyond.gov\80\lib\tooling\jison\dist\cli-cjs.js:23858:34)
    at Object.lexer_next [as next] (W:\Users\Ger\Projects\sites\library.visyond.gov\80\lib\tooling\jison\dist\cli-cjs.js:23976:22)
    at Object.lexer_fastLex [as fastLex] (W:\Users\Ger\Projects\sites\library.visyond.gov\80\lib\tooling\jison\dist\cli-cjs.js:24080:18)
    at fastLex (W:\Users\Ger\Projects\sites\library.visyond.gov\80\lib\tooling\jison\dist\cli-cjs.js:21656:27)
    at Object.parse (W:\Users\Ger\Projects\sites\library.visyond.gov\80\lib\tooling\jison\dist\cli-cjs.js:21822:30)
    at Object.parse (W:\Users\Ger\Projects\sites\library.visyond.gov\80\lib\tooling\jison\dist\cli-cjs.js:25321:23)

which, after fixing, moves on to the next bit: the above-mentioned ambiguity and a few more:

$ node dist/cli-cjs.js examples/issue-35-gho.jison --main
WARNING: missing actions for states:  [ 34, 35, 36, 37, 38, 39 ]
WARNING: missing actions for states:  [ 34, 35, 36, 37, 38, 39 ]
WARNING: missing actions for states:  [ 34, 35, 36, 37, 38, 39 ]

States with conflicts:

State 51    (ID @ SENTENCE -> .)

   FUNCTION -> DEF ID PAR_OPEN PAR_CLOSE FUNCTION_repetition .END   #lookaheads= [EOF]
  FUNCTION_repetition -> FUNCTION_repetition .SENTENCE   #lookaheads= [END]  [PRINTLN]  [ID]
  SENTENCE -> .PRINT                     #lookaheads= [END]  [PRINTLN]  [ID]
  SENTENCE -> .VAR_ASSIGN                #lookaheads= [END]  [PRINTLN]  [ID]
  SENTENCE -> .                          #lookaheads= [END]  [PRINTLN]  [ID]
  PRINT -> .PRINTLN PRINT_option e PRINT_option2   #lookaheads= [END]  [PRINTLN]  [ID]
  PRINT -> .PRINTLN e                    #lookaheads= [END]  [PRINTLN]  [ID]
  VAR_ASSIGN -> .ID ASSIGN e             #lookaheads= [END]  [PRINTLN]  [ID]

State 56    (PI @ PRINT_option -> .)

   PRINT -> PRINTLN .PRINT_option e PRINT_option2   #lookaheads= [END]  [PRINTLN]  [ID]
  PRINT -> PRINTLN .e                    #lookaheads= [END]  [PRINTLN]  [ID]
  PRINT_option -> .                      #lookaheads= [NOT]  [-]  [(]  [TRUE]  [FALSE]  [NUMBER]  [STRING]  [IDENT]  [E]  [PI]
  PRINT_option -> .PAR_OPEN              #lookaheads= [NOT]  [-]  [(]  [TRUE]  [FALSE]  [NUMBER]  [STRING]  [IDENT]  [E]  [PI]
  e -> .e + e
  e -> .NOT e
  e -> .e OR e
  e -> .e AND e
  e -> .e = e
  e -> .e <> e
  e -> .e - e
  e -> .e * e
  e -> .e / e
  e -> .e > e
  e -> .e < e
  e -> .e >= e
  e -> .e <= e
  e -> .e ^ e
  e -> .- e
  e -> .( e PAR_CLOSE
  e -> .e %
  e -> .TRUE
  e -> .FALSE
  e -> .NUMBER
  e -> .STRING
  e -> .IDENT
  e -> .E
  e -> .PI

----------------------------------- NOTICE -------------------------------
Attempting to resolve the unresolved conflicts in partial LR mode...

When no conflicts are reported in the next round below, your grammar is
accepted as mixed LR/LALR and should work as expected.
--------------------------------------------------------------------------

States with conflicts:

State 51    (ID @ SENTENCE -> .)

   FUNCTION -> DEF ID PAR_OPEN PAR_CLOSE FUNCTION_repetition .END   #lookaheads= [EOF]
  FUNCTION_repetition -> FUNCTION_repetition .SENTENCE   #lookaheads= [END]  [PRINTLN]  [ID]
  SENTENCE -> .PRINT                     #lookaheads= [END]  [PRINTLN]  [ID]
  SENTENCE -> .VAR_ASSIGN                #lookaheads= [END]  [PRINTLN]  [ID]
  SENTENCE -> .                          #lookaheads= [END]  [PRINTLN]  [ID]
  PRINT -> .PRINTLN PRINT_option e PRINT_option2   #lookaheads= [END]  [PRINTLN]  [ID]
  PRINT -> .PRINTLN e                    #lookaheads= [END]  [PRINTLN]  [ID]
  VAR_ASSIGN -> .ID ASSIGN e             #lookaheads= [END]  [PRINTLN]  [ID]

State 56    (PI @ PRINT_option -> .)

   PRINT -> PRINTLN .PRINT_option e PRINT_option2   #lookaheads= [END]  [PRINTLN]  [ID]
  PRINT -> PRINTLN .e                    #lookaheads= [END]  [PRINTLN]  [ID]
  PRINT_option -> .                      #lookaheads= [NOT]  [-]  [(]  [TRUE]  [FALSE]  [NUMBER]  [STRING]  [IDENT]  [E]  [PI]
  PRINT_option -> .PAR_OPEN              #lookaheads= [NOT]  [-]  [(]  [TRUE]  [FALSE]  [NUMBER]  [STRING]  [IDENT]  [E]  [PI]
  e -> .e + e
  e -> .NOT e
  e -> .e OR e
  e -> .e AND e
  e -> .e = e
  e -> .e <> e
  e -> .e - e
  e -> .e * e
  e -> .e / e
  e -> .e > e
  e -> .e < e
  e -> .e >= e
  e -> .e <= e
  e -> .e ^ e
  e -> .- e
  e -> .( e PAR_CLOSE
  e -> .e %
  e -> .TRUE
  e -> .FALSE
  e -> .NUMBER
  e -> .STRING
  e -> .IDENT
  e -> .E
  e -> .PI

Unused productions in your grammar:
  UNARY_OPERATOR[*RIP*] -> &
  UNARY_OPERATOR[@3 *RIP*] -> *
  UNARY_OPERATOR[@2 *RIP*] -> +
  UNARY_OPERATOR[@2 *RIP*] -> -
  UNARY_OPERATOR[*RIP*] -> ~
  UNARY_OPERATOR[*RIP*] -> !

JISON output for module [issue_35Gho] has been written to file: issue-35-gho.js

which translates to your grammar not being LALR(1) nor LR(1) (which is the mode for the compile retry phase mentioned in the error report).

Removing the ambiguous handling of the empty statement list as mentioned above removes one of the listed conflicts at least:

SENTENCE
    : PRINT
    | VAR_ASSIGN
//    |
;

==>

$ node dist/cli-cjs.js examples/issue-35-gho.jison --main
WARNING: missing actions for states:  [ 33, 34, 35, 36, 37, 38 ]
WARNING: missing actions for states:  [ 33, 34, 35, 36, 37, 38 ]
WARNING: missing actions for states:  [ 33, 34, 35, 36, 37, 38 ]

States with conflicts:

State 56    (PI @ PRINT_option -> .)

   PRINT -> PRINTLN .PRINT_option e PRINT_option2   #lookaheads= [END]  [PRINTLN]  [ID]
  PRINT -> PRINTLN .e                    #lookaheads= [END]  [PRINTLN]  [ID]
  PRINT_option -> .                      #lookaheads= [NOT]  [-]  [(]  [TRUE]  [FALSE]  [NUMBER]  [STRING]  [IDENT]  [E]  [PI]
  PRINT_option -> .PAR_OPEN              #lookaheads= [NOT]  [-]  [(]  [TRUE]  [FALSE]  [NUMBER]  [STRING]  [IDENT]  [E]  [PI]
  e -> .e + e
  e -> .NOT e
  e -> .e OR e
  e -> .e AND e
  e -> .e = e
  e -> .e <> e
  e -> .e - e
  e -> .e * e
  e -> .e / e
  e -> .e > e
  e -> .e < e
  e -> .e >= e
  e -> .e <= e
  e -> .e ^ e
  e -> .- e
  e -> .( e PAR_CLOSE
  e -> .e %
  e -> .TRUE
  e -> .FALSE
  e -> .NUMBER
  e -> .STRING
  e -> .IDENT
  e -> .E
  e -> .PI

----------------------------------- NOTICE -------------------------------
Attempting to resolve the unresolved conflicts in partial LR mode...

When no conflicts are reported in the next round below, your grammar is
accepted as mixed LR/LALR and should work as expected.
--------------------------------------------------------------------------

States with conflicts:

State 56    (PI @ PRINT_option -> .)

   PRINT -> PRINTLN .PRINT_option e PRINT_option2   #lookaheads= [END]  [PRINTLN]  [ID]
  PRINT -> PRINTLN .e                    #lookaheads= [END]  [PRINTLN]  [ID]
  PRINT_option -> .                      #lookaheads= [NOT]  [-]  [(]  [TRUE]  [FALSE]  [NUMBER]  [STRING]  [IDENT]  [E]  [PI]
  PRINT_option -> .PAR_OPEN              #lookaheads= [NOT]  [-]  [(]  [TRUE]  [FALSE]  [NUMBER]  [STRING]  [IDENT]  [E]  [PI]
  e -> .e + e
  e -> .NOT e
  e -> .e OR e
  e -> .e AND e
  e -> .e = e
  e -> .e <> e
  e -> .e - e
  e -> .e * e
  e -> .e / e
  e -> .e > e
  e -> .e < e
  e -> .e >= e
  e -> .e <= e
  e -> .e ^ e
  e -> .- e
  e -> .( e PAR_CLOSE
  e -> .e %
  e -> .TRUE
  e -> .FALSE
  e -> .NUMBER
  e -> .STRING
  e -> .IDENT
  e -> .E
  e -> .PI

Unused productions in your grammar:
  UNARY_OPERATOR[*RIP*] -> &
  UNARY_OPERATOR[@3 *RIP*] -> *
  UNARY_OPERATOR[@2 *RIP*] -> +
  UNARY_OPERATOR[@2 *RIP*] -> -
  UNARY_OPERATOR[*RIP*] -> ~
  UNARY_OPERATOR[*RIP*] -> !

JISON output for module [issue_35Gho] has been written to file: issue-35-gho.js

Closer inspection of the grammar reveals the PRINT rule to be ambiguous: the way it is written you wish to accept all these statements:

println e
println ( e
println e )
println ( e )

as legal complete statements, while ( (PAR_OPEN) following println can also start (and thus be part of) the e expression itself -- though the e alternatives list seems 'tweaked' in an attempt to prevent this as there PAR_OPEN is not listed, but a literal ( instead... but ambiguity is also elsewehere:

PRINT rule productions alt1 and alt2 are inherently ambiguous as alt1 can parse the same material as alt2:

PRINT
    : PRINTLN PAR_OPEN? e PAR_CLOSE?    // alt1
        { $$ = 'echo ' + $e + ';' }
    | PRINTLN e                            // alt2
        { $$ = 'echo ' + $e + ';' };

as alt1 can parse all these variants:

println e
println ( e
println e )
println ( e )

while alt2 can parse this:

println e

The LALR/LR compiler in jison will find this duplication and reports the conflict as a consequence.

Removing alt2 from that rule removes the second conflict and produces a parser:

PRINT
    : PRINTLN PAR_OPEN? e PAR_CLOSE?
        { $$ = 'echo ' + $e + ';' }
;

==>

$ node dist/cli-cjs.js examples/issue-35-gho.jison --main
WARNING: missing actions for states:  [ 32, 33, 34, 35, 36, 37 ]
WARNING: missing actions for states:  [ 32, 33, 34, 35, 36, 37 ]
WARNING: missing actions for states:  [ 32, 33, 34, 35, 36, 37 ]

Unused productions in your grammar:
  UNARY_OPERATOR[*RIP*] -> &
  UNARY_OPERATOR[@3 *RIP*] -> *
  UNARY_OPERATOR[@2 *RIP*] -> +
  UNARY_OPERATOR[@2 *RIP*] -> -
  UNARY_OPERATOR[*RIP*] -> ~
  UNARY_OPERATOR[*RIP*] -> !

JISON output for module [issue_35Gho] has been written to file: issue-35-gho.js

Running the test code which I added (combined with that jison --main option):

// ----------------------------------------------------------------------------------------

%%

// feature of the GH fork: specify your own main.
//
// compile with
// 
//      jison -o test.js --main that/will/be/me.jison
//
// then run
//
//      node ./test.js
//
// to see the output.

var assert = require("assert");

parser.main = function () {
    var inputs = [
        `
def hello()
    println "dsasd"
end
        `, `
def hello()
    a = 3
    println "dsasd"
end
        `
    ];
    for (var i = 0, l = inputs.length; i < l ; i++) {
        var rv = parser.parse(inputs[i]);
        console.log(inputs[i], " ==> ", rv);
    }

    // if you get past the assert(), you're good.
    console.log("tested OK");
};

==>

$ node issue-35-gho.js

def hello()
    println "dsasd"
end
          ==>  function hello(){echo "dsasd";}

W:\Users\Ger\Projects\sites\library.visyond.gov\80\lib\tooling\jison\issue-35-gho.js:2175
            throw ex;
            ^
JisonParserError: Parse error on line 3:
    a = 3
------^
Expecting "ASSIGN", got unexpected EQUAL
    at Object.parseError (W:\Users\Ger\Projects\sites\library.visyond.gov\80\lib\tooling\jison\issue-35-gho.js:1512:15)
    at Object.parse (W:\Users\Ger\Projects\sites\library.visyond.gov\80\lib\tooling\jison\issue-35-gho.js:2028:30)
    at Parser.parser.main (W:\Users\Ger\Projects\sites\library.visyond.gov\80\lib\tooling\jison\issue-35-gho.js:4089:25)
    at Object.exports.main (W:\Users\Ger\Projects\sites\library.visyond.gov\80\lib\tooling\jison\issue-35-gho.js:4125:29)
    at Object.<anonymous> (W:\Users\Ger\Projects\sites\library.visyond.gov\80\lib\tooling\jison\issue-35-gho.js:4151:11)
    at Module._compile (module.js:635:30)
    at Object.Module._extensions..js (module.js:646:10)
    at Module.load (module.js:554:32)
    at tryModuleLoad (module.js:497:12)
    at Function.Module._load (module.js:489:3)

which shows the grammar needs a little cleanup in the token/lexer department to work; as I found it odd that you had these lexer rules:

"="                   return 'EQUAL'
"!="                  return '='

I change them to the JavaScript language rules to disambiguate assignment and equality comparison operators (JavaScript: = vs. == / ===):

"="                   return 'ASSIGN'
"=="                  return '='
"!="                  return '<>'
"<>"                  return '<>'

while making sure I use the token name matching the one used in the grammar.

Now if you correct the brackets in the e set, you'll run into another set of conflicts due to the odd println patterns you're willing to accept, where you use the same PAR_OPEN after println ==> the grammar compiler then cannot conclude which path to follow when println is followed by PAR_OPEN: should it match the PRINT rule part PAR_OPEN? before e or should it already enter e and match the PAR_OPEN in there for the alternative e: PAR_OPEN e PAR_CLOSE?

Hence after this fix:

e
    : e '+' e
        {$$ = $1 + $3;}
    | 'NOT' e
        {$$ = !$2;}
    | e 'OR' e
        {$$ = $1 || $3;}
    | e 'AND' e
        {$$ = $1 && $3;}
    | e '=' e
        {$$ = $1 == $3;}
    | e '<>' e
        {$$ = $1 != $3;}
    | e '-' e
        {$$ = $1 - $3;}
    | e '*' e
        {$$ = $1 * $3;}
    | e '/' e
        {$$ = $1 / $3;}
    | e '>' e
        {$$ = $1 > $3;}
    | e '<' e
        {$$ = $1 < $3;}
    | e '>=' e
        {$$ = $1 >= $3;}
    | e '<=' e
        {$$ = $1 <= $3;}
    | e '^' e
        {$$ = Math.pow($1, $3);}
    | '-' e %prec UMINUS
        {$$ = -$2;}
    | PAR_OPEN e PAR_CLOSE      // fix: PAR_OPEN instead of literal '('
        {$$ = $2;}
    | e '%'
        {$$ = $1 / 100;}
    | TRUE
        {$$ = true;}
    | FALSE
        {$$ = false;}
    | NUMBER
        {$$ = Number(yytext);}
    | STRING
        {$$ = $1;}
    | IDENT
        {$$ = __data__[$1];}
    | E
        {$$ = Math.E;}
    | PI
        {$$ = Math.PI;}    
;

we need to remove the ? ONE-OR-ZERO choice operator in the PRINT rule. Assuming you want to accept these variants:

println ( e )
println e

one might be inclined to write:

PRINT
    : PRINTLN PAR_OPEN e PAR_CLOSE
        { $$ = 'echo ' + $e + ';' }
    | PRINTLN e
        { $$ = 'echo ' + $e + ';' }
;

but this will introduce another ambiguity as there's already

e: PAR_OPEN e PAR_CLOSE

to match the first PRINT production, hence this suffices:

PRINT
    : PRINTLN e
        { $$ = 'echo ' + $e + ';' }
;

to match both those statements.

If you only wish to accept

println(e)

then the PRINT rule should read:

PRINT
    : PRINTLN PAR_OPEN e PAR_CLOSE
        { $$ = 'echo ' + $e + ';' }
;
GerHobbelt commented 6 years ago

The above is contained in the adjusted gist: https://gist.github.com/GerHobbelt/082eaff73fbb442abeb8a2ed7b033d97/revisions

command line to compile:

jison --main grammar.jison

when you want jison to produce a JS file with a parser which can be invoked from the (NodeJS) commandline like this:

node grammar.js
juliandavidmr commented 6 years ago

Hello, your response has been very good, you have really helped me understand several elements of Jison. I will apply your recommendations to a University project called rp , it is a project I am currently working on in my spare time. Thanks @GerHobbelt

Sent from my Moto G(4) using FastHub

juliandavidmr commented 6 years ago

Thank you for taking the time to analyze and answer my question. I'll be going over all the details you just showed me.

GerHobbelt commented 6 years ago

Closing this issue then. Re-open if there's questions about this material; otherwise it's better to open another issue to keep them more or less 'atomic'.

Cheers.