Inconsistent processing of grammar between XText and Langium

steve-hickman-epistimis commented 1 year ago

The following set of XText grammar rules will be processed correctly by ANTLR3 without ambiguity:

/** An Enumerated is a value type representing a set of named values, each with specific meaning. */
LogicalEnumerated:

    'enum'  name=ID (description=STRING)? 
        ('stdRef:' standardReference=STRING)?
        '['
        label+=(LogicalEnumerationLabel | LogicalEnumeratedSet) ','? (label+=(LogicalEnumerationLabel| LogicalEnumeratedSet) ','?)*
        ']'
        ';'
        ;

/**EXTENSION: Creates the foundation for a hierarchy of enumerated values  */
LogicalEnumeratedBase:
    LogicalEnumeratedSet |
    LogicalEnumerationLabel |
    LogicalEnumerated
;

/**EXTENSION: allowing label to be a LogicalEnumeratedSet enables taxonomy definitions via enumerations  */
LogicalEnumeratedSet:
        '(' name=ID (description=STRING)? 
        '['
        label+=(LogicalEnumerationLabel | LogicalEnumeratedSet) ','? (label+=(LogicalEnumerationLabel| LogicalEnumeratedSet) ','?)*
        ']' ')'
        ;

/** An EnumerationLabel defines a named member of an Enumerated value set. */
LogicalEnumerationLabel:    name=ID |   '(' name=ID description=STRING ')' ;

From these rules, xtext2Langium generates the following:

LogicalEnumerated infers LogicalEnumerated:
    'enum' name=ID  (description=STRING )? ('stdRef:' standardReference=STRING  )? '[' label+=(LogicalEnumerationLabel | LogicalEnumeratedSet ) ','? (label+=(LogicalEnumerationLabel | LogicalEnumeratedSet ) ','? )* ']' ';'  
;

LogicalEnumeratedBase infers LogicalEnumeratedBase:
    LogicalEnumeratedSet | LogicalEnumerationLabel | LogicalEnumerated 
;

LogicalEnumeratedSet infers LogicalEnumeratedSet:
    '(' name=ID  (description=STRING )? '[' label+=(LogicalEnumerationLabel | LogicalEnumeratedSet ) ','? (label+=(LogicalEnumerationLabel | LogicalEnumeratedSet ) ','? )* ']' ')'  
;

LogicalEnumerationLabel infers LogicalEnumerationLabel:
    name=ID  | '(' name=ID  description=STRING  ')'  
;

Running npm run langium:generate on the generated Langium grammar results in the following errors:

Error: Parser definition errors detected:
-------------------------------
:176 - Ambiguous Alternatives Detected: <1 ,2> in <OR1> inside <LogicalEnumerated> Rule,
<(:KW, ID, STRING> may appears as a prefix path in all these alternatives.
See: https://chevrotain.io/docs/guide/resolving_grammar_errors.html#AMBIGUOUS_ALTERNATIVES
For Further details.
-------------------------------
:176 - Ambiguous Alternatives Detected: <1 ,2> in <OR2> inside <LogicalEnumerated> Rule,
<(:KW, ID, STRING> may appears as a prefix path in all these alternatives.
See: https://chevrotain.io/docs/guide/resolving_grammar_errors.html#AMBIGUOUS_ALTERNATIVES
For Further details.
-------------------------------
:180 - Ambiguous Alternatives Detected: <1 ,2> in <OR1> inside <LogicalEnumeratedBase> Rule,
<(:KW, ID, STRING> may appears as a prefix path in all these alternatives.
See: https://chevrotain.io/docs/guide/resolving_grammar_errors.html#AMBIGUOUS_ALTERNATIVES
For Further details.
-------------------------------
:184 - Ambiguous Alternatives Detected: <1 ,2> in <OR1> inside <LogicalEnumeratedSet> Rule,
<(:KW, ID, STRING> may appears as a prefix path in all these alternatives.
See: https://chevrotain.io/docs/guide/resolving_grammar_errors.html#AMBIGUOUS_ALTERNATIVES
For Further details.
-------------------------------
:184 - Ambiguous Alternatives Detected: <1 ,2> in <OR2> inside <LogicalEnumeratedSet> Rule,
<(:KW, ID, STRING> may appears as a prefix path in all these alternatives.
See: https://chevrotain.io/docs/guide/resolving_grammar_errors.html#AMBIGUOUS_ALTERNATIVES
For Further details.

I wasn't sure if this is a problem with xtext2langium (not doing a proper conversion) or a problem with Chevrotrain. it seems like the generated .langium is syntactically correct, so xtext2langium should be OK.

Is this is a weakness in Chevrotrain? Or something else?

I was able to fix this by changing:

LogicalEnumeratedSet infers LogicalEnumeratedSet:
    '(' name=ID  (description=STRING )? '[' label+=(LogicalEnumerationLabel | LogicalEnumeratedSet ) ','? (label+=(LogicalEnumerationLabel | LogicalEnumeratedSet ) ','? )* ']' ')'  
;

to

LogicalEnumeratedSet infers LogicalEnumeratedSet:
    '{' name=ID  (description=STRING )? '[' label+=(LogicalEnumerationLabel | LogicalEnumeratedSet ) ','? (label+=(LogicalEnumerationLabel | LogicalEnumeratedSet ) ','? )* ']' '}'  
;

(Replacing the outer containing parentheses with curly braces).

spoenemann commented 1 year ago

Chevrotain has a fixed lookahead that defaults to three tokens, which is not sufficient in this example (you need four tokens to disambiguate). But we switched to a dynamic lookahead in Langium. @msujew the new lookahead should be able to handle the situation, right?

Which lookahead is chosen depends on the parser settings in langium-config.json

msujew commented 1 year ago

@steve-hickman-epistimis I'm not sure what version of Langium you're using, but version 1.0 should come with unrestricted lookahead by default. It should be able to handle any ANTLR3 grammar without any issues (and more). You might need to remove the chevrotainParserConfig#maxLookahead property from your langium config.

steve-hickman-epistimis commented 1 year ago

As far as I know, I'm using the most current version (sorry for the delay in responding - I had Covid)

dhuebner commented 1 year ago

@steve-hickman-epistimis So you are using v1.2.0 ? Did you tried to do what @msujew suggested?

steve-hickman-epistimis commented 1 year ago

I just updated the langium generator to the current version (1.2). I reverted the change listed above and that problem does not appear.

On Chevrotain config: Honestly, I'm so rusty on npm that I don't know where the langium config is - so I don't know how to change the lookahead.

One other note: While I do not get the error above now, I do get this:

src/language-server/universal-data-definition-language.langium:511:36 - Keywords cannot only consist of whitespace characters.

The complaint is about:

hidden terminal WS returns string:(' ' | '\t' | '\r' | '\n' )+;

I would ignore this except I also get:

Langium generator failed.

Seems like this should be a configuration issue as well?

dhuebner commented 1 year ago

@steve-hickman-epistimis I think the best would be to ask further Langium related question in the Langium repository so that others can find your issue and find the solution :)

I tend to close this issue as not Xtext2Langium related.

steve-hickman-epistimis commented 1 year ago

It appears that this is caused by xtext2langium reading the Terminals.xtext file and incorrectly converting the content. It generated

hidden terminal ML_COMMENT returns string:'/*'  -> '*/'  ;
hidden terminal SL_COMMENT returns string:'//'  !('\n' | '\r' )('\r'? '\n' )?  ;
hidden terminal WS returns string:(' ' | '\t' | '\r' | '\n' )+;

The last line being the problem.

I think the simple solution is that xtext2langium should generate

hidden terminal WS:             /\s+/;

for WS instead.

msujew commented 1 year ago

Given that some devs might change the definition of their WS token, just generating \s+ might be a potentially destructive operation. We should probably just allow empty keywords in terminals.

dhuebner commented 1 year ago

@steve-hickman-epistimis Regarding maxLookahead config, it can be changed in langium-config.json file inside the chevrotainParserConfig section. See this example.

Empty keywords in terminals like hidden terminal WS returns string:(' ' | '\t' | '\r' | '\n' )+; are now allowed in Langium. So I'm closing the issue. Thanks for reporting!

TypeFox / xtext2langium

Inconsistent processing of grammar between XText and Langium #9