eclipse-langium / langium

Next-gen language engineering / DSL framework
https://langium.org/
MIT License
665 stars 61 forks source link

Nodes that consume no text are undefined #1398

Closed drhagen closed 4 months ago

drhagen commented 4 months ago

Langium version: 3.0.0

When a node succeeds but consumes no input, it is undefined rather than a well-defined AstNode object.

Consider this grammar:

grammar HelloWorld

entry Model:
    'header' persons=Persons;

Persons:
    ((elements+=ID) (elements+=ID)*)?;

hidden terminal WS: /\s+/;
terminal ID: /[_a-zA-Z][\w_]*/;

when parsing this file:

header

When there are no persons listed, I would expect the resulting AST to be something like {persons: {elements: []}}. Instead, it is {persons: undefined}.

msujew commented 4 months ago

This is by design, the parser will never attempt to construct synthetic AST nodes (which {elements: []} would be) and doing so likely leads to very unexpected behavior, as there is no CST to associate to it. This would create a cascade of issues over all validations, scoping and LSP services.

In fact, I would argue that a completely optional parser rule is actually a grammar error and should be validated as such. I've created https://github.com/eclipse-langium/langium/issues/1400 to track this.

drhagen commented 4 months ago

That makes sense. If it is not a node but merely an attribute, is it still expected that the result is undefined? For example, this grammar

grammar HelloWorld

entry Model:
    'header'
    (persons+=ID ('+' persons+=ID)*)?
;

hidden terminal WS: /\s+/;
terminal ID: /[_a-zA-Z][\w_]*/;

when parsing

header

also generates the AST {persons: undefined}.

This is unexpected because the generated interface does not say this is an optional attribute:

export interface Model extends AstNode {
    readonly $type: 'Model';
    persons: Array<string>;
}
msujew commented 4 months ago

@drhagen There are two special rules in particular when it comes to AST construction about array and boolean properties. Both of these get a default value, [] and false respectively. Everything else will be undefined, assuming it hasn't been parsed properly.

I'm not too happy about this either, since it effectively misrepresents the state of an AST node in the TypeScript type system in the presence of parser errors. Consequently, Langium gives no guarantee about the "correctness" of an AST node if the parser encounters those issues. On the other hand, the AST is guaranteed to align to the generated types in case the parser returns without any errors. While services such as validation and scoping need to be able to deal with "incorrect" AST nodes, services such as code generators, interpreters or compilers (which only run on validated ASTs) can safely assume that these non-nullable properties are indeed available.

There is always a tradeoff when it comes to these things. It might make sense to have a Unconfirmed<AstNode> type that sets all properties (except for the base AST properties like $type) to be optionally undefined. That type can then be used in scoping and validation code to ensure the correct representation of erroneous ASTs.

drhagen commented 4 months ago

On the other hand, the AST is guaranteed to align to the generated types in case the parser returns without any errors.

header is a valid document for the parser in my previous message. It parses without error, but model.person is still undefined during validation of Model. Does the validation code run before parsing is complete and the default values are assigned?

msujew commented 4 months ago

It parses without error, but model.person is still undefined during validation of Model.

Yes, that's exactly why I said it's supposed to be a grammar error. It produces a nonsensical AST, even on correct input, because the grammar isn't written in a correct way. My thoughts are: correct grammar + correct input = correct AST.

Does the validation code run before parsing is complete and the default values are assigned?

No, validation is the last phase of the document lifecycle, see here.