ethereum / fe

Emerging smart contract language for the Ethereum blockchain.
https://fe-lang.org
Other
1.6k stars 178 forks source link

Parser recovery rework #972

Closed sbillig closed 4 months ago

sbillig commented 8 months ago

What was wrong?

How was it fixed?

To-Do

sbillig commented 5 months ago

@Y-Nak I guess I'm done with this. It's not perfect, but I think it's better.

Y-Nak commented 5 months ago

This is probably related to how to handle opening/closing brackets thing since these bracket kinds make a semantic scope that should break the inheritance of the recovery token generally, I guess.

sbillig commented 5 months ago

I think we still need Override or a similar way to prohibit the parser from using the parent scope's recovery tokens, so that error recovery doesn't change the semantics of the user's code.

Preventing the parser from using the parent scope's recovery tokens would change the semantics of this code:

impl Foo {

struct S {}
fn foo(x: i32) {
1
}

(ie the user didn't close the impl block, or is in the process of typing it).

I think it's impossible to avoid misinterpreting the semantics of the user's code in all cases. I think the "best" solution is to use the indentation level as a hint about the user's intention. In your example, the indentation implies that the struct is being defined inside the impl block, and could thus be recognized as disallowed. That said, I don't currently have the energy to implement such a thing.

Lacking indentation hints, when the parser finds an unexpected token which is an explicit recovery token in a parent scope, I think the most reasonable thing to do is to stop trying to parse the current scope and defer to the parent. If we allow the current scope to continue in the face of unexpected tokens, a single unclosed block could conceivably swallow the entire remainder of the file (unless we do something like treat some available closing brace as the end of the block, which can also introduce changes to semantics).

I chose to prioritize the case where the user is currently editing the file, or made a simple error (like a missing closing brace), over more complicated errors (like defining something where it's not allowed). The current implementation should handle the former reasonably well, but will change the semantics in the latter. Maybe a better error message could explain to the user that in your example the parser has decided that the struct keyword implies that the impl block is closed.

Y-Nak commented 5 months ago

I'm sorry for the explanation that was too vague. I do understand that it's essentially a heuristic issue, regardless of how much we delve into it. What has particularly caught my attention is the aggressiveness of an error recovery. Of course, there's a degree of subjectivity in determining which recovery are considered "too" aggressive, though.

The following example concerns me especially:

trait Trait {
    type Type;
    fn foo(x: i32);
}

I suppose Rust users would often make this mistake in Fe. However, if the parser interprets type Type and `fn foo' as being defined at the parent module scope, the error message could be quite misleading. This problem could be resolved by simplifying the BNF that the parser refers to as a specification, as you previously mentioned. Simplifying the BNF would likely allow each parsing scope to define more syntax kinds for error recovery. Semantic errors could be addressed by the later analysis phase.

As another problem, regarding unclosed scopes, I'm starting to feel that special handling will be necessary for this case. Essentially, I think it'd be better to (conceptually) give up parsing when the code is in unclosed scopes. This is because guessing the user's original intent becomes particularly challenging in this case, and aggressive error recovery could lead to misleading error messages. This problem might be related to the error refinement idea because it brings another complexity if we try to let the parser determine whether a scope is properly closed while parsing the internal syntax elements of the scope.
This is related to the ADT recursion problem as well. That is, the more we try to componentize the compiler, the more difficult it is to emit better error messages when an error is related to another error emitted from another compiler component.
So, before outputting accumulated error messages, we might need a phase that refines the accumulated errors. For example, if the spans of two errors overlap, only one of the errors is selected to be output(at least, the output of one error should be suppressed if it is encompassed by another). In the parser case, each parsing scope could just accumulate errors without caring about whether the scope is closing or not; then when the parser notices that the scope is not closed, the child node can be wrapped up by the error node. After that, the error refinement phase would suppress the errors that were accumulated inside the unclosed scope.

So, basically, I feel that we probably need to

  1. make BNF simpler
  2. wrap up an entire unclosed scope by an error syntax node
  3. add an error refinement phase

But I'm unsure if this is a proper direction. I'd really appreciate your opinion. NOTE: I do not think these problems should be resolved in this PR either way. This PR has already resolved many other problems and significantly improved the quality of the parser implementation. I just want to deepen our understanding of these issues for the future.

sbillig commented 4 months ago

The following example concerns me especially:

trait Trait {
    type Type;
    fn foo(x: i32);
}

I suppose Rust users would often make this mistake in Fe. However, if the parser interprets type Type and `fn foo' as being defined at the parent module scope, the error message could be quite misleading. This problem could be resolved by simplifying the BNF that the parser refers to as a specification, as you previously mentioned. Simplifying the BNF would likely allow each parsing scope to define more syntax kinds for error recovery. Semantic errors could be addressed by the later analysis phase.

One easy option is to explicitly add parser support for cases that we feel might be common errors, and/or for syntax that we intend to support someday. (Like the fn-def-inside-struct case that you implemented.) In this situation, I think it would be maybe be ideal to recover to the parent scope on the type keyword if it's not indented.

In the parser case, each parsing scope could just accumulate errors without caring about whether the scope is closing or not; then when the parser notices that the scope is not closed, the child node can be wrapped up by the error node. After that, the error refinement phase would suppress the errors that were accumulated inside the unclosed scope.

How does the parser notice that the scope is not closed? The only way it can be certain is if it reaches the end of the file and failed to find a matching closing bracket, which of course means that the scope has consumed the rest of the file. This seems like a pretty poor outcome to me.

The alternative of course is to guess the user's intent, and I think we can do pretty well by considering indentation and looking for keywords that definitely aren't allowed (or likely to be mistakenly used) in a given scope. We could make this nicer by explicitly telling the user when the parser has considered an unclosed scope to be closed. (I'm just repeating myself, because I can't think of any better ideas :)

I think it's actually ok to allow for the possibility of misleading errors (like every compiler I've ever used), in favor of nice behavior most of the time. Perhaps the parser could behave differently in LSP mode vs CLI mode, with the latter being more strict?

Perhaps we could mark nodes that are in a questionably-parsed region (not sure how to define this) as "tainted", and suppress all errors that involve them? (Eg in the previous crappy parser/analyzer, we suppressed "item not found" errors relating to files that didn't parse successfully, assuming that perhaps the missing item wasn't parsed.)