Order of recovery strategies in `recoverInline`?

michaelpj commented 9 years ago

Empirically speaking, I've found that the order in which the recovery strategies are tried in DefaultErrorStrategy.recoverInline is less useful than the reverse order. Single token insertion is less likely to apply, and hence usually gives a better recovery when it does apply. Plus, since it's fairly rare, you still end up using deletion in most of the contexts where it is helpful.

For example, I have a language where lower-cased identifiers are quite common, but separating keywords are fairly rare. Then given a sequence like ... <missing lowerid> keyword lowerid ... both token insertion and deletion are possible, but token insertion is a lot more helpful! It's possible that I just hit this kind of case more often, as I find my users are more likely to enter input that is missing a token than input that has an truly extraneous token (i.e. where we will get a sensical parse without it).

I don't have an argument for this in generality, but I was wondering whether this was something you'd tried?

(Of course, you can just override recoverInline, which is what I've done.)

sharwell commented 9 years ago

I believe the problem with putting single token insertion first is there are cases where you could enter an infinite loop of error recovery. Specifically, if you insert a token and then go to the same position where you were the first time you encountered an error, then the recovery would repeat itself until you ran out of memory.

sharwell commented 9 years ago

For example, the input y fed to the following grammar:

start
  : 'x'+ 'y' EOF
  ;

Edit: actually this would use single-token insertion anyway, since single-token deletion does not apply. :thought_balloon:

michaelpj commented 9 years ago

Indeed, it would use insertion as it stands, and after that it would simply match, meaning no need for recovery. I thought from looking at the code that that was the precondition for doing insertion - the subsequent token has to be acceptable. So you're guaranteed to make at least one token's worth of progress each time it fires.

sharwell commented 9 years ago

So an example where a preference for single-token insertion would be better is the following:

start
  : (Identifier ';')* EOF
  ;

Input:

Foo;
Bar
Foo2;

With the current algorithm, it will drop Foo2. However, if single-token insertion was preferred, it will insert the missing ; and proceed with Foo2; correctly matching.

I'm curious if we can come up with cases where a preference for single-token deletion actually has the "preferred" result.

michaelpj commented 9 years ago

A bit contrived, but I think this is better with deletion. Grammar:

start: 'A' 'B' 'C'
     | 'D' ;

Input:

BD

Insertion adds an A and gets an input mismatch at the D. Deletion removes the B and just matches the D.

On reflection, I'm not sure how confident I am that either will always be strictly preferable. That said, empirically humans seem to leave things out more often than adding extra stuff (in an IDE).

sharwell commented 9 years ago

That said, empirically humans seem to leave things out more often than adding extra stuff (in an IDE).

I tend to agree. It would be interesting to have code that attempts both deletion and insertion, and if only one succeeds uses that one. However, if both succeed it then uses some fixed k symbols of lookahead to see which one makes it "farther" (if k=∞ then you get best results but could be very slow). If both make it to k (or EOF) then choose insertion? Otherwise use the one that made it the farthest.

michaelpj commented 9 years ago

That sounds ideal, actually.

Your value for k probably depends on how context-sensitive your language is. I also think it's more intuitively acceptable for a user if the parser "gets confused" in a complex part of the language. Even k=1 or 2 would be a big improvement for my use cases.

antlr / antlr4

Order of recovery strategies in `recoverInline`? #798