digama0 / mm0

Metamath Zero specification language
Creative Commons Zero v1.0 Universal
315 stars 40 forks source link

prevent error cascades via stmt_recover #30

Closed ammkrn closed 4 years ago

ammkrn commented 4 years ago

I don't expect this to be pulled, just food for thought.

1. Right now, the parser only tries to recover from an error after finding a closing semicolon delimiter, but do blocks don't contain any semicolons. If you edit the space above a do-block without the first keystroke being a semicolon (eg. by trying to add a new axiom right under peano.mm1/ax_mp), the parser ignores the whole semicolon-free do block, which then invalidates the declarations needing the stuff in the block, and you end up with the 1k+ errors wall of error messages complete with dying editor and spinning fan sounds. The change here side-steps it by initially skipping past a bad parse by one word and its trailing whitespace, then allowing it to try and recover when it finds the start of a do block. You also get this kind of error cascade if you start editing above a frequently used theorem, like if you try to add a new statement right above ax_mp since ax_mp gets skipped, and everything that used it errors out, and so on. The comment under the do block check shows one way to alleviate this, which would be to let the parser try to recover when it encounters keywords that are likely to start a line (modifiers, axiom, theorem, etc).

2. Not sure whether this is by design, but It doesn't seem like keywords are shielded from use as ident items; the parser currently accepts :

axiom axiom (ph ps: wff): $ ph -> ps $;

One consequence of this is that easy solutions to (1) produce less than ideal diagnostic messages. I'm not sure how often this would come up in practice, but an example of something that the parser would have a much easier time of analyzing if keywords were reserved is:

axiom

do {
  (def x 10)
};

The reported error is "expected }" since the control flow indicates that it's expecting an axiom called 'do' that starts with a list of bound variables, when it's actually an incomplete axiom followed by a do block. If the parser function for ident knew that do was a protected item, you could return return a really detailed message like "use of reserved keyword 'do' as identifier, or incomplete axiom declaration at (span)'.

digama0 commented 4 years ago

(2) is indeed deliberate, and it is flagged in mm0.md#pseudo-keywords. (I know this is an unusual choice, but the idea is to avoid details of the mm0 grammar getting in the way of identifier choices especially in autogenerated files translated from other systems.) The semicolon recovery approach was chosen because it is fast and predictable, but you are right that it has undesirable results when you aren't careful to put a semicolon on the statement you are writing right away.

One option that might work is to restart parsing at the first keyword following a bad parse. So with axiom axiom (foo bar do { }; it would first try to parse the whole thing, then axiom (foo bar do { };, then do { }; and succeed (and it would follow up with a parser error message appropriate for axiom axiom (foo bar, without consuming tokens from the do block). The downside of this is that it involves a decent amount of backtracking to re-parse the same tokens repeatedly when you have something with a lot of keywords like axiom axiom theorem do <garbage>;. The fact that there might be false positives if there are words in the proof that look like keywords isn't a very big deal because it's a heuristic anyway, there will be a parse error regardless.

ammkrn commented 4 years ago

(2) is indeed deliberate, and it is flagged in mm0.md#pseudo-keywords.

Thanks, I forgot about all of the interop/translation stuff. In the context of machine generated code from other systems that makes a lot of sense. The PR is pretty close to what you suggested (great minds lol), so I'm on board with that.

A way to cut down on the amount of backtracking would be to have the functions that parse idents set an Option<usize> in the Parser struct so that it points to the first ident that clashes with a keyword. If the whole stmt ends up being a parse failure, you know either to jump straight to the first keyword/ident clash, or that you can just fast-forward to a semicolon if there wasn't one. The check can be skipped if the field is already Some(..), and it would just reset after every loop iteration. I'll test it out first, but let me know if you foresee any problems with that.