New and improved HOL parser

digama0 commented 2 weeks ago

This is effectively a complete rewrite of the QuoteFilter module. The intention is to support additional introspection features on the parsing process, but for this initial PR the main goal is to just replicate the behavior of the original parser. A summary of user-facing features:

QuoteFilter is replaced by HolLex and QFRead by HolParser. The interface of QuoteFilter -> HolLex has changed somewhat, but HolParser is basically a drop in replacement for QFRead.
Most functions have been switched to only use the new parser, but the unquote command supports the old parser using unquote -q. In particular, the new parser does not (currently?) support the --quotefix option so unquote will use the old parser for that.
The unquote function also makes use of the new parser's ability to report (non-fatal! with recovery!) parse errors, so you can run this command over all your .sml files to spot any invalid inputs that were working by accident before. Two examples of this were found in this repository:
- borelScript.sml has some cases of mismatched quotes i.e. ‘term`, `term’
- ttt_demoScript.sml has an unmatched parenthesis, so presumably this file is not being checked?
- There are some more parse errors in CakeML, I will make a separate PR for that (https://github.com/CakeML/cakeml/pull/1085)

There are some deliberate (/ known and pending discussion) parser changes:

Comments in quotations are no longer auto-terminated by the next close quote or End, and will in fact comment out the end quote. I think this makes more sense in a world where HOL is one unified language with comments allowed anywhere with no special status given to End. removed
The escape-like behavior of ^^ has been removed. ^x and ^(expr) start antiquotations inside a quote, anything else involving ^ is simply embedded as regular text in the quotation. So ^^ is copied verbatim into the underlying string.

A summary of design changes:

Rather than having QuoteFilter which does parsing at the same time as stringification into SML, the two steps have been separated. HolLex still does the lexing and parsing, but it targets an AST structure instead of producing strings. As a result it is significantly simpler, needing to carry less state around. It makes a lot of use of continue() to be able to parse commands to completion, which avoids needing one giant state machine as in the previous implementation. It is still a streaming parser at the top level.
The rest of the implementation is in HolParser, which contains a streaming implementation of the stringification to SML code. HolLex no longer tracks line numbers, so HolParser.ToSML does that. (Part of the motivation for this is because quotefix was previously implemented using if statements around everything, and it will be easier to implement here as a separate pass which does a different kind of stringification.)
There are currently three possible entry points to the parser:
- Using HolParser.Simple, you get a stream of AST TopDecl elements. These are all marked up with positions in the input, so you need the original source on hand to cross reference with it.
- Using HolParser.ToSML, you get a stream of marked text regions:
- regular is for input text copied verbatim
- aux is for interstitial raw text to call internal functions like Q.prove
- strcode is for input text which should be escaped to fit in an ML string
- strstr is for input text which is already in an ML string context, but needs escaping on raw unicode characters
- Using HolParser.ToSML in combination with mkStrcode which just encodes the four kinds of text above as appropriate, you get just a plain string out. All of the original entry points are using this now.
One additional semi-reusable component here is mkDoubleReader, which gives a way to have a streaming reader which is followed by a second reader (which can be conceptualized as doing "skip to index i, read to j >= i, skip to k >= j, ...") which outputs results yielded by the first reader, without buffering the entire contents. This is possibly overkill in this situation, and it does lead to some limitations (the marked-up text using regular etc has to come in the same order as in the original source), but at least you can read gigabyte files this way.

mn200 commented 2 weeks ago

I think I disagree about having comments in inner syntax being allowed to affect what’s going on at the "outer" level. Certainly, if one sees nested logical material as glorified forms of string literal, then having this happen would be akin to having the following C code work:

char *c = "foo /* "bar bar aux */";

The other issue, which I think is also present in the existing code, is that we should allow the Quote form to have almost entirely arbitrary material inside it. That then means that the Quote form of special string literal shouldn’t pre-commit to particular notions of what is/isn’t an inner string literal, nor what comments look like etc.

mn200 commented 2 weeks ago

As per discussion on the Discord, I'm super-happy with the handling of caret. (The thinking behind the earlier weirdness is that we could use ^ as a backslash-analogue, thereby giving us a way of putting back-quote (or the Unicode right-quotation marks) into quotations. I'm not quite sure what, if anything might be done for that.)

digama0 commented 2 weeks ago

I think I disagree about having comments in inner syntax being allowed to affect what’s going on at the "outer" level. Certainly, if one sees nested logical material as glorified forms of string literal, then having this happen would be akin to having the following C code work:

Well no, because C code doesn't have comments in strings. And neither does HOL in its string literals. Quotes are not a kind of string literal, they are a full-fledged piece of syntax within which you can refer to HOL objects instead of ML objects. It's really not possible to treat them like string literals because antiquotation lets you fit arbitrary ML syntax inside them. To me, it's more like the following code:

char *c = { foo /* { bar bar aux */ };

(This intuition is much stronger for Theorem, Definition, Quote etc compared to term quotations “foo” because the former look like block commands and the latter looks like a string literal. This is also reflected in the error handling - quotes like ” are "weak" and when used improperly will cause a parse error and be ignored, while End is a "strong" quote end and when used improperly will forcibly terminate other unfinished things. This kind of error handling is effectively what you are asking for with unclosed comments in a quote, but it would have to be a parse error regardless.)

If there was actually a phase distinction between the inner and outer syntax, this might make sense, but the reason QuoteFilter has to parse inner comments in the first place is because there is no such phase distinction, because inner comments can comment out antiquotations. For Quote, we could certainly turn off comments and antiquotations but I think that's not desirable?

mn200 commented 2 weeks ago

If I write

Theorem name:
   P x /\ (* q x ==> conclusion 
Proof
  tactic
QED

Theorem stmt2:
   Q x /\ *) P 
Proof
  tactic
QED

this is surely an error in 99.9999% of all cases.

I don't really care what rationalisation is required to achieve it.

digama0 commented 2 weeks ago

Doesn't the same reasoning apply to

Theorem name:
   P x /\ ^( q x ==> conclusion 
Proof
  tactic
QED

?

But also, I'm not sure I agree with that in the sense that one not unreasonable thing to do with comments is block out an entire declaration, and I feel like having "strong terminators" breaks the ability to do nested comments and the like.

mn200 commented 1 week ago

Doesn't the same reasoning apply to
Theorem name:
   P x /\ ^( q x ==> conclusion 
Proof
  tactic
QED
?

But also, I'm not sure I agree with that in the sense that one not unreasonable thing to do with comments is block out an entire declaration, and I feel like having "strong terminators" breaks the ability to do nested comments and the like.

Your example should clearly be an error also, yes.

Nested comments don't require the ability to span levels with comments. Viz:

(* 
Theorem foo:
   p (* /\ q *) ==> r
Proof
  tactic >> (* badtactic (* really_badtac *) >> might_work_tac *)
  cheat
   (* give up and don't even bother with a QED *)
*)

I find it hard to imagine a scenario where a user would really want to escape the nesting structure of the text with their comments. It also forces us to have the same comment syntax at the HOL level as at the SML level, whereas now it’s just a “happy accident”. Fundamentally, the language of HOL is completely separate from that of SML (it's in the name: "meta-language") and comments should respect that. (Anti-quotation is a cute and useful way to inject from meta to object; having comments in the object syntax affect the meta is gross.)

digama0 commented 1 week ago

Note that the issue I hinted at already occurs in the wild: I had to add some additional hacks to support this definition

Definition RN_deriv_def : (* or `v / m` (dv/dm) *)
    RN_deriv v m =
      @f. f IN measurable (m_space m,measurable_sets m) Borel /\
          (!x. x IN m_space m ==> 0 <= f x) /\
          !s. s IN measurable_sets m ==> ((f * m) s = v s)
End

because the backquote inside a HOL comment was interpreted as ending the quote. This is the kind of thing I'm concerned about.

mn200 commented 1 week ago

Here's my latest thinking on what I want. We define an obvious grammar for *Script.sml files, where we extend SML with new top-level declaration forms with rules like

<decl> ::= "^Theorem" {ident} <attributes>? ":" {HOLmaterial} "^Proof" <tactic> "^QED" | ...

We also have

<expr> ::= ...
               | “{HOLmaterial}”
               | ‘{HOLmaterial}’
               | ``{HOLmaterial}``
               | ...

here {ident} and {HOLmaterial} are lexical forms, definable with regular expressions. The angle brackets denote non-terminals (and <tactic> is probably just SML ).

The regular expression defining {HOLmaterial} allows anything except occurrences of "quote-keywords", where quote-keywords are "^Theorem", "^Proof" ... plus the ASCII backquotes and the Unicode quotes.

Now in the absence of anti-quotations this is all we need and is capable of finding and dealing with all static SML issues. In the presence of anti-quoted SML code in the {HOLmaterial} we can get static SML errors and of course, the translation of this file into SML text has to also know about/handle the antiquotes. So after parsing according to the above, the HOLmaterial has to be separately lexed in a simple-minded way one more time. This is scanning for ^ident (and maybe ^(...)), which task has to know about HOL's notion of string literal and comment.

This second lexing process doesn't need to inform the HOL parsing stack, but might flag errors (malformed string literals, unterminated comments). Equally, those errors can be left to be caught by the HOL parser.

mn200 commented 3 days ago

Thanks for your work on this!

HOL-Theorem-Prover / HOL

New and improved HOL parser #1333