HOL-Theorem-Prover / HOL

Canonical sources for HOL4 theorem-proving system. Branch develop is where “mainline development” occurs; when develop passes our regression tests, master is merged forward to catch up.
https://hol-theorem-prover.org
Other
630 stars 142 forks source link

New and improved HOL parser #1333

Closed digama0 closed 3 days ago

digama0 commented 2 weeks ago

This is effectively a complete rewrite of the QuoteFilter module. The intention is to support additional introspection features on the parsing process, but for this initial PR the main goal is to just replicate the behavior of the original parser. A summary of user-facing features:

There are some deliberate (/ known and pending discussion) parser changes:

A summary of design changes:

mn200 commented 2 weeks ago

I think I disagree about having comments in inner syntax being allowed to affect what’s going on at the "outer" level. Certainly, if one sees nested logical material as glorified forms of string literal, then having this happen would be akin to having the following C code work:

char *c = "foo /* "bar bar aux */";

The other issue, which I think is also present in the existing code, is that we should allow the Quote form to have almost entirely arbitrary material inside it. That then means that the Quote form of special string literal shouldn’t pre-commit to particular notions of what is/isn’t an inner string literal, nor what comments look like etc.

mn200 commented 2 weeks ago

As per discussion on the Discord, I'm super-happy with the handling of caret. (The thinking behind the earlier weirdness is that we could use ^ as a backslash-analogue, thereby giving us a way of putting back-quote (or the Unicode right-quotation marks) into quotations. I'm not quite sure what, if anything might be done for that.)

digama0 commented 2 weeks ago

I think I disagree about having comments in inner syntax being allowed to affect what’s going on at the "outer" level. Certainly, if one sees nested logical material as glorified forms of string literal, then having this happen would be akin to having the following C code work:

Well no, because C code doesn't have comments in strings. And neither does HOL in its string literals. Quotes are not a kind of string literal, they are a full-fledged piece of syntax within which you can refer to HOL objects instead of ML objects. It's really not possible to treat them like string literals because antiquotation lets you fit arbitrary ML syntax inside them. To me, it's more like the following code:

char *c = { foo /* { bar bar aux */ };

(This intuition is much stronger for Theorem, Definition, Quote etc compared to term quotations “foo” because the former look like block commands and the latter looks like a string literal. This is also reflected in the error handling - quotes like are "weak" and when used improperly will cause a parse error and be ignored, while End is a "strong" quote end and when used improperly will forcibly terminate other unfinished things. This kind of error handling is effectively what you are asking for with unclosed comments in a quote, but it would have to be a parse error regardless.)

If there was actually a phase distinction between the inner and outer syntax, this might make sense, but the reason QuoteFilter has to parse inner comments in the first place is because there is no such phase distinction, because inner comments can comment out antiquotations. For Quote, we could certainly turn off comments and antiquotations but I think that's not desirable?

mn200 commented 2 weeks ago

If I write

Theorem name:
   P x /\ (* q x ==> conclusion 
Proof
  tactic
QED

Theorem stmt2:
   Q x /\ *) P 
Proof
  tactic
QED

this is surely an error in 99.9999% of all cases.

I don't really care what rationalisation is required to achieve it.

digama0 commented 2 weeks ago

Doesn't the same reasoning apply to

Theorem name:
   P x /\ ^( q x ==> conclusion 
Proof
  tactic
QED

?

But also, I'm not sure I agree with that in the sense that one not unreasonable thing to do with comments is block out an entire declaration, and I feel like having "strong terminators" breaks the ability to do nested comments and the like.

mn200 commented 1 week ago

Doesn't the same reasoning apply to

Theorem name:
   P x /\ ^( q x ==> conclusion 
Proof
  tactic
QED

?

But also, I'm not sure I agree with that in the sense that one not unreasonable thing to do with comments is block out an entire declaration, and I feel like having "strong terminators" breaks the ability to do nested comments and the like.

Your example should clearly be an error also, yes.

Nested comments don't require the ability to span levels with comments. Viz:

(* 
Theorem foo:
   p (* /\ q *) ==> r
Proof
  tactic >> (* badtactic (* really_badtac *) >> might_work_tac *)
  cheat
   (* give up and don't even bother with a QED *)
*)

I find it hard to imagine a scenario where a user would really want to escape the nesting structure of the text with their comments. It also forces us to have the same comment syntax at the HOL level as at the SML level, whereas now it’s just a “happy accident”. Fundamentally, the language of HOL is completely separate from that of SML (it's in the name: "meta-language") and comments should respect that. (Anti-quotation is a cute and useful way to inject from meta to object; having comments in the object syntax affect the meta is gross.)

digama0 commented 1 week ago

Note that the issue I hinted at already occurs in the wild: I had to add some additional hacks to support this definition

Definition RN_deriv_def : (* or `v / m` (dv/dm) *)
    RN_deriv v m =
      @f. f IN measurable (m_space m,measurable_sets m) Borel /\
          (!x. x IN m_space m ==> 0 <= f x) /\
          !s. s IN measurable_sets m ==> ((f * m) s = v s)
End

because the backquote inside a HOL comment was interpreted as ending the quote. This is the kind of thing I'm concerned about.

mn200 commented 1 week ago

Here's my latest thinking on what I want. We define an obvious grammar for *Script.sml files, where we extend SML with new top-level declaration forms with rules like

<decl> ::= "^Theorem" {ident} <attributes>? ":" {HOLmaterial} "^Proof" <tactic> "^QED" | ...

We also have

<expr> ::= ...
               | “{HOLmaterial}”
               | ‘{HOLmaterial}’
               | ``{HOLmaterial}``
               | ...

here {ident} and {HOLmaterial} are lexical forms, definable with regular expressions. The angle brackets denote non-terminals (and <tactic> is probably just SML ).

The regular expression defining {HOLmaterial} allows anything except occurrences of "quote-keywords", where quote-keywords are "^Theorem", "^Proof" ... plus the ASCII backquotes and the Unicode quotes.

Now in the absence of anti-quotations this is all we need and is capable of finding and dealing with all static SML issues. In the presence of anti-quoted SML code in the {HOLmaterial} we can get static SML errors and of course, the translation of this file into SML text has to also know about/handle the antiquotes. So after parsing according to the above, the HOLmaterial has to be separately lexed in a simple-minded way one more time. This is scanning for ^ident (and maybe ^(...)), which task has to know about HOL's notion of string literal and comment.

This second lexing process doesn't need to inform the HOL parsing stack, but might flag errors (malformed string literals, unterminated comments). Equally, those errors can be left to be caught by the HOL parser.

mn200 commented 3 days ago

Thanks for your work on this!