Closed digama0 closed 3 days ago
I think I disagree about having comments in inner syntax being allowed to affect what’s going on at the "outer" level. Certainly, if one sees nested logical material as glorified forms of string literal, then having this happen would be akin to having the following C code work:
char *c = "foo /* "bar bar aux */";
The other issue, which I think is also present in the existing code, is that we should allow the Quote
form to have almost entirely arbitrary material inside it. That then means that the Quote
form of special string literal shouldn’t pre-commit to particular notions of what is/isn’t an inner string literal, nor what comments look like etc.
As per discussion on the Discord, I'm super-happy with the handling of caret. (The thinking behind the earlier weirdness is that we could use ^
as a backslash-analogue, thereby giving us a way of putting back-quote (or the Unicode right-quotation marks) into quotations. I'm not quite sure what, if anything might be done for that.)
I think I disagree about having comments in inner syntax being allowed to affect what’s going on at the "outer" level. Certainly, if one sees nested logical material as glorified forms of string literal, then having this happen would be akin to having the following C code work:
Well no, because C code doesn't have comments in strings. And neither does HOL in its string literals. Quotes are not a kind of string literal, they are a full-fledged piece of syntax within which you can refer to HOL objects instead of ML objects. It's really not possible to treat them like string literals because antiquotation lets you fit arbitrary ML syntax inside them. To me, it's more like the following code:
char *c = { foo /* { bar bar aux */ };
(This intuition is much stronger for Theorem
, Definition
, Quote
etc compared to term quotations “foo”
because the former look like block commands and the latter looks like a string literal. This is also reflected in the error handling - quotes like ”
are "weak" and when used improperly will cause a parse error and be ignored, while End
is a "strong" quote end and when used improperly will forcibly terminate other unfinished things. This kind of error handling is effectively what you are asking for with unclosed comments in a quote, but it would have to be a parse error regardless.)
If there was actually a phase distinction between the inner and outer syntax, this might make sense, but the reason QuoteFilter has to parse inner comments in the first place is because there is no such phase distinction, because inner comments can comment out antiquotations. For Quote
, we could certainly turn off comments and antiquotations but I think that's not desirable?
If I write
Theorem name:
P x /\ (* q x ==> conclusion
Proof
tactic
QED
Theorem stmt2:
Q x /\ *) P
Proof
tactic
QED
this is surely an error in 99.9999% of all cases.
I don't really care what rationalisation is required to achieve it.
Doesn't the same reasoning apply to
Theorem name:
P x /\ ^( q x ==> conclusion
Proof
tactic
QED
?
But also, I'm not sure I agree with that in the sense that one not unreasonable thing to do with comments is block out an entire declaration, and I feel like having "strong terminators" breaks the ability to do nested comments and the like.
Doesn't the same reasoning apply to
Theorem name: P x /\ ^( q x ==> conclusion Proof tactic QED
?
But also, I'm not sure I agree with that in the sense that one not unreasonable thing to do with comments is block out an entire declaration, and I feel like having "strong terminators" breaks the ability to do nested comments and the like.
Your example should clearly be an error also, yes.
Nested comments don't require the ability to span levels with comments. Viz:
(*
Theorem foo:
p (* /\ q *) ==> r
Proof
tactic >> (* badtactic (* really_badtac *) >> might_work_tac *)
cheat
(* give up and don't even bother with a QED *)
*)
I find it hard to imagine a scenario where a user would really want to escape the nesting structure of the text with their comments. It also forces us to have the same comment syntax at the HOL level as at the SML level, whereas now it’s just a “happy accident”. Fundamentally, the language of HOL is completely separate from that of SML (it's in the name: "meta-language") and comments should respect that. (Anti-quotation is a cute and useful way to inject from meta to object; having comments in the object syntax affect the meta is gross.)
Note that the issue I hinted at already occurs in the wild: I had to add some additional hacks to support this definition
Definition RN_deriv_def : (* or `v / m` (dv/dm) *)
RN_deriv v m =
@f. f IN measurable (m_space m,measurable_sets m) Borel /\
(!x. x IN m_space m ==> 0 <= f x) /\
!s. s IN measurable_sets m ==> ((f * m) s = v s)
End
because the backquote inside a HOL comment was interpreted as ending the quote. This is the kind of thing I'm concerned about.
Here's my latest thinking on what I want. We define an obvious grammar for *Script.sml files, where we extend SML with new top-level declaration forms with rules like
<decl> ::= "^Theorem" {ident} <attributes>? ":" {HOLmaterial} "^Proof" <tactic> "^QED" | ...
We also have
<expr> ::= ...
| “{HOLmaterial}”
| ‘{HOLmaterial}’
| ``{HOLmaterial}``
| ...
here {ident} and {HOLmaterial} are lexical forms, definable with regular expressions. The angle brackets denote non-terminals (and <tactic>
is probably just SML
The regular expression defining {HOLmaterial} allows anything except occurrences of "quote-keywords", where quote-keywords are "^Theorem", "^Proof" ... plus the ASCII backquotes and the Unicode quotes.
Now in the absence of anti-quotations this is all we need and is capable of finding and dealing with all static SML issues. In the presence of anti-quoted SML code in the {HOLmaterial} we can get static SML errors and of course, the translation of this file into SML text has to also know about/handle the antiquotes. So after parsing according to the above, the HOLmaterial has to be separately lexed in a simple-minded way one more time. This is scanning for ^ident
(and maybe ^(...)
), which task has to know about HOL's notion of string literal and comment.
This second lexing process doesn't need to inform the HOL parsing stack, but might flag errors (malformed string literals, unterminated comments). Equally, those errors can be left to be caught by the HOL parser.
Thanks for your work on this!
This is effectively a complete rewrite of the
QuoteFilter
module. The intention is to support additional introspection features on the parsing process, but for this initial PR the main goal is to just replicate the behavior of the original parser. A summary of user-facing features:QuoteFilter
is replaced byHolLex
andQFRead
byHolParser
. The interface ofQuoteFilter
->HolLex
has changed somewhat, butHolParser
is basically a drop in replacement forQFRead
.unquote
command supports the old parser usingunquote -q
. In particular, the new parser does not (currently?) support the--quotefix
option sounquote
will use the old parser for that.unquote
function also makes use of the new parser's ability to report (non-fatal! with recovery!) parse errors, so you can run this command over all your.sml
files to spot any invalid inputs that were working by accident before. Two examples of this were found in this repository:borelScript.sml
has some cases of mismatched quotes i.e.‘term`, `term’
ttt_demoScript.sml
has an unmatched parenthesis, so presumably this file is not being checked?There are some deliberate (/ known and pending discussion) parser changes:
Comments in quotations are no longer auto-terminated by the next close quote orremovedEnd
, and will in fact comment out the end quote. I think this makes more sense in a world where HOL is one unified language with comments allowed anywhere with no special status given toEnd
.^^
has been removed.^x
and^(expr)
start antiquotations inside a quote, anything else involving^
is simply embedded as regular text in the quotation. So^^
is copied verbatim into the underlying string.A summary of design changes:
QuoteFilter
which does parsing at the same time as stringification into SML, the two steps have been separated.HolLex
still does the lexing and parsing, but it targets an AST structure instead of producing strings. As a result it is significantly simpler, needing to carry less state around. It makes a lot of use ofcontinue()
to be able to parse commands to completion, which avoids needing one giant state machine as in the previous implementation. It is still a streaming parser at the top level.HolParser
, which contains a streaming implementation of the stringification to SML code.HolLex
no longer tracks line numbers, soHolParser.ToSML
does that. (Part of the motivation for this is becausequotefix
was previously implemented using if statements around everything, and it will be easier to implement here as a separate pass which does a different kind of stringification.)HolParser.Simple
, you get a stream of ASTTopDecl
elements. These are all marked up with positions in the input, so you need the original source on hand to cross reference with it.HolParser.ToSML
, you get a stream of marked text regions:regular
is for input text copied verbatimaux
is for interstitial raw text to call internal functions likeQ.prove
strcode
is for input text which should be escaped to fit in an ML stringstrstr
is for input text which is already in an ML string context, but needs escaping on raw unicode charactersHolParser.ToSML
in combination withmkStrcode
which just encodes the four kinds of text above as appropriate, you get just a plain string out. All of the original entry points are using this now.mkDoubleReader
, which gives a way to have a streaming reader which is followed by a second reader (which can be conceptualized as doing "skip to index i, read to j >= i, skip to k >= j, ...") which outputs results yielded by the first reader, without buffering the entire contents. This is possibly overkill in this situation, and it does lead to some limitations (the marked-up text usingregular
etc has to come in the same order as in the original source), but at least you can read gigabyte files this way.