Closed gluax closed 3 years ago
@acoglio @bendyarm opened an issue for discussion we had in slack the other day.
An early version of the grammar had format strings spelled out:
https://github.com/AleoHQ/leo/blob/e30486669660fafb8773428dac3834b5bee77351/grammar/abnf-grammar.txt#L467
However, that version allowed neither the escapes that we now allow in string literals, nor the ability to write an actual {}
.
I think the following could work:
not-double-quote-or-backslash-or-brace = %x0-21
/ %x23-5B
/ %x5D-7A
/ %x7C
/ %7E-10FFFF
; anything but " or \ or { or }
format-string-element-not-brace = not-double-quote-or-backslash-or-brace
/ simple-character-escape
/ ascii-character-escape
/ unicode-character-escape
format-string-container = "{}"
format-string-open-brace = "{{"
format-string-close-brace = "}}"
format-string-element = format-string-element-not-brace
/ format-string-container
/ format-string-open-brace
/ format-string-close-brace
format-string = double-quote *format-string-element double-quote
; the rule for tokens must be then extended:
token = keyword
/ identifier
/ atomic-literal
/ package-name
/ annotation-name
/ symbol
/ format-string ; added
Observations about the above rules:
format-string-element-not-brace
is like string-literal-element
, but it excludes {
and }
.{}
. It cannot use backslash escapes, e.g. \x7b\x7d
prints the same as {{}}
.{{
. It cannot use backslach escapes, e.g. \x7b\x7b
prints the same as {{{{
.}}
. It cannot use backslach escapes, e.g. \x7d\x7d
prints the same as }}}}
.{
is illegal, which is consistent with Rust (I checked). But for example an isolated \x7b
is legal and prints the same as {{
.}
is illegal, which is consistent with Rust (I checked). But for example an isolated \x7d
is legal and prints the same as }}
.format-string
has to be a separate token from string-literal
.If we wanted to treat backslash escapes for open and close braces in the same way as actual {
and }
, e.g. if we want \x7b\x7d
to be a container, it becomes impractical to define a format-string-element-not-brace
that excludes backslash escapes for open and close braces: ABNF ranges and alternations work well to exclude single characters (e.g. see not-double-quote-or-backslash-or-brace
), but not to exclude contiguous chunks of characters. So we would have to resort to an extra-grammatical requirement expressed as a comment. There seems to be little practical reason to use backslack escapes in what can be more simply written as {}
and {{
and }}
.
If we go with the different treatment of {
and }
vs. their backslash escape counterparts as the above rules imply, then it means that we cannot quite lex a format string as a string literal and then immediately convert it to the format string data structure, because that lexing would lose the distinction between {
and \x7b
for instance. So format strings would have to be lexed their own way, which is actually reasonable given that they are a different kind of token as manifested in the rule.
@acoglio Hmmm, it might be difficult to restrict non-escaped characters from valid format-strings
. Would have to write separate character parsing rules entirely, but that seems like unnecessary work. Especially in the case of something like \x7b
or \x7d
we treat the escaped as the regular characters {
. }
past parsing and have no notion of it being an escaped character. Plus users will definitely want to make use of escaped characters including \"
.
Personally it's more natural to allow any character but a sequential {}
except via some sort of escape. Since we already parse the first argument as regular leo-ast string
, then convert it to a leo format-string
. We'd have to write another parser to exclude certain characters specifically in the first argument to console functions.
@bendyarm, I actually like that point. We could refactor the parser not to transform it to a FormattedString
right away. We can just change the argument in the AST and ASG to a string
(inline array in ASG's case). There may be some trickiness around that since AST has a notion of a string
type before canonicalization, so the console AST node would take a string
. So during canonicalization the argument would be converted to an inline array, but the node wouldn't support that. Solved easily by an enum, but might add some complexity to the AST for console
statements. Then we take the current FormattedString
logic and move it to the compiler as just logic to print rather than converting(speeding up the compiler possibly).
Sorry, I deleted the point because it had some stuff I needed to investigate more
For reference, here is what Python handles in the containers:
https://docs.python.org/3/library/string.html#format-string-syntax
and then
https://docs.python.org/3/library/string.html#format-specification-mini-language
Some reasons to keep \x7b
and \x7d
as immediate lexical
replacements for {
and }
in a string so that "{}"
in a
console.log string could be gotten with "\x7b\x7d"
(some of which are speculative):
\xHH
and \u{HHHH}
do
(except for the non-scalar-value codes)"{}"
{
and }
are hard to use"{}"
are too similar to "()"
so you want to make them look more different"{}"
as another sort
of control so to pass through {}
you need to use \x7b\x7d
"{}"
for whatever reason but you need
to have a containerAh, all good, I like the direction you were heading in non-the-less! TLDR: It means keeping the console
arguments as a string-literal
according to the grammar all the way through. The for printing we parse the string
, like how we do for the current FormattedString
. This keeps the grammar the way it is(at some possible minor complication to the AST), but is also more efficient most likely.
I like the idea of handling the format string mini language in the parser, but @acoglio is right, abnf makes it infeasible to define not-open-brace. There are too many synonyms of open-brace:
open-brace = "{" / "\x7b" / "\x7B" / "\u{00007b} / ... 9 more
So the main decision is still: Are the \x and \u escapes for ASCII chars always handled the same as the ASCII chars are? (except for \x22 (") and \x5c (), which have to be handled differently)
If not, it looks like we can handle formatting strings in the grammar as @acoglio described. If so, we need to handle them later.
If we handle them later, we need to decide when.
(1) handle them right after parsing, possibly by another simpler abnf grammar that parses
the string but does not support escapes including \x or \u.
(2) in a canonicalization stage, although it isn't really a canonicalization
(3) If there will be a string append operator that can be constant-folded, maybe after constant folding.
(4..) or somewhere else that is convenient.
@bendyarm, they are handled the same.
Though I personally like the rust style of handling this situation, of which we are similar too. Right now, you can't pass a variable as the first argument to a console
function. It must be a string-literal. This means we could the approach I previously described.
@gluax I know they are currently handled the same. I meant that the main decision whether we should keep that behavior or change it. Rewording the above to be more clear what I meant:
So the main decision is still:
Should the \x and \u escapes for ASCII chars be always handled the same as the ASCII chars are?
(except for \x22
("
) and \x5c
(\
), which have to be handled differently)
This is a different specification issue than the question whether console functions require the first argument to be a literal string or whether they allow. I think literal string is fine. It simplifies a lot of things.
@bendyarm ah got you. I'm more of the opinion of keeping it the same personally. Why add more complexity to the AST for no reason by breaking down Scalar chars into regular, escaped, hex, unicode. It would be more consistent with other languages as well as we treat them as just a single scalar char type rather than the breakdown.
I'd be happy to leave the ABNF grammar as is, in particular with the rule
print-arguments = "(" string-literal *( "," expression ) ")"
and have the compiler validate that string in a separate step. In fact, I thought that was the conclusion of our discussion on the topic on Slack a couple of days ago, but then I saw @gluax opening this issue and I thought (and wrote) how we could use the ABNF grammar to describe format strings, mainly to explore the tradeoffs.
I agree that it would be nice to treat \x7b\x7d
the same as {}
(and similarly for others). The problem is not so much that there are many ways to represent {
or }
with escapes (fewer than 9 because we can use ABNF's case-insensitive strings to cover both uppercase and lowercase hex digits), but that it's very hard (actually, infeasible) to represent the complement of those (i.e. 'everything except for'), as I wrote. ABNF's numeric ranges can represent well complements of single characters (see the rule not-double-quote-or-backslash-or-brace
above), but those don't work when, besides excluding single character codes, we need to exclude escape chunks.
We could still use the ABNF grammar to capture some of the requirements on format strings, with comments to describe extra-grammatical requirements, as we do for identifiers for example:
identifier = letter *( letter / digit / "_" ) ; but not a keyword or aleo1...
Just exploring the space of possibilities.
The only issue is that, if we use string-literal
for format strings, they cannot be empty according to the grammar.
@acoglio ah yea I figured it better to open an issue for visibility and have a longer discussion on the tradeoffs.
If we want to go the string-literal
route for the empty string, we could add an or for an empty string in the abnf. Alternatively, I can have the compiler say it shouldn't expect an empty string. Though I think the former makes more sense to do.
I think that opening the issue was a good idea. I'm not complaining about that :)
We could relax the string-literal
rule
string-literal = double-quote 1*string-literal-element double-quote
to use *
instead of 1*
, and defer the check for non-emptiness to later in the compiler, but only for non-format-strings.
@acoglio, that also works, so I wouldn't error out at the parser level if a string is empty? But maybe later at canonicalization? Or would it stay as it is now? Technically the parser currently errors out for let s = "";
but not for console.log("");
. I'm actually not sure why it doesn't error out for the latter.
Yes, we could accept the empty string literal at parsing time, and then:
FormattedString
.But I'm not sure if checks 1 and 2 above should be done at canonicalization time or not:
Nonetheless, for practical purposes, we could just do check 1 as part of canonicalization, including the transformation to FormattedString
.
To be clear, currently the AST for console statements expects a FormattedString
?
@acoglio I was planning to do check 1 past the asg and other phases during the actual compiler phase where we log statements. It would be more efficient. Thus removing FormattedString from the AST and ASG. Also to answer your question yes, AST expects a FormattedString.
Check 2 sounds like either way, it would wind up in canonicalization because empty array checks happen there as well.
Yes, your idea of checking it at actual log time sounds good. This way we keep the AST and ASG simpler, as you say.
🐛 Bug Report
Currently, in the ABNF console, statements use
string-literal
. While true, the string-literal is then parsed to aformat-string
, and at the moment, we do not allow{}
to be printed.So two bugs here:
{}
. I think something like{{}}
is a good one.The hardest part IMO is representing a string that can allow all characters except "{}" but still allow something like "{{}}".