[Bug] Update ABNF to match Format Strings

gluax commented 3 years ago

🐛 Bug Report

Currently, in the ABNF console, statements use string-literal. While true, the string-literal is then parsed to a format-string, and at the moment, we do not allow {} to be printed.

So two bugs here:

We should make an escape sequence so we can allow the printing of {}. I think something like {{}} is a good one.
We should represent format strings in the ABNF. Possibly something like the following:
```
container = "{}"
; any characters but "{}" except when they are nested as "{{}}".
not-open-close-brace = ...
format-string-part = not-open-close-brace / container
format-string = 1*format-string-part
```
The hardest part IMO is representing a string that can allow all characters except "{}" but still allow something like "{{}}".

gluax commented 3 years ago

@acoglio @bendyarm opened an issue for discussion we had in slack the other day.

acoglio commented 3 years ago

An early version of the grammar had format strings spelled out: https://github.com/AleoHQ/leo/blob/e30486669660fafb8773428dac3834b5bee77351/grammar/abnf-grammar.txt#L467 However, that version allowed neither the escapes that we now allow in string literals, nor the ability to write an actual {}.

I think the following could work:

not-double-quote-or-backslash-or-brace = %x0-21
                                       / %x23-5B
                                       / %x5D-7A
                                       / %x7C
                                       / %7E-10FFFF
                                       ; anything but " or \ or { or }

format-string-element-not-brace = not-double-quote-or-backslash-or-brace
                                 / simple-character-escape
                                 / ascii-character-escape
                                 / unicode-character-escape

format-string-container = "{}"

format-string-open-brace = "{{"

format-string-close-brace = "}}"

format-string-element = format-string-element-not-brace
                      / format-string-container
                      / format-string-open-brace
                      / format-string-close-brace

format-string = double-quote *format-string-element double-quote

; the rule for tokens must be then extended:
token = keyword
      / identifier
      / atomic-literal
      / package-name
      / annotation-name
      / symbol
      / format-string ; added

Observations about the above rules:

format-string-element-not-brace is like string-literal-element, but it excludes { and }.
A container must be written exactly as {}. It cannot use backslash escapes, e.g. \x7b\x7d prints the same as {{}}.
A non-container open brace must be written exactly as {{. It cannot use backslach escapes, e.g. \x7b\x7b prints the same as {{{{.
A non-container close brace must be written exactly as }}. It cannot use backslach escapes, e.g. \x7d\x7d prints the same as }}}}.
An isolated { is illegal, which is consistent with Rust (I checked). But for example an isolated \x7b is legal and prints the same as {{.
An isolated } is illegal, which is consistent with Rust (I checked). But for example an isolated \x7d is legal and prints the same as }}.
Format strings may be empty, because they are not represented as arrays of characters.
A format-string has to be a separate token from string-literal.

If we wanted to treat backslash escapes for open and close braces in the same way as actual { and }, e.g. if we want \x7b\x7d to be a container, it becomes impractical to define a format-string-element-not-brace that excludes backslash escapes for open and close braces: ABNF ranges and alternations work well to exclude single characters (e.g. see not-double-quote-or-backslash-or-brace), but not to exclude contiguous chunks of characters. So we would have to resort to an extra-grammatical requirement expressed as a comment. There seems to be little practical reason to use backslack escapes in what can be more simply written as {} and {{ and }}.

If we go with the different treatment of { and } vs. their backslash escape counterparts as the above rules imply, then it means that we cannot quite lex a format string as a string literal and then immediately convert it to the format string data structure, because that lexing would lose the distinction between { and \x7b for instance. So format strings would have to be lexed their own way, which is actually reasonable given that they are a different kind of token as manifested in the rule.

gluax commented 3 years ago

@acoglio Hmmm, it might be difficult to restrict non-escaped characters from valid format-strings. Would have to write separate character parsing rules entirely, but that seems like unnecessary work. Especially in the case of something like \x7b or \x7d we treat the escaped as the regular characters {. } past parsing and have no notion of it being an escaped character. Plus users will definitely want to make use of escaped characters including \".

Personally it's more natural to allow any character but a sequential {} except via some sort of escape. Since we already parse the first argument as regular leo-ast string, then convert it to a leo format-string. We'd have to write another parser to exclude certain characters specifically in the first argument to console functions.

gluax commented 3 years ago

@bendyarm, I actually like that point. We could refactor the parser not to transform it to a FormattedString right away. We can just change the argument in the AST and ASG to a string(inline array in ASG's case). There may be some trickiness around that since AST has a notion of a string type before canonicalization, so the console AST node would take a string. So during canonicalization the argument would be converted to an inline array, but the node wouldn't support that. Solved easily by an enum, but might add some complexity to the AST for console statements. Then we take the current FormattedString logic and move it to the compiler as just logic to print rather than converting(speeding up the compiler possibly).

bendyarm commented 3 years ago

Sorry, I deleted the point because it had some stuff I needed to investigate more

bendyarm commented 3 years ago

For reference, here is what Python handles in the containers:
https://docs.python.org/3/library/string.html#format-string-syntax
and then
https://docs.python.org/3/library/string.html#format-specification-mini-language

Some reasons to keep \x7b and \x7d as immediate lexical replacements for { and } in a string so that "{}" in a console.log string could be gotten with "\x7b\x7d" (some of which are speculative):

it matches what the other \xHH and \u{HHHH} do (except for the non-scalar-value codes)
it matches what other languages do
there might be programs that generate these strings that would work better when chars are all specified as ascii hex values so they don't want to use "{}"
there might be language keyboards where { and } are hard to use
there might be IDEs where "{}" are too similar to "()" so you want to make them look more different
there might be templates that use "{}" as another sort of control so to pass through {} you need to use \x7b\x7d
you might want to obfuscate your code or represent it without using "{}" for whatever reason but you need to have a container

gluax commented 3 years ago

Ah, all good, I like the direction you were heading in non-the-less! TLDR: It means keeping the console arguments as a string-literal according to the grammar all the way through. The for printing we parse the string, like how we do for the current FormattedString. This keeps the grammar the way it is(at some possible minor complication to the AST), but is also more efficient most likely.

bendyarm commented 3 years ago

I like the idea of handling the format string mini language in the parser, but @acoglio is right, abnf makes it infeasible to define not-open-brace. There are too many synonyms of open-brace:

open-brace = "{" / "\x7b" / "\x7B" / "\u{00007b} / ... 9 more

So the main decision is still: Are the \x and \u escapes for ASCII chars always handled the same as the ASCII chars are? (except for \x22 (") and \x5c (), which have to be handled differently)

If not, it looks like we can handle formatting strings in the grammar as @acoglio described. If so, we need to handle them later.

If we handle them later, we need to decide when.
(1) handle them right after parsing, possibly by another simpler abnf grammar that parses the string but does not support escapes including \x or \u. (2) in a canonicalization stage, although it isn't really a canonicalization (3) If there will be a string append operator that can be constant-folded, maybe after constant folding.
(4..) or somewhere else that is convenient.

gluax commented 3 years ago

@bendyarm, they are handled the same. Though I personally like the rust style of handling this situation, of which we are similar too. Right now, you can't pass a variable as the first argument to a console function. It must be a string-literal. This means we could the approach I previously described.

bendyarm commented 3 years ago

@gluax I know they are currently handled the same. I meant that the main decision whether we should keep that behavior or change it. Rewording the above to be more clear what I meant:
So the main decision is still:
Should the \x and \u escapes for ASCII chars be always handled the same as the ASCII chars are?
(except for \x22 (") and \x5c (\), which have to be handled differently)

This is a different specification issue than the question whether console functions require the first argument to be a literal string or whether they allow. I think literal string is fine. It simplifies a lot of things.

gluax commented 3 years ago

@bendyarm ah got you. I'm more of the opinion of keeping it the same personally. Why add more complexity to the AST for no reason by breaking down Scalar chars into regular, escaped, hex, unicode. It would be more consistent with other languages as well as we treat them as just a single scalar char type rather than the breakdown.

acoglio commented 3 years ago

I'd be happy to leave the ABNF grammar as is, in particular with the rule

print-arguments = "(" string-literal  *( "," expression ) ")"

and have the compiler validate that string in a separate step. In fact, I thought that was the conclusion of our discussion on the topic on Slack a couple of days ago, but then I saw @gluax opening this issue and I thought (and wrote) how we could use the ABNF grammar to describe format strings, mainly to explore the tradeoffs.

I agree that it would be nice to treat \x7b\x7d the same as {} (and similarly for others). The problem is not so much that there are many ways to represent { or } with escapes (fewer than 9 because we can use ABNF's case-insensitive strings to cover both uppercase and lowercase hex digits), but that it's very hard (actually, infeasible) to represent the complement of those (i.e. 'everything except for'), as I wrote. ABNF's numeric ranges can represent well complements of single characters (see the rule not-double-quote-or-backslash-or-brace above), but those don't work when, besides excluding single character codes, we need to exclude escape chunks.

We could still use the ABNF grammar to capture some of the requirements on format strings, with comments to describe extra-grammatical requirements, as we do for identifiers for example:

identifier = letter *( letter / digit / "_" )   ; but not a keyword or aleo1...

Just exploring the space of possibilities.

acoglio commented 3 years ago

The only issue is that, if we use string-literal for format strings, they cannot be empty according to the grammar.

gluax commented 3 years ago

@acoglio ah yea I figured it better to open an issue for visibility and have a longer discussion on the tradeoffs.

If we want to go the string-literal route for the empty string, we could add an or for an empty string in the abnf. Alternatively, I can have the compiler say it shouldn't expect an empty string. Though I think the former makes more sense to do.

acoglio commented 3 years ago

I think that opening the issue was a good idea. I'm not complaining about that :)

acoglio commented 3 years ago

We could relax the string-literal rule

string-literal = double-quote 1*string-literal-element double-quote

to use * instead of 1*, and defer the check for non-emptiness to later in the compiler, but only for non-format-strings.

gluax commented 3 years ago

@acoglio, that also works, so I wouldn't error out at the parser level if a string is empty? But maybe later at canonicalization? Or would it stay as it is now? Technically the parser currently errors out for let s = ""; but not for console.log("");. I'm actually not sure why it doesn't error out for the latter.

acoglio commented 3 years ago

Yes, we could accept the empty string literal at parsing time, and then:

For string literals used as format strings, we check that they are valid format strings and turn them into FormattedString.
For string literals not used as format strings, we canonicalize them to array inline expressions, and check that they are not empty.

But I'm not sure if checks 1 and 2 above should be done at canonicalization time or not:

Check 1 is not really making things canonical; it's its own thing.
Check 2 could be deferred to when we check that arrays (including slices) are not empty.

Nonetheless, for practical purposes, we could just do check 1 as part of canonicalization, including the transformation to FormattedString.

acoglio commented 3 years ago

To be clear, currently the AST for console statements expects a FormattedString?

gluax commented 3 years ago

@acoglio I was planning to do check 1 past the asg and other phases during the actual compiler phase where we log statements. It would be more efficient. Thus removing FormattedString from the AST and ASG. Also to answer your question yes, AST expects a FormattedString.

Check 2 sounds like either way, it would wind up in canonicalization because empty array checks happen there as well.

acoglio commented 3 years ago

Yes, your idea of checking it at actual log time sounds good. This way we keep the AST and ASG simpler, as you say.

ProvableHQ / leo

[Bug] Update ABNF to match Format Strings #1167

🐛 Bug Report