BenjaminSchaaf / sbnf

A BNF-style language for writing sublime-syntax files
MIT License

Statically parametrized rules #4

Closed FichteFoll closed 4 years ago

FichteFoll commented 4 years ago

Parametrized rules would be a way of reducing repetition for otherwise highly similar rules that only differ in certain attributes, e.g. scope names, pattern matches or syntax to include.

A good example use case would be markdown fences for embedded code, which all follow the same structure but differ in the identifier to recognize for a language, the language to be pushed, and the scope name to be used for the embedded block. Refer to some randomly selected examples from Markdown.sublime-syntax. My idea would be to allow static parameters for contexts that can be used anywhere in the rule in a macro-substitution kind of way. Such parameters should be able to be passed downwards and would in most cases be strings (or regular expressions), but being able to substitute entire rules would be neat as well. This makes the concept more complex, however, since parameters would have different types.

Syntax Brainstorming

The following syntax contains a few not-yet-defined elements, such as embedding other languages as tracked in #2, and is mostly meant as an illustration.

I chose the syntax mostly on a whim, but I do believe it makes sense to re-use {{variable}} for replacing identifiers within regular expressions. Within the other locations, notably scope names, I am not so sure because the braces start getting noisy there. A different syntax or symbol, such as $variable, may be better. Also the way in which parameters are specified needs to be thought of. I tried mapping the patterns to either regular expressions or literals for now.

main = code-block('graphviz', `source.dot`, `graphviz`)
     | code-block('javascript|js', `source.js`, `javascript`)
     ;

code-block(ident, base_scope, sub_scope)
= code_fence_block_start{meta.code-fence.definition.begin.{{sub_scope}}.markdown-gfm}
  '(?i:{{ident}})'{constant.other.language-name.markdown}
  code_fence_infostring?
  !embed(scope:{{base_scope}}, scope: markup.raw.code-fence.{{sub_scope}}.markdown-gfm)
  code_fence_block_end{meta.code-fence.definition.end.{{sub_scope}}.markdown-gfm}
;

Unfortunately, this isn't what I would call easy to read. It should also be noted that this pattern makes use of backreferences, which are currently not supported either.
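To make the macro-substitution idea concrete, here is a minimal Python sketch (not sbnf, and not a claim about how sbnf would implement it). It expands `{{name}}` placeholders in a shortened version of one fence entry. The `expand` helper, the `TEMPLATE` text and the `LANGUAGES` table are all hypothetical. Placeholders without a matching parameter are left untouched, so sublime-syntax variables such as `{{fenced_code_block_start}}` pass through.

```python
import re

# Shortened, hypothetical version of one fence entry; not the real Markdown syntax.
TEMPLATE = """\
- match: '{{fenced_code_block_start}}((?i:{{ident}}))'
  embed: scope:{{base_scope}}
  embed_scope: markup.raw.code-fence.{{sub_scope}}.markdown-gfm
"""

# One parameter set per embedded language (values taken from the example above).
LANGUAGES = [
    {"ident": "graphviz", "base_scope": "source.dot", "sub_scope": "graphviz"},
    {"ident": "javascript|js", "base_scope": "source.js", "sub_scope": "javascript"},
]

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def expand(template, params):
    """Replace {{name}} with params[name]; leave unknown placeholders as they are."""
    return PLACEHOLDER.sub(lambda m: params.get(m.group(1), m.group(0)), template)


entries = [expand(TEMPLATE, params) for params in LANGUAGES]
print("".join(entries))
```

The pass-through also hints at a downside of reusing `{{...}}`: in this sketch, a parameter that happens to share its name with a sublime-syntax variable would silently shadow it.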

Current implementation (still not easy enough to read without syntax highlighting):

- match: |-
     (?x)
      {{fenced_code_block_start}}
      ((?i:graphviz))
      {{fenced_code_block_trailing_infostring_characters}}
  captures:
    0: meta.code-fence.definition.begin.graphviz.markdown-gfm
    2: punctuation.definition.raw.code-fence.begin.markdown
    5: constant.other.language-name.markdown
  embed: scope:source.dot
  embed_scope: markup.raw.code-fence.graphviz.markdown-gfm
  escape: '{{code_fence_escape}}'
  escape_captures:
    0: meta.code-fence.definition.end.graphviz.markdown-gfm
    1: punctuation.definition.raw.code-fence.end.markdown
- match: |-
     (?x)
      {{fenced_code_block_start}}
      ((?i:javascript|js))
      {{fenced_code_block_trailing_infostring_characters}}
  captures:
    0: meta.code-fence.definition.begin.javascript.markdown-gfm
    2: punctuation.definition.raw.code-fence.begin.markdown
    5: constant.other.language-name.markdown
  embed: scope:source.js
  embed_scope: markup.raw.code-fence.javascript.markdown-gfm
  escape: '{{code_fence_escape}}'
  escape_captures:
    0: meta.code-fence.definition.end.javascript.markdown-gfm
    1: punctuation.definition.raw.code-fence.end.markdown

BenjaminSchaaf commented 4 years ago

Ideally there would be more than just parameterization, like conditionals and "loops". Take for instance the Prolog syntax I've written. SWI-Prolog and ISO Prolog have a few syntax differences (IIRC specifically for strings), so it would be useful to generate two syntaxes from the single sbnf.

Considering BNF's history with boolean algebra I think it makes sense to look at logic programming for inspiration on syntax. I'm not married to the following syntax, but here's my proposal:

# Arguments to main are passed from the command line
main[TYPE] = string[TYPE] foo[TYPE] ;

# Conditionals
string['SWI'] = 'blah' ;
string['ISO'] = 'blah' ';' ;

# Parameterize a rule
ctx = foo[end1] foo[end2] ;
end1 = ';' ;
end2 = '.' ;
foo[END] = 'blah' END ;

# Parameterize a string/regex
ctx = foo[';'] foo['.'] ;
foo[END] = 'blah' END ;

# Mixin for parameters and arguments
ctx = foo['source.dot', 'graphviz'] ;
foo[SCOPE, NAME] = '(?:<NAME>)'{embed: scope:<SCOPE>} ;

I'd like to use predicate logic to signify a "for each", but I've got no clue about what the syntax would look like without adding keywords or using Unicode characters. Though keeping the language non-Turing-complete is also important, so a simple solution for loops might be best, and the easiest one is to just not have them:

foo
= manual-unroll['a']
  manual-unroll['b']
  manual-unroll['c']
  ( manual-unroll['d']
  | manual-unroll['e']
  | manual-unroll['f']
  )
;

BenjaminSchaaf commented 4 years ago

Just some extra notes on the syntax. I chose to use [] for parameters because () is ambiguous with groups, and I think keeping {} distinct would be beneficial as it only provides meta-data for the compiler.

As much as I like Prolog's distinction of variables by case, I don't see the need to do the same here. So the ALL-CAPS variables are simply a stylistic choice.

I'm not sure about <> for string interpolation/mixins. It works better inside {} parameters, but worse inside strings. Having both {{}} and <> is even worse though.

Sorry for the pings, but I'd really like to get feedback on this before I start doing anything. This is something I want to get really right ;)

@keith-hall @TheSecEng

michaelblyons commented 4 years ago

cc-ing @Thom1729, just in case you were also interested.

keith-hall commented 4 years ago

> Just some extra notes on the syntax. I chose to use [] for parameters because () is ambiguous with groups, and I think keeping {} distinct would be beneficial as it only provides meta-data for the compiler.

I agree, that sounds sensible :+1:

> I'd like to use predicate logic to signify a "for each", but I've got no clue about what the syntax would look like without adding keywords or using Unicode characters. Though keeping the language non-Turing-complete is also important, so a simple solution for loops might be best, and the easiest one is to just not have them

I can't think of any scenarios where I really needed a loop while developing syntax definitions previously, so for me the manual unrolling is fine - anything I am likely to need wouldn't be more than a few repetitions - let alone unbounded.

I'm not quite able to get my head around it all enough atm to offer any opinion on string interpolation/mixins, but I trust your judgement, if that helps! ;) It could be useful to see some minimal examples of how an SBNF grammar + specific arguments passed to main from the command line affect the "output/resolved" SBNF (and generated sublime-syntax), and to be able to visually see the difference in readability of using different tokens for string interpolation/mixins to help decide. (We'll want a sbnf.sublime-syntax to highlight the sbnf files to assist with said visualization/SBNF writing, ofc - I can see such a thing is gitignore'd atm unless I am mistaken?)

BenjaminSchaaf commented 4 years ago

> I can't think of any scenarios where I really needed a loop while developing syntax definitions previously, so for me the manual unrolling is fine - anything I am likely to need wouldn't be more than a few repetitions - let alone unbounded.

Yep, I've thought about it some more and I don't think it makes sense to have loops. I don't think unbounded ones were ever on the table anyway, as unbounded compilation is not something I want to touch :)

> We'll want a sbnf.sublime-syntax to highlight the sbnf files to assist with said visualization/SBNF writing, ofc - I can see such a thing is gitignore'd atm unless I am mistaken?

I've currently left the compilation of sbnf.sbnf to users. Though I'm not sure if the sublime-syntax would be all that helpful, since this is all about meta-programming. The way I'd implement it in the compiler is through an intermediary step that evaluates the arguments and does the string interpolation.
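Such an intermediary step might look roughly like the following Python sketch. It is an assumption about the general approach, not the actual compiler: the `GRAMMAR` encoding, `bind` and `evaluate` are hypothetical names, and the rules are the toy `main`/`string`/`foo` rules from the earlier proposal. Starting from `main` and its command-line arguments, it picks the rule variant whose parameter patterns match, substitutes the bound variables, and collects every rule instance it reaches.

```python
from collections import deque

# Hypothetical encoding: rule name -> list of (parameter patterns, body).
# A pattern is a quoted literal ("'SWI'") or a bare variable name ("TYPE").
# A body item is a terminal/variable (str) or a rule reference (name, args).
GRAMMAR = {
    "main": [(("TYPE",), [("string", ("TYPE",)), ("foo", ("';'",)), ("foo", ("'.'",))])],
    "string": [
        (("'SWI'",), ["'blah'"]),
        (("'ISO'",), ["'blah'", "';'"]),
    ],
    "foo": [(("END",), ["'blah'", "END"])],
}


def bind(patterns, args):
    """Match call arguments against patterns; return variable bindings or None."""
    if len(patterns) != len(args):
        return None
    env = {}
    for pattern, arg in zip(patterns, args):
        if pattern.startswith("'"):        # literal pattern: must be equal
            if pattern != arg:
                return None
        elif env.setdefault(pattern, arg) != arg:  # variable: bind consistently
            return None
    return env


def evaluate(grammar, start, args):
    """Return {(rule, args): body} for every rule instance reachable from start."""
    resolved = {}
    queue = deque([(start, tuple(args))])
    while queue:
        key = queue.popleft()
        if key in resolved:                # already instantiated: no re-expansion
            continue
        name, call_args = key
        for patterns, body in grammar[name]:
            env = bind(patterns, call_args)
            if env is not None:
                break
        else:
            raise LookupError(f"no variant of {name} matches {call_args}")
        out = []
        for item in body:
            if isinstance(item, tuple):    # rule reference: substitute arguments
                ref = (item[0], tuple(env.get(a, a) for a in item[1]))
                queue.append(ref)
                out.append(ref)
            else:                          # terminal, or a variable used as one
                out.append(env.get(item, item))
        resolved[key] = out
    return resolved


swi = evaluate(GRAMMAR, "main", ["'SWI'"])
iso = evaluate(GRAMMAR, "main", ["'ISO'"])
```

Because arguments are only ever passed along or matched, never computed, the set of reachable `(rule, args)` pairs is finite and the worklist always terminates, which fits the goal of keeping compilation bounded.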

Turns out the syntactical difference between SWI-Prolog and ISO Prolog is in comments. SWI supports nested comments /*/**/*/ while ISO does not /*/**/. So here's what it would look like to parameterize on that syntactical distinction:

name: <TYPE>-Prolog
extensions: pl pro
first-line: ^#!.*\bswipl\b

prototype[TYPE] = ( ~comment[TYPE] )* ;

comment[TYPE]
= '(%+).*\n?'{comment.line.percentage, 1: punctuation.definition.comment}
| multi-line-comment[TYPE]
;

multi-line-comment['SWI']{comment.block.nested}
= '/\*(\*(?!/))?'{punctuation.definition.comment}
  ( ~multi-line-comment['SWI'] )*
  ~`*/`{punctuation.definition.comment}
;

multi-line-comment['ISO']{comment.block}
= '/\*(\*(?!/))?'{punctuation.definition.comment}
  ~`*/`{punctuation.definition.comment}
;

main[TYPE]
= ( shebang
  | rule
  | fact
  )*
;

...

That would mean you'd be required to compile with arguments:

$ sbnf Prolog.sbnf SWI-Prolog.sublime-syntax SWI
$ sbnf Prolog.sbnf ISO-Prolog.sublime-syntax ISO

And the sbnf would evaluate to the following respectively:

name: SWI-Prolog
extensions: pl pro
first-line: ^#!.*\bswipl\b

prototype = ( ~comment )* ;

comment
= '(%+).*\n?'{comment.line.percentage, 1: punctuation.definition.comment}
| multi-line-comment
;

multi-line-comment{comment.block.nested}
= '/\*(\*(?!/))?'{punctuation.definition.comment}
  ( ~multi-line-comment )*
  ~`*/`{punctuation.definition.comment}
;

main
= ( shebang
  | rule
  | fact
  )*
;

...

name: ISO-Prolog
extensions: pl pro
first-line: ^#!.*\bswipl\b

prototype = ( ~comment )* ;

comment
= '(%+).*\n?'{comment.line.percentage, 1: punctuation.definition.comment}
| multi-line-comment
;

multi-line-comment{comment.block}
= '/\*(\*(?!/))?'{punctuation.definition.comment}
  ~`*/`{punctuation.definition.comment}
;

main
= ( shebang
  | rule
  | fact
  )*
;

...

keith-hall commented 4 years ago

Thanks, that clarifies things beautifully for me :)

FichteFoll commented 4 years ago

I very much like how conditionals and parametrizing contexts with expressions looks in your example. That's simple and concise, yet sufficient.

I'm also with you on manual unrolling, at least for now. Such a logic would not have to be implemented for this proposal immediately, if at all.

> Just some extra notes on the syntax. I chose to use [] for parameters because () is ambiguous with groups, and I think keeping {} distinct would be beneficial as it only provides meta-data for the compiler.

Agreed. [] also work well in a declarative or pattern-matching sense.

> I'm not sure about <> for string interpolation/mixins. It works better inside {} parameters, but worse inside strings. Having both {{}} and <> is even worse though.

I imagine we want to be able to use it inside regular expression patterns as well? In that case, it should be using a syntax that occurs infrequently, e.g. would be invalid syntax normally, as otherwise you have to deal with escape sequences et al. {{}} is a good fit.

Outside of such expressions, {{}} seems unnecessarily noisy. If a separate syntax was added for simple substitution, we could look towards other languages and bank on familiarity. ${} (or $name, if word boundaries are clear) is a frequently used pattern in e.g. Bash or JavaScript with generally known semantics and certainly seems better to me than <>. Neither of these special symbols is likely to be used in the strings we are substituting in.
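As a rough illustration of the "occurs infrequently" argument, the following Python sketch counts how often each candidate placeholder syntax would accidentally match inside one sample pattern. The sample uses Oniguruma-style named groups, where `<name>` is ordinary regex syntax. The `CANDIDATES` table, the `SAMPLE` pattern and the `collisions` helper are made up for this example, and a single sample is of course no proof.

```python
import re

# Candidate placeholder syntaxes from the discussion, expressed as Python regexes.
CANDIDATES = {
    "<NAME>": r"<\w+>",
    "{{NAME}}": r"\{\{\w+\}\}",
    "${NAME}": r"\$\{\w+\}",
}

# Hypothetical pattern with a named group and a named backreference.
# It is only scanned as text here, never compiled by Python.
SAMPLE = r"""(?<quote>["']).*?\k<quote>"""


def collisions(text):
    """Map each candidate syntax to the substrings it would wrongly treat as placeholders."""
    return {label: re.findall(rx, text) for label, rx in CANDIDATES.items()}


for label, hits in collisions(SAMPLE).items():
    print(f"{label:10} {hits}")
```

In this sample only `<NAME>` collides, because named groups and named backreferences already use angle brackets.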


Additionally, I believe this is the place where we need to decide what kind of parameters a rule can receive. Since we want to be able to pass and use expressions in parametrized rules, those must be possible. But we also want to perform pattern matching for conditionals and we would like to use string substitutions inside scope names (or embed patterns). As such, we could either double-purpose literal expressions or use a designated string type and syntax only for such parameters. Because 'match' expressions represent regular expression matches, we shouldn't use single quotes for such a string type (see my 'javascript|js' example in OP). Double quotes are still free, however.

BenjaminSchaaf commented 4 years ago

> Additionally, I believe this is the place where we need to decide what kind of parameters a rule can receive. Since we want to be able to pass and use expressions in parametrized rules, those must be possible. But we also want to perform pattern matching for conditionals and we would like to use string substitutions inside scope names (or embed patterns). As such, we could either double-purpose literal expressions or use a designated string type and syntax only for such parameters. Because 'match' expressions represent regular expression matches, we shouldn't use single quotes for such a string type (see my 'javascript|js' example in OP). Double quotes are still free, however.

I think the way to go here is to have the same three base types as the rest of sbnf: rules, regexes and literal strings. Regexes and literal strings would be implicitly converted depending on use, while rules stay separate, with type errors where applicable.
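One possible reading of that typing rule, sketched in Python under stated assumptions: the `Literal`, `Regex` and `Rule` classes and the `as_regex` helper are hypothetical and not taken from the sbnf compiler. A literal used where a regex is expected gets escaped, a regex is used as-is, and a rule in that position is a type error.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    text: str       # matched verbatim


@dataclass(frozen=True)
class Regex:
    pattern: str    # regular expression source


@dataclass(frozen=True)
class Rule:
    name: str       # reference to another rule


def as_regex(value):
    """Convert a parameter to regex source, or fail if it is a rule."""
    if isinstance(value, Regex):
        return value.pattern
    if isinstance(value, Literal):
        return re.escape(value.text)   # implicit literal -> regex conversion
    raise TypeError(f"expected a string or regex, got {type(value).__name__}")


print(as_regex(Regex("javascript|js")))
print(as_regex(Literal("*/")))
```

Passing `Rule("foo")` to `as_regex` raises `TypeError`, which mirrors the "type errors where applicable" part.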

BenjaminSchaaf commented 4 years ago

I've done a complete rewrite of the compiler to separate out compilation stages and implement parameterized rules. There are still some leftover things to re-implement and update (notably embed, though I might temporarily remove those while a better syntax for them is invented).

I ended up going with #[] for string interpolation: it works both inside {} and ' ', it's not new (it's similar enough to #{}/${}, and it's used in the pug/jade/dt template syntax), and it's consistent with other parameters (i.e. it also uses []). I think it has the fewest compromises with regard to consistency and familiarity.
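For readers who want to see the effect, here is a small Python sketch of `#[NAME]` interpolation applied to a regex string and to a scope name. It only imitates the behaviour described above. The `interpolate` function and the variable names are hypothetical, and rejecting unknown names is an assumption of this sketch rather than documented sbnf behaviour.

```python
import re

INTERPOLATION = re.compile(r"#\[(\w+)\]")


def interpolate(text, variables):
    """Replace every #[NAME] in text with variables[NAME]; unknown names are an error."""
    def replace(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"unknown variable {name!r} in {text!r}")
        return variables[name]
    return INTERPOLATION.sub(replace, text)


variables = {"IDENT": "javascript|js", "NAME": "javascript"}

# Inside a regex string ...
print(interpolate("(?i:#[IDENT])", variables))
# ... and inside a scope name.
print(interpolate("markup.raw.code-fence.#[NAME].markdown-gfm", variables))
```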

BenjaminSchaaf commented 4 years ago

I haven't merged into master yet, but my changes can be viewed on the parameters branch.

FichteFoll commented 4 years ago

I experimented a bit with this and it's looking pretty amazing. Even rules can already be parametrized and used in the context. Good job!

BenjaminSchaaf commented 4 years ago

This has now been merged into master and is available in version 0.3.1.