BenjaminSchaaf / sbnf

A BNF-style language for writing sublime-syntax files

MIT License

58 stars, 6 forks

Replace headers with variables #21

Closed BenjaminSchaaf closed 3 years ago

BenjaminSchaaf commented 3 years ago

Fixes #12

Example from sbnf.sbnf:

IDENTIFIER = '[[:alnum:]_\-\.]+'

variable : IDENTIFIER{entity.name.variable}
           `=`{keyword.operator.assignment}
           ( literal
           | regex
           | IDENTIFIER{variable.function} parameters?
           )
         ;

@mitranim @FichteFoll I'm not looking for a review, but I would appreciate thoughts on the syntax/semantic changes. The easiest way to see the big changes would be from the example in the README.

mitranim commented 3 years ago

Looks promising, glad to see some progress! 🔥

- Slightly worried about collisions between arbitrary and "well-known" clauses (former headers). It could cause issues when adding new "well-known" clauses to the language. But having less total syntax is a pretty decent tradeoff.
- Collisions could be minimized through case:
  - Upper case for well-known, lower case for arbitrary: NAME gets special treatment, name doesn't.
  - Lower case for well-known (matching .sublime-syntax), upper case for arbitrary: name gets special treatment, NAME doesn't.
- Still don't understand why clauses and rules need different syntax. We could just have a = b ; where the right-hand side determines if a is a simple clause or a rule. Well-known clauses such as name would have additional restrictions. This would also automatically enable lists, like extensions = 'clj' 'cljs' ;.
- Style: I'd remove kebab support and use lower_snake_case for rules.
- Style: I'd also use lower_snake_case for clauses. While lower vs. upper case makes it possible to distinguish clauses and rules at a glance, I still lean towards eradicating rather than reinforcing the distinctions (see above).

FichteFoll commented 3 years ago

I primarily read the diff of the README.

+1 for specifying the value of clauses with the same syntax as for rules (quotation). Not sure what to make of NAME = 'this is (regex)', though. How will that render? I don't think having a static list of "well-known" clauses is that much of a problem. If compatibility is a concern, then a new SBNF_VERSION clause seems reasonable for the future.
-1 for changing the declaration operator from = to : for rules. If we made a distinction between rules and clauses by the name's casing, which I consider to be a good idea, we could use = for both and not break compatibility as well as introduce a different operator for basically the same purpose.
Requiring kebab-case for rule names seems like an arbitrary limitation, but it follows the most widely accepted convention for context names currently. They definitely should be lower-cased, though. +0

BenjaminSchaaf commented 3 years ago

@mitranim

> Slightly worried about collisions between arbitrary and "well-known" clauses (former headers). It could cause issues when adding new "well-known" clauses to the language. But having less total syntax is a pretty decent tradeoff.

I agree, though this can be somewhat improved through better syntax highlighting, i.e. using the support scope so that those identifiers are specially marked. An alternate solution would be to reuse the options for syntax-level metadata, e.g.:

[FOO]
{name: foo, scope: text.foo}

main : 'foo' ;

> Still don't understand why clauses and rules need different syntax. We could just have a = b ; where the right-hand side determines if a is a simple clause or a rule. Well-known clauses such as name would have additional restrictions. This would also automatically enable lists, like extensions = 'clj' 'cljs' ;.

The problem is still how you deal with all the edge cases. Rules can have options, clauses cannot. Here's some ambiguity:

rule1 = 'foo' ;
rule2{foo} = 'foo'{bar} ;

# This compiles
example1 = rule1{foo} ;
# This can't
example2 = rule2{foo} ;

clause[rule1] = 'a' rule1 ; # overload 1
clause[rule2] = 'b' rule2 ; # overload 2

# This clearly instantiates overload 1
example3 = clause[rule1] ;
# This is ambiguous, even though the "value" of rule1 is also 'foo'
example4 = clause['foo'] ;

> Style: I'd remove kebab support and use lower_snake_case for rules.

kebab-case is better here. It's already the standard for sublime-syntax files and it's very intuitive: Hold shift for clauses, don't for rules (at least on my keyboard layout).

> Style: I'd also use lower_snake_case for clauses.

Assuming the distinction is kept, having that enforced in the names is a good thing imo.
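A minimal Python sketch of what "enforced in the names" could look like: the kind of a statement is decided from the identifier's casing alone. The two patterns and the `classify` helper are assumptions made for illustration; they are not taken from the SBNF implementation.

```python
import re

# Assumed naming patterns (illustrative, not taken from the SBNF source):
# clauses are UPPER_SNAKE_CASE, rules are lower kebab-case.
CLAUSE_NAME = re.compile(r"[A-Z][A-Z0-9_]*")
RULE_NAME = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")


def classify(identifier: str) -> str:
    """Return 'clause' or 'rule' based purely on the identifier's casing."""
    if CLAUSE_NAME.fullmatch(identifier):
        return "clause"
    if RULE_NAME.fullmatch(identifier):
        return "rule"
    raise ValueError(f"{identifier!r} fits neither naming convention")
```

Under these assumed patterns, `classify("IDENTIFIER")` gives `"clause"` and `classify("bare-rule")` gives `"rule"`, while a mixed-case name is rejected rather than silently guessed.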

BenjaminSchaaf commented 3 years ago

@FichteFoll

> Not sure what to make of NAME = 'this is (regex)', though. How will that render?

That would render as this is (regex) as the name of the syntax, but match this is regex as part of a rule. It is a bit of a language oddity that I haven't found a solution to, beyond having global options as mentioned in my reply to mitranim.
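The oddity can be reproduced with any regex engine. The sketch below uses Python's `re` module as a stand-in (an assumption: it is not the engine Sublime Text uses, but grouping parentheses behave the same way for this pattern).

```python
import re

# The clause value from the question above, stored as a plain string.
NAME = "this is (regex)"

# Used as syntax metadata, the string is taken verbatim.
syntax_name = NAME

# Used inside a rule, the same string is compiled as a pattern, so the
# parentheses form a capture group instead of matching literal characters.
pattern = re.compile(NAME)

matches_without_parens = pattern.fullmatch("this is regex") is not None
matches_with_parens = pattern.fullmatch("this is (regex)") is not None
```

Here `matches_without_parens` is `True` and `matches_with_parens` is `False`: the same value reads one way as a name and another way as a pattern.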

> If compatibility is a concern, then a new SBNF_VERSION clause seems reasonable for the future.

I'm not at all concerned about backwards compatibility for SBNF at this stage; breaking changes are expected.

> -1 for changing the declaration operator from = to : for rules. If we made a distinction between rules and clauses by the name's casing, which I consider to be a good idea, we could use = for both and not break compatibility as well as introduce a different operator for basically the same purpose.

My primary concern with using = for both is that the type of statement being made on any line would be completely determined by the capitalization of an identifier, which is a little too subtle. Equality also fits clauses best, as those actually establish an equivalence, whereas rules are more of a relation/mapping.

mitranim commented 3 years ago

> The problem is still how you deal with all the edge cases. Rules can have options, clauses cannot. Here's some ambiguity:
>
> rule1 = 'foo' ;
> rule2{foo} = 'foo'{bar} ;
>
> …
>
> clause[rule1] = 'a' rule1 ; # overload 1
> clause[rule2] = 'b' rule2 ; # overload 2
>
> …
>
> # This is ambiguous, even though the "value" of rule1 is also 'foo'
> example4 = clause['foo'] ;

Thanks for the explanation. I'm not convinced just yet.

Regarding ambiguities. The following approach may be able to avoid them (you judge):

- Equality is always structural, rather than nominal; an overload like clause[A] is equivalent to clause[B] if A == B, where either A or B can be named or anonymous (inline).
- Options participate in equality checking: 'literal' ≠ 'literal'{option}.
- rule{option} = body internally translates into rule = (body){option}. Options always become part of RHS, rather than part of the name.

In the example above, clause['foo'] is equivalent only to clause[rule1], because only rule1 is equivalent to 'foo' without options.
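The three points can be sketched in Python with frozen dataclasses, which compare by value and therefore give structural equality for free. `Terminal`, `RULES`, `resolve` and `pick_overload` are hypothetical names for this illustration, and flattening nested options into one tuple is a simplification.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union


@dataclass(frozen=True)
class Terminal:
    """A literal or regex together with its options, e.g. 'foo'{bar}."""
    text: str
    options: Tuple[str, ...] = ()


# Named rules from the example. rule2{foo} = 'foo'{bar} is stored with its
# left-hand option folded into the right-hand side (flattened for brevity).
RULES: Dict[str, Terminal] = {
    "rule1": Terminal("foo"),
    "rule2": Terminal("foo", ("bar", "foo")),
}

Argument = Union[str, Terminal]  # a rule name or an inline terminal


def resolve(arg: Argument) -> Terminal:
    """Replace a rule name by its value so that comparison is structural."""
    return RULES[arg] if isinstance(arg, str) else arg


def pick_overload(overloads: Dict[str, str], arg: Argument) -> Optional[str]:
    """Return the label of the single overload whose parameter equals arg."""
    hits = [label for param, label in overloads.items()
            if resolve(param) == resolve(arg)]
    return hits[0] if len(hits) == 1 else None


OVERLOADS = {"rule1": "overload 1", "rule2": "overload 2"}
```

With this model, `pick_overload(OVERLOADS, Terminal("foo"))` selects `"overload 1"`, because only `rule1` resolves to `'foo'` without options.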

Regarding options. Currently:

clause      = 'lit'        # OK
clause      = 'lit'{opt}   # ERR
clause{opt} = 'lit'        # ERR
clause{opt}                # OK
rule        : 'lit'{opt} ; # OK
rule{opt}   : 'lit'      ; # OK
rule{opt}                  # ERR
'#[clause]'                # OK
'#[rule]'                  # ERR

The user has to internalize 2 types of declarations, and a matrix of cases where you can and can't use options. I've been trying and failing to make sense of this.

Compare (requires passing options to rules):

clause    = 'lit'      ; # OK
rule      = 'lit'{opt} ; # OK
rule{opt} = 'lit'      ; # OK
rule{opt}                # OK
'#[clause]'              # OK
'#[rule]'                # ERR

In this scenario, you can use anything anywhere. Interpolation only works on "bare" rules where the value is a single "bare" string or regex. Passed options will either: work only on "bare" rules; override original LHS options; combine with original LHS options (depending on how we handle #14). Predefined identifiers such as name or scope (or NAME and SCOPE) may impose special restrictions on RHS. Subjectively, I find this easier to internalize than the current approach.

My apologies for pulling in another issue. I just feel it's very relevant. Ben asked for thoughts, so here they are. 🙂

mitranim commented 3 years ago

My latest comments could be a separate issue, about merging clauses and rules into one concept, which also depends on #12 and #14. It could be discussed after merging the PR. Of course that's completely up to Ben.

BenjaminSchaaf commented 3 years ago

@mitranim Currently options should work like this:

CLAUSE      = 'lit'        # OK
CLAUSE      = rule         # OK
CLAUSE      = CLAUSE       # OK
CLAUSE      = 'lit'{opt}   # ERR
CLAUSE{opt} = 'lit'        # ERR
rule        : 'lit'      ; # OK
rule        : rule       ; # OK
rule        : 'lit'{opt} ; # OK
rule{opt}   : 'lit'      ; # OK
t :
  rule{opt}                # OK assuming #14 is resolved
  '#[CLAUSE]'              # OK for CLAUSE = ''
  '#[rule]'                # ERR
;

Which is really just: only rules can have options, only strings can be interpolated.

Here's an alternate comparison with your proposal:

rule      = 'lit'      ; # OK
rule      = rule       ; # OK
rule      = 'lit'{opt} ; # OK
rule{opt} = 'lit'      ; # OK
t =
  rule{opt}              # OK
  '#[rule]'              # OK for rule = '' ;
  '#[rule]'              # OK for rule = rule2 ; rule2 = '' ;
  '#[rule]'              # ERR for rule{opt} = '' ;
  '#[rule]'              # ERR for rule = ''{opt} ;
  '#[rule]'              # ERR for rule = rule2 ; rule2{opt} = '' ; ...
;
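The five interpolation cases for the merged proposal can be modelled as a small check, sketched here in Python. `Terminal`, `Ref`, `Rule` and `interpolate` are hypothetical names for this illustration and do not mirror the SBNF compiler.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Terminal:
    text: str
    options: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Ref:
    name: str  # reference to another rule, as in rule = rule2 ;


@dataclass(frozen=True)
class Rule:
    body: Union[Terminal, Ref]
    lhs_options: Tuple[str, ...] = ()


def interpolate(name: str, rules: Dict[str, Rule]) -> str:
    """Return the text substituted for '#[name]', or raise ValueError."""
    seen = set()
    while True:
        if name in seen:
            raise ValueError(f"cyclic reference through {name!r}")
        seen.add(name)
        rule = rules[name]
        if rule.lhs_options:
            raise ValueError(f"{name!r} has options on its left-hand side")
        if isinstance(rule.body, Ref):
            name = rule.body.name  # follow the chain to the next rule
            continue
        if rule.body.options:
            raise ValueError(f"{name!r} has options on its terminal")
        return rule.body.text
```

Every error branch is a condition the reader has to track for each rule that might be interpolated, including rules reached only through a chain of references.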

While it's certainly syntactically simpler, it's semantically more complex. You have pretty much the same rules, but they're now all exceptions for string interpolation (and parameterisation currently). You're forced to keep track of which "rules" are designated for interpolation and which ones aren't with both syntaxes, but with mine it's explicit and with yours it's implicit:

identifier = '...' ;

variable = identifier{var} ;
definition = identifier{def} '{' variable* '}' ('(?=#[identifier])' 'foo')? ;

# vs

IDENTIFIER = '...'

variable : IDENTIFIER{var} ;
definition : IDENTIFIER{def} '{' variable* '}' ('(?=#[IDENTIFIER])' 'foo')? ;

With structural equivalence you're walking into the same limitation SBNF already has for regex equivalence. That's certainly not something I want more than once in the language:

# These two rules have identical grammars, but are structurally non-equivalent.
rule1 = 'foo' 'foo'* ;
rule2 = 'foo' rule2? ;
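A rough Python illustration of the gap between the two notions: the rules are written down as nested tuples (a hypothetical encoding) and as hand-written recognisers. Checking a handful of sample strings is not a proof of equivalence, and whole-string matching ignores how Sublime Text actually tokenizes input.

```python
# Each rule as a nested tuple mirroring how it is written.
RULE1 = ("seq", ("lit", "foo"), ("star", ("lit", "foo")))   # 'foo' 'foo'*
RULE2 = ("seq", ("lit", "foo"), ("opt", ("ref", "rule2")))  # 'foo' rule2?


def accepts_rule1(text: str) -> bool:
    """'foo' followed by zero or more further 'foo'."""
    if not text.startswith("foo"):
        return False
    rest = text[3:]
    while rest.startswith("foo"):
        rest = rest[3:]
    return rest == ""


def accepts_rule2(text: str) -> bool:
    """'foo' followed by an optional recursive use of rule2."""
    if not text.startswith("foo"):
        return False
    rest = text[3:]
    return rest == "" or accepts_rule2(rest)


samples = ["", "foo", "foofoo", "foofoofoo", "fo", "foobar"]
same_language = all(accepts_rule1(s) == accepts_rule2(s) for s in samples)
same_structure = RULE1 == RULE2
```

On these samples `same_language` is `True` while `same_structure` is `False`, which is the mismatch a purely structural comparison cannot see through.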

mitranim commented 3 years ago

Another consideration is forward compatibility when authoring a syntax. This will be especially relevant with imports (#9).

When clauses and rules are distinct, the author must take care to define every potentially-reusable regex as a clause:

IDENT = '\b[[:alpha:]_][[:alnum:]_]*\b'
ident : IDENT ;

But this requires more code than just defining a rule. To make a syntax more reusable, you must write (and read) more code. Writing less code (and keeping it more compact) makes parts of the syntax less reusable.

In the scenario where any "bare" rule is automatically a clause, authors don't need to worry about that. They can just:

ident = '\b[[:alpha:]_][[:alnum:]_]*\b' ;

... and the resulting rule is automatically usable in any context by any sub-syntax that imports this one. It's more forward-compatible.

mitranim commented 3 years ago

Looking at the example above, you could say there's no point to ident : IDENT ;, since you can use IDENT directly. That's true; the example is over-simplified. But the general point still stands. People will often define rules that could be clauses, producing less-reusable code.

BenjaminSchaaf commented 3 years ago

> In the scenario where any "bare" rule is automatically a clause, authors don't need to worry about that.

Except they do, if they ever want to use that regex as part of string interpolation or as parameters.

> That's true; the example is over-simplified.

Is it though? The only time you'd use a clause is exactly the same as any time you'd have a rule with no options.

mitranim commented 3 years ago

Clauses-and-rules have the following states:

CLAUSE            = 'terminal'        # AA
bare-rule         : 'terminal' ;      # AB
complex-rule{opt} : 'whatever'{opt} ; # AC

Just-rules have the following states. Assume that bare_rule can be used like a clause:

bare_rule         = 'terminal' ;      # BA
complex_rule{opt} = 'whatever'{opt} ; # BB

Assuming that clauses are strictly more flexible than equivalently-defined rules, state AB is undesirable, because it's strictly less flexible than state AA or BA. I'd like to eliminate undesirable states.

Within a single syntax file, this is a non-issue, because the author can always convert rules to clauses. However, with inheritance or imports (#9), this becomes an issue, because the "consumer" may not have the power to modify the syntax they're importing. So we're automatically losing flexibility.

We haven't explored all possible solutions yet. Off the top of my head, oversimplified:

- Provide a way to "extract" a terminal from any unparametrized rule whose value is a single terminal, ignoring any options. The resulting terminal can be assigned to a clause.
- Same as above, but such extraction is done automatically when interpolating. LHS/RHS options are either banned or ignored. This approach doesn't really require clauses as a separate concept.
- Ban single-terminal rules, forcing the author to define them as clauses.

These aren't full solutions; they might have hidden issues, such as how to handle regexps with capture groups once their options are dropped. I just hope this paints a clearer picture of what I've been trying to convey.
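The first option, extracting a terminal while ignoring options, might look roughly like the Python sketch below. The data model (`Terminal`, `Rule`, `extract_terminal`) is hypothetical and treats a rule body as a flat sequence of terminals, which is much simpler than a real SBNF rule.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Terminal:
    text: str
    options: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Rule:
    body: Tuple[Terminal, ...]        # simplified: a flat sequence
    lhs_options: Tuple[str, ...] = ()
    parameters: Tuple[str, ...] = ()


def extract_terminal(rule: Rule) -> Optional[str]:
    """Return the rule's terminal text with all options dropped, or None
    if the rule is parametrized or is not a single terminal."""
    if rule.parameters or len(rule.body) != 1:
        return None
    return rule.body[0].text
```

In this sketch a rule like `complex-rule{opt} : 'whatever'{opt} ;` yields `'whatever'`, so any scopes or captures attached through options are silently lost, which is the kind of hidden issue mentioned above.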

BenjaminSchaaf commented 3 years ago

I've taken some time to reassess these changes and have come to the decision to merge this as is. I'll outline my reasoning, but before that I really want to express my gratitude for the feedback on this; it's these kinds of discussions that lead to better design decisions.

The way I see it, there are two things that tip the scales here:

- Rules-as-variables limits all future possible expansions of the language to only use rules and terminals. It can't be reasonably expanded to also support integers for example.
- The previously mentioned inability to properly compare a rule and terminal for parameter matching if rules implicitly resolve to terminals.

When it comes to inheritance and variable -> rule conversions causing incompatibilities, you unfortunately get the exact same problem with rules-as-variables. An incompatibility is introduced if a scope is added to a terminal rather than a variable changed to a rule. In that sense a variable is a useful guarantee that it can always be interpolated, whereas rules would have to be specially marked to provide that same guarantee.

Additionally there are two workable alternatives as outlined by mitranim:

- Allow extracting terminals from rules.
- Allow interpolating rules if they consist of a single terminal.

Either of these is a better solution than rules-as-variables, although I think their use is yet to be determined. Both are relatively easy additions that can be made at a later point.