BenjaminSchaaf / sbnf

A BNF-style language for writing sublime-syntax files
MIT License

What's the best current/planned way to avoid some repetition without rule interpolation #34

Open eugenesvk opened 1 year ago

eugenesvk commented 1 year ago

I've been bitten by the inability to use rules in regexes, which forces a fallback to a composition of literal regexes (https://github.com/BenjaminSchaaf/sbnf/issues/12). I've also read a couple of more tangentially related issues: https://github.com/BenjaminSchaaf/sbnf/issues/14 and https://github.com/BenjaminSchaaf/sbnf/issues/4

But I'm still not clear on the best way to avoid some of the repetition, so I thought I could provide an example and ask for advice on how best to deal with the situation:

Let's say I'd like to define the rules for number syntax in the following relatively clean manner:

decimal{#[INT]decimal   }  : sign? integer (ddot integer)? exponent?                   ;
hex   {#[INT]hexadecimal}  : sign? base['x'] d-hex                                 ;
octal {#[INT]octal      }  : sign? base['o'] d-oct                                 ;
binary{#[INT]binary     }  : sign? base['b'] d-bin                                 ;

(ideally I'd have a function with a list of 3 parameters hex, octal, binary that would do the rest, but that's a separate issue)

The benefit of sign over [+-] is that I don't have to repeat scope definition in every rule

  sign                        :'[+-]'  {keyword.operator.arithmetic}                        ;

Same with the base function

  base[B]                     :'0#[B]' {#[CONST]base}                                       ;#

(d-hex etc. are d1-hex d1-hex-*, where finally d1-hex is a primitive regex of '[0-9a-fA-F]' and d1-hex- is the same with an _ (although _ aren't even allowed in rule names, so I can't express that properly in the name, but that's yet another issue), all with proper scopes)

But this doesn't work, since the rules create separate matches instead of one combined regex, so they'd mishandle a space-separated 0b1 111. As I've read in another issue, there is no way around this when matching consecutive symbols: you need regexes.

So I need a single regex, but rules are not allowed inside one, so I'd have to create primitive regexes and store them in CLAUSES as global variables instead of rules. And then for every match in every number I'd have to repeat the group scope definitions :( Also, getting down to a list of primitive regexes isn't very ergonomic in more composable conditions.

What is the best current/planned way to solve this issue?

BenjaminSchaaf commented 1 year ago

My suggestion for this is either:

number : '([+-])(?x:
                 ([0-9]*)
                |(...)
                )'{0: sign, 1: whatever, 2: ...} ;

or

SIGN = '([+-])'
SIGN_SCOPE = 'keyword.operator.arithmetic'

decimal : '#[SIGN]...'{0: #[SIGN_SCOPE]} ;

I do find the regex composition suggestion here interesting, but it requires implementing full regex parsing.
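
To make the trade-off concrete, here is a minimal Python sketch of the naive form of regex composition: fragments are joined by plain string concatenation, and each one gets its own capture group and scope. The `FRAGMENTS` table, the `compose` helper, and every scope name other than keyword.operator.arithmetic are hypothetical and are not part of SBNF. The sketch only works while no fragment contains capture groups of its own; renumbering nested groups is where real regex parsing would be needed.

```python
import re

# Hypothetical fragment table: name -> (regex fragment, scope).
# Fragments must not contain capture groups of their own.
FRAGMENTS = {
    "sign": (r"[+-]?", "keyword.operator.arithmetic"),
    "base_x": (r"0x", "constant.numeric.base"),  # assumed scope name
    "d_hex": (r"[0-9a-fA-F][0-9a-fA-F_]*", "constant.numeric.value"),  # assumed
}


def compose(*names):
    """Join fragments into one regex and return (pattern, {group: scope})."""
    parts, scopes = [], {}
    for index, name in enumerate(names, start=1):
        fragment, scope = FRAGMENTS[name]
        parts.append(f"({fragment})")  # one capture group per fragment
        scopes[index] = scope          # scope is defined once, in the table
    return "".join(parts), scopes


pattern, scopes = compose("sign", "base_x", "d_hex")
match = re.fullmatch(pattern, "-0xFF_00")
for group, scope in scopes.items():
    print(match.group(group), scope)
```

Because the result is a single regex, the parts must be adjacent: the composed pattern rejects "0x1 111".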

eugenesvk commented 1 year ago

Thanks for the tips

Your first suggestion is what I started with, before trying to refactor it into the more readable list pasted in this issue, precisely because it requires repeating all the scope matches for every number type.

The second one I'm already using with CLAUSES for scope prefix abbreviations (like #[S_M_INT]decimal, where S_M_INT (scope, meta, int) is meta.number.integer, so you get meta.number.integer.decimal). I'll extend it to individual scopes such as S_SIGN and see how it works.

I do find the regex composition suggestion here interesting, but it requires implementing full regex parsing.

By the way, would the (nonexistent) 'whitespace-strict' mode in Sublime syntax parser achieve the same? In that mode, a rule combo_a_b: rule_a rule_b; with each sub-rule matching a single char, would match only ab, but not a b?

BenjaminSchaaf commented 1 year ago

By the way, would the (nonexistent) 'whitespace-strict' mode in Sublime syntax parser achieve the same?

Sublime Syntax doesn't care about white space in any way; it's the same as any other characters. The primary difference between simple regexes and match stacks is that the former only works on a single line, whereas the latter works across multiple. Adding a way for SBNF to generate matches that are whitespace sensitive would still result in a meaningfully different grammar than a composed regex.
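
A toy Python model of that difference, assuming that a rule sequence lets whitespace (including line breaks) pass between its parts, while a single regex sees one line and requires its parts to be adjacent. `match_sequence` is a hypothetical helper for illustration, not how Sublime Text is implemented.

```python
import re


def match_sequence(patterns, text):
    """Toy model of a rule sequence: whitespace (newlines included) is
    skipped before each part instead of causing a failure."""
    pos, found = 0, []
    for pattern in patterns:
        pos = re.compile(r"\s*").match(text, pos).end()  # skip whitespace
        m = re.compile(pattern).match(text, pos)
        if m is None:
            return None
        found.append(m.group())
        pos = m.end()
    return found


parts = ["0b", "[01]", "[01]+"]
single = "".join(parts)  # the same parts as one composed regex

for text in ["0b1111", "0b1 111", "0b1\n111"]:
    print(repr(text), match_sequence(parts, text), bool(re.fullmatch(single, text)))
```

In this model the sequence accepts all three inputs, while the composed regex accepts only the first.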

eugenesvk commented 1 year ago

it's the same as any other characters

ah, ok, then that mode would indeed be SBNF-related, not ST

Adding a way for SBNF to generate matches that are whitespace sensitive would still result in a meaningfully different grammar than a composed regex.

So if our goal is matching ab like so

  main:
    - match: '(a)(b)'
      captures:
        1: punctuation.separator.char.a.kdl2
        2: punctuation.separator.char.b.kdl2
      pop: true

(or in SBNF with regexes, not rules)

a-then-b: '(a)(b)' {1:punctuation.separator.char.a, 2:punctuation.separator.char.b};
main      : ~a-then-b;

then the SBNF rules even with a special mode

rule-a : 'a'{punctuation.separator.char.a};
rule-b : 'b'{punctuation.separator.char.b};
a-then-b: (?special-mode-whitespace-sensitive rule-a rule-b);

instead of ↓ matching a b

  main:
    - match: 'a'
      scope: punctuation.separator.char.a.kdl2
      push: rule-b|0
      pop: true
  # Rule: rule-b
  rule-b|0:
    - match: 'b'
      scope: punctuation.separator.char.b.kdl2
      pop: true
    - match: '\S'
      scope: invalid.illegal.kdl2
      pop: true

would still not be able to generate ↓ (which seems like the equivalent to regex?, but there could be an easy mistake, so it's still meaningfully different )

contexts:
  main:
    - match: 'a(?=b)' # only match if next is 'b'
      scope: punctuation.separator.char.a
      push: b
      pop: true
  b:
    - match: '(?!b)' # bail on non-'b'
      pop: true
    - match: 'b'
      scope: punctuation.separator.char.b
      pop: true

BenjaminSchaaf commented 1 year ago

would still not be able to generate ↓ (which seems like the equivalent to regex?, but there could be an easy mistake, so it's still meaningfully different )

It's very meaningfully different. With just the regex, ac wouldn't match and both characters would be marked as invalid. With those rules, aa is highlighted as valid, which is just incorrect. In fact, the only way to actually do it correctly is with a branch point, so that when ST reaches the 2nd character it goes back and highlights the first one as invalid.
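
The commit-without-backtracking behaviour can be illustrated with a toy Python model for two-character inputs. It is a simplification, not Sublime Text's engine: `whole_regex` and `rule_sequence` are hypothetical helpers, the scope names are shortened, and a non-matching regex is assumed to leave both characters invalid, as described above.

```python
import re

A, B, BAD = "char.a", "char.b", "invalid.illegal"  # shortened scope names
EXPECTED = [("a", A), ("b", B)]


def whole_regex(text):
    """Single composed regex: both characters are scoped, or neither is."""
    if re.fullmatch(r"(a)(b)", text):
        return [A, B]
    return [BAD] * len(text)


def rule_sequence(text):
    """rule-a then rule-b as separate steps: each character is scoped as soon
    as it is seen and is never revisited if a later character is wrong."""
    return [scope if char == want else BAD
            for char, (want, scope) in zip(text, EXPECTED)]


for text in ["ab", "ac", "aa"]:
    print(text, whole_regex(text), rule_sequence(text))
```

For ac and aa the sequence model keeps the first a scoped as valid, whereas the regex model marks both characters invalid. Closing that gap is what the branch point is for.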