BenjaminSchaaf / sbnf

A BNF-style language for writing sublime-syntax files
MIT License
58 stars 6 forks source link

Stack overflow without recursion #6

Closed michaelblyons closed 4 years ago

michaelblyons commented 4 years ago

I started working on a Lisp parser. I have middling experience with sublime-syntax, but none at all with BNF. Lisp lets you use almost any character in a symbol so long as it is not only digits and .s. (It gets worse than that with delimiters, but I didn't get that far.) I tried to represent that in the example below, but I blow the stack.

name: Lisp SBNF

main = ( ~( comment
          | symbol
       )  )*
     ;

symbol-char-must-have = '[^\s()\'"/,:;|\d.]|(\\.)'{1: constant.character.escape}; # May not be only digits or periods
symbol-char = '\d' | `.` | symbol-char-must-have ;

symbol =
        symbol-char*
        symbol-char-must-have
        symbol-char*
    ;

comment{comment.line.semicolon} =
        ';+'{punctuation.definition.comment}
        ~'$\n?'
    ;

I don't think there are any loops in the hierarchy, but I am prepared to be wrong.

BenjaminSchaaf commented 4 years ago

That's a good find. SBNF shouldn't ever get into an infinite loop, but I think this is one of those edge cases that just had to be found through experimentation. The reason it is looping in this case is due to branch points: The symbol must have a branch point as it is ambiguous where symbol-char-must-have is matched, but since the branch point is in a loop SBNF also gets into a loop when trying to exhaustively produce contexts for the branch point.

I'll work on a fix for this, but here's a couple things that may help your understanding of SBNF:

SBNF translates into a set of sublime-syntax contexts. 'a' 'b' translates into two contexts, one matching the regex 'a' and sets the context that matches the regex 'b'. Like most other sublime syntaxes, any amount of whitespace is allowed between the contexts. So 'a' 'b' actually matches all of the following:

ab
a     b
a
b

As such, SBNF rules can't replace sublime-syntax regexes. If you're trying to match a contiguous symbol you will always have to do so with regex, regardless of whether you're using sbnf or sublime-syntax. Here's how you'd rewrite symbol to not require a branch point:

symbol
= '(?:\d|\.|[^\s()\'"/,:;|\d.]|(\\.))*[^\s()\'"/,:;|\d.]|(\\.)(?:\d|\.|[^\s()\'"/,:;|\d.]|(\\.))*'
  {1: constant.character.escape, 2: constant.character.escape, 3: constant.character.escape}
;
# or
symbol = symbol-impl['[^\s()\'"/,:;|\d.]|(\\.)'] ;
symbol-impl = '(?:\d|\.|#[MUST_HAVE])*#[MUST_HAVE](?:\d|\.|#[MUST_HAVE])*'
              { 1: constant.character.escape
              , 2: constant.character.escape
              , 3: constant.character.escape
              }
            ;
michaelblyons commented 4 years ago

So 'a' 'b' actually matches all of the following:

Thanks! Mental model adjusted. (I thought I would be creating a whitespace matcher and using that in between terminals.)

symbol = symbol-impl['[^\s()\'"/,:;|\d.]|(\\.)'] ;
symbol-impl = '(?:\d|\.|#[MUST_HAVE])*#[MUST_HAVE](?:\d|\.|#[MUST_HAVE])*'
              { 1: constant.character.escape
              , 2: constant.character.escape
              , 3: constant.character.escape
              }
            ;

It took me a while to realize that #[MUST_HAVE] is coming from this rule in arguments

An identifier that does not reference a rule is a free variable unique to the rule's scope. It matches any argument and may be passed on and or interpolated.

Does that imply that I can create and reuse variables like sublime-syntax has by doing this very contrived example:

octet = '(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9][0-9])|(?:[1-9][0-9])|[0-9])' ;

ipv4-impl = '(?:(?:#[OCTET]\.){3}#[OCTET])'{constant.numeric.ip-address.v4} ;
ipv4 = ipv4-impl[octet] ;

rgb-impl = 'rgb(\()(#[LEVEL])(,)\s*(#[LEVEL])(,)\s*(#[LEVEL])(\))'
        { meta.foo
        , 1: punctuation.section.begin
        , 2: constant.numeric.integer.decimal
        , 3: punctuation.separator.sequence
        , 4: constant.numeric.integer.decimal
        , 5: punctuation.separator.sequence
        , 6: constant.numeric.integer.decimal
        , 7: punctuation.section.end
        } ;
rgb = rgb-impl[octet] ;
BenjaminSchaaf commented 4 years ago

Ah, sorry. I made a mistake in constructing my suggestion. It's supposed to be:

symbol
= '(?:\d|\.|[^\s()\'"/,:;|\d.]|(\\.))*[^\s()\'"/,:;|\d.]|(\\.)(?:\d|\.|[^\s()\'"/,:;|\d.]|(\\.))*'
  {1: constant.character.escape, 2: constant.character.escape, 3: constant.character.escape}
;
# or
symbol = symbol-impl['[^\s()\'"/,:;|\d.]|(\\.)'] ;
symbol-impl[MUST_HAVE]
= '(?:\d|\.|#[MUST_HAVE])*#[MUST_HAVE](?:\d|\.|#[MUST_HAVE])*'
  { 1: constant.character.escape
  , 2: constant.character.escape
  , 3: constant.character.escape
  }
;

The MUST_HAVE variable is required to also be in the argument list of the rule.

Does that imply that I can create and reuse variables like sublime-syntax has by doing this very contrived example:

Kind of. The problem with your example is that it's trying to interpolate a rule into a regex. There's no current way to declare a global variable, as such the only way to pass a regex/string to a rule is directly:

ipv4-impl[OCTET] = '(?:(?:#[OCTET]\.){3}#[OCTET])'{constant.numeric.ip-address.v4} ;
ipv4 = ipv4-impl['(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9][0-9])|(?:[1-9][0-9])|[0-9])'] ;

rgb-impl[LEVEL]
= 'rgb(\()(#[LEVEL])(,)\s*(#[LEVEL])(,)\s*(#[LEVEL])(\))'
{ meta.foo
, 1: punctuation.section.begin
, 2: constant.numeric.integer.decimal
, 3: punctuation.separator.sequence
, 4: constant.numeric.integer.decimal
, 5: punctuation.separator.sequence
, 6: constant.numeric.integer.decimal
, 7: punctuation.section.end
} ;
rgb = rgb-impl['(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9][0-9])|(?:[1-9][0-9])|[0-9])'] ;

Global variables would be a good addition imo, though a separate issue :)