Engelberg / instaparse

Eclipse Public License 1.0

More auto-whitespace options #209

Closed sigvesn closed 3 years ago

sigvesn commented 3 years ago

I find the auto-whitespace feature very practical, and use it for almost every parser.

I do, however, miss more granular control over how the whitespace rules are formed, which cannot be achieved by defining a custom whitespace parser:

  1. Non-optional whitespace rules: inserting <whitespace> instead of <whitespace?>
  2. Non-hidden whitespace rules: do not surround <whitespace?> with angle brackets
Engelberg commented 3 years ago

You're right that the auto-whitespace feature doesn't provide granular control; it simply aims to handle the common case of wanting to insert <whitespace?> throughout the grammar.

Note that the whitespace parser is merged in with your parser, so you do have access to those rules if you want. To achieve the sort of granular control you're looking for, you could separate out the rules with the explicit whitespace notations (such as whitespace? or <whitespace>), process them with the ebnf combinator to produce a grammar map, and then run the rest of the rules through the auto-whitespace option to create a Parser, and merge it all together. A parser is just a grammar map, a start production, and an output-format, so all you need to do is merge your grammar map of the rules that use explicit whitespace into the grammar map of the parser built by the auto-whitespace option. Something like:

(update parser-processed-with-auto-whitespace :grammar merge (ebnf explicit-whitespace-rules))
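Spelled out as a self-contained sketch (the grammar and rule names here are illustrative, and this assumes instaparse is on the classpath):

```clojure
(require '[instaparse.core :as insta]
         '[instaparse.combinators :refer [ebnf]])

;; A parser where most rules get the liberal optional whitespace:
(def base
  (insta/parser
    "sentence = token+
     <token> = word | number
     word = #'[a-zA-Z]+'
     number = #'[0-9]+'"
    :auto-whitespace :standard))

;; Override `number` with an explicit rule that *requires* leading
;; whitespace. The `whitespace` nonterminal is available because
;; :auto-whitespace merges the whitespace parser's rules into the grammar.
(def strict
  (update base :grammar merge
          (ebnf "number = <whitespace> #'[0-9]+'")))

(insta/failure? (strict "abc 123")) ;; => false (whitespace before the number)
(insta/failure? (strict "abc123"))  ;; => true  (no whitespace: rejected)
```

One side effect to be aware of: since the overridden rule demands leading whitespace unconditionally, a number at the very start of the input would now also be rejected.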

sigvesn commented 3 years ago

Thanks, this will work great for altering specific rules, but I guess if I want to alter all whitespace rules I will have to add them manually in the grammar.

For example: creating a parser that forces the separation of terminals. If I want to make this simple parser

(def words-and-numbers-auto-whitespace
  (insta/parser
    "sentence = token+
     <token> = word | number
     word = #'[a-zA-Z]+'
     number = #'[0-9]+'"
    :auto-whitespace :standard))

force the separation of words and numbers, such that

"abc 123 45 de"

is a valid input, but not

"abc123 45de"

the auto-whitespace feature will create a parser accepting both.
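A quick check of that claim, with the same parser as above (assuming instaparse is required as insta):

```clojure
(require '[instaparse.core :as insta])

(def words-and-numbers-auto-whitespace
  (insta/parser
    "sentence = token+
     <token> = word | number
     word = #'[a-zA-Z]+'
     number = #'[0-9]+'"
    :auto-whitespace :standard))

;; Auto-whitespace makes whitespace *permitted* between tokens,
;; never required, so both inputs parse:
(insta/failure? (words-and-numbers-auto-whitespace "abc 123 45 de")) ;; => false
(insta/failure? (words-and-numbers-auto-whitespace "abc123 45de"))   ;; => false
```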

If I understand correctly, for this problem, applying your solution would look something like

(update words-and-numbers-auto-whitespace
        :grammar merge
        (ebnf "number = <whitespace> #'[0-9]+' <whitespace?>
               word = <whitespace> #'[a-zA-Z]+' <whitespace?>"))
Engelberg commented 3 years ago

I can think of a few options:

(def words-and-numbers-auto-whitespace
  (insta/parser
    "sentence = token (<whitespace> token)*
     <token> = word | number
     word = #'[a-zA-Z]+'
     number = #'[0-9]+'"
    :auto-whitespace :standard))
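Checking this first option (same grammar as just above, under a different name): the mandatory <whitespace> between tokens now rejects the run-together input while still accepting the separated one.

```clojure
(require '[instaparse.core :as insta])

;; Same grammar as the first option above, with an explicit
;; mandatory <whitespace> between tokens:
(def words-and-numbers-separated
  (insta/parser
    "sentence = token (<whitespace> token)*
     <token> = word | number
     word = #'[a-zA-Z]+'
     number = #'[0-9]+'"
    :auto-whitespace :standard))

(insta/failure? (words-and-numbers-separated "abc 123 45 de")) ;; => false
(insta/failure? (words-and-numbers-separated "abc123 45de"))   ;; => true
```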

This enforces whitespace between tokens, but also gives you the liberal sprinkling of optional whitespace throughout the grammar, triggered by auto-whitespace. If you don't need the optional whitespace at all, you can of course control it all yourself:

(def words-and-numbers-auto-whitespace
  (insta/parser
    "sentence = token (<whitespace> token)*
     <token> = word | number
     word = #'[a-zA-Z]+'
     number = #'[0-9]+'
     whitespace = #'\\s+'"))

A third option would be to use regex lookahead or negative lookahead to build into your notion of tokens that they must be followed by whitespace (or the end of the string). In this example, another way to formulate it is that a word cannot be immediately followed by a digit, and a number cannot be immediately followed by a letter.
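A sketch of that third option (instaparse regexes are Java regexes, so negative lookahead is available; this particular formulation is one possibility):

```clojure
(require '[instaparse.core :as insta])

;; A word may not be immediately followed by a digit, and a number
;; may not be immediately followed by a letter:
(def words-and-numbers-lookahead
  (insta/parser
    "sentence = token+
     <token> = word | number
     word = #'[a-zA-Z]+(?![0-9])'
     number = #'[0-9]+(?![a-zA-Z])'"
    :auto-whitespace :standard))

(insta/failure? (words-and-numbers-lookahead "abc 123 45 de")) ;; => false
(insta/failure? (words-and-numbers-lookahead "abc123 45de"))   ;; => true
```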

I don't think it would be generally useful to have auto-whitespace replace all its instances of <whitespace?> with <whitespace>. The auto-whitespace option inserts whitespace very liberally, everywhere it is feasible. It can only do this because of its optionality. If every whitespace it inserted were mandatory, your grammar would probably end up requiring whitespace in lots of places you didn't expect it.

sigvesn commented 3 years ago

Yes, that makes sense. Feel free to close this issue and thank you for your help.

I would be interested to hear your thoughts on why whitespace handling here is so different from a traditional lex/yacc parser. Is this simply a consequence of not having a separate lexer/tokenizing stage?

Engelberg commented 3 years ago

Yeah, that's exactly right. When you have a separate tokenizing phase, the whitespace is usually part of what defines the token boundaries, and by the time you're at the grammar level, there's no whitespace left to handle.