ceylon / ceylon-spec

DEPRECATED
Apache License 2.0

BNF patterns #902

Open gavinking opened 10 years ago

gavinking commented 10 years ago

Problem

Parsing text is an extremely common task, and therefore it would be very nice to have a convenient syntax for creating little parsers from a grammar.

Rant

The traditional thing that programming languages have provided for this is regular expression matching. A regex is a pattern of character classes with alternation and quantification, the operators |, *, +, ?. Unfortunately, regexes are entirely unfit for purpose, since any nontrivial parsing task, and even many trivial parsing tasks, require at least one of:

  • recursive rules (for matching nested structures),
  • a separate tokenization (scanner) phase, or
  • production of an AST.

The fact that regexes provide none of these things is frankly laughable, and it's absurd that language designers have considered this a useful facility. And it's extremely hard to understand, given that language designers are usually at least half familiar with what a real parser looks like. Regexes are not even capable of matching nested parentheses, one of the most basic tasks in parsing.

Worse, regexes feature an awful syntax in which what is a character and what is a "metacharacter" is not clearly distinguishable to the eye.

Proposal

I propose that we introduce a special syntax for BNF patterns that allows parsers to be built via combination of parser rules. This facility would follow the general approach we have taken with operators in Ceylon, where the operators are defined against some general purpose interfaces, allowing different parser libraries to take advantage of the syntax.

Patterns are built from, and usually used to define, rules. For example:

value number = ## digit+ (dot digit+)? (e sign? digit+)? ##;

Patterns need to be quoted, since the syntax of a pattern is not part of the expression syntax. I have gone with ## as the quote character here, though there are other options, including \\.

value number = \\ digit+ (dot digit+)? (e sign? digit+)? \\;

Identifiers name rules, for example, digit, e, sign, and dot.

The operators |, ?, *, and + represent alternation and quantification. Parens represent grouping. We might also have a ~ operator, not sure about that one.

Todo

The main outstanding thing to do here is to define a Rule interface that generalizes the proposed operators across a sufficiently broad range of reasonable parser libraries.

Note that this proposal is very straightforward to implement, since it essentially amounts to the addition of some new operators to the compiler. It is not a whole lot of work.

WDYT?

tombentley commented 10 years ago

I'm not really sure I understand what you're proposing, to be honest.

FroMage commented 10 years ago

Is this supposed to address the other things addressed by regexes, like:

  • Replacement
  • Capturing
  • Look-ahead/behind
  • Eager/greedy modifiers
  • Match bounds (for ex \d{2})
  • Locations (start of input, end of input, word boundary)

gavinking commented 10 years ago
  • Stuff between ## is a pattern, and the type of a pattern is Rule?
  • Hence we can construct a tree of Rules within Ceylon code?

Right.

  • At runtime we can pass a Rule instance to a 'parser library'

No, your parser library implements Rule. For example, a regex library would come with primitive rules like wordChar, charClass(), etc, and then you use patterns to build more sophisticated rules.

  • What do we get back from the 'parser library'?

WDYM? You get objects. I don't understand the question.

How does this address your gripes about AST production and tokenization?

Rule instances might accept a character stream, and produce Tokens. (A scanner). Or they might accept Tokens and produce Nodes. (A parser).

Rule is an extremely generic interface, along the lines of Ranged, something like that. The only thing it really specifies is composability via the operators.
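The genericity point can be made concrete with a tiny sketch (in Python rather than Ceylon, so it is easy to try out; all names here are hypothetical, not part of the proposal): the same combinator shape works unchanged when the "stream" is a list of tokens rather than a string of characters.

```python
# Hypothetical sketch: the composable-Rule idea applied at the token level.
# A rule consumes a list of (kind, text) tokens instead of characters;
# match() returns the end position on success, or None on failure.

class TokenRule:
    def match(self, tokens, start):
        raise NotImplementedError

    def before(self, other):
        first = self
        class Seq(TokenRule):
            def match(self, tokens, start):
                mid = first.match(tokens, start)
                return None if mid is None else other.match(tokens, mid)
        return Seq()

class Kind(TokenRule):
    # matches a single token of the given kind
    def __init__(self, kind):
        self.kind = kind
    def match(self, tokens, start):
        if start < len(tokens) and tokens[start][0] == self.kind:
            return start + 1
        return None

# a toy assignment "grammar" over tokens: ident '=' num
assignment = Kind('ident').before(Kind('eq')).before(Kind('num'))
```

The scanner and parser levels then compose: one set of Rule implementations turns characters into tokens, and another, built against the same interface, turns tokens into nodes.
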

gavinking commented 10 years ago

Is this supposed to address the other things addressed by regexes

  • Replacement
  • Capturing

Those are concerns of the parser library.

Look-ahead/behind

Whether a parser library is topdown, bottomup, a PEG, uses lookahead, syntactic predicates, or whatever, is not really relevant at this level of the discussion:

Eager/greedy modifiers

Similar answer:

value myNongreedyRule = ungreedy(myGreedyRule);

I have never once needed a non-greedy rule.

Match bounds (for ex \d{2})

I mean, we could add that, but I have never really run into much need for it. It's not like x{4} is really much superior to x x x x. (Of course you need it in traditional regexes because there are no subrules.) OTOH, I'm not actively against it. I've just not run into very many parsing tasks where it helps. And again, patterns are not the only way to produce rules:

value fourDigits = repeat(digit,4);

Locations (start of input, end of input, word boundary)

These are just primitive Rules, aren't they?

By the way, "word boundary" is something you never ever need if you use a proper separation between the scanner and parser, AFAICT.

gavinking commented 10 years ago

P.S. An issue I glossed over in the original proposal is whether a pattern can have function calls in it. If it can, then we can write stuff like greedy(rule), repeat(digit,4), charRange('a','b'), group(wildcard, "replacementGroupName") in a pattern. However, we would have to nail down the syntax of that. In a first cut of this, it's not necessary.

tombentley commented 10 years ago

Whether a parser library is topdown, bottomup, a PEG, uses lookahead, syntactic predicates, or whatever, is not really relevant at this level of the discussion

Is that really true?

gavinking commented 10 years ago

Things like ANTLR generate compiled parsers, rather than interpreted ones. I don't yet see how this proposal enables their use.

Sure, this is for interpreters, not for compiler compilers. I'm not trying to abstract over ANTLR and YACC. But it's surely possible to write pretty good interpreters for LL grammars and PEGs, which seem the most relevant grammars these days. LR grammars I must admit I'm not so sure about, I don't really fully understand bottom-up parsing.

The parser library in use will determine whether things like left/right recursion are supported. Personally I would want the IDE to tell me the grammar I'm writing isn't going to work because it's left-recursive, which my chosen library doesn't support, rather than discovering it at runtime.

I mean, that's feedback you could get from a one-line unit-test, if the library supported it. The overall workflow would be more convenient than with ANTLR, not less.

gavinking commented 10 years ago

So let me make this a bit more concrete by building a very simplistic character matching engine. Suppose we define Rule in ceylon.language like this:

shared interface Rule<T> of T 
        given T satisfies Rule<T> {
    shared formal T or(T other);
    shared formal T before(T other);
    shared formal T zeroOrOne();
    shared formal T zeroOrMore();
    shared formal T oneOrMore();
}

Then our scanner might look like this:

abstract class CharacterRule() 
        satisfies Rule<CharacterRule> {

    shared formal Integer? match(String stream, Integer start);

    shared Boolean matches(String stream, Integer start=0)
            => match(stream, start) exists;

    shared actual CharacterRule before(CharacterRule other) {
        object rule extends CharacterRule() {
            shared actual Integer? match(String stream, Integer start) {
                if (exists end = outer.match(stream, start)) {
                    return other.match(stream,end);
                }
                else {
                    return null;
                }
            }
            string => outer.string + other.string;
        }
        return rule;
    }

    shared actual CharacterRule zeroOrMore() {
        object rule extends CharacterRule() {
            shared actual Integer match(String stream, Integer start) {
                variable Integer position = start;
                // stop when the rule fails or consumes nothing,
                // so rules like x? can't loop forever
                while (exists end = outer.match(stream,position), 
                       end > position) {
                    position = end;
                }
                return position;
            }
            string => outer.string + "*";
        }
        return rule;
    }

    shared actual CharacterRule oneOrMore() 
            => before(zeroOrMore());

    shared actual CharacterRule zeroOrOne() {
        object rule extends CharacterRule() {
            shared actual Integer match(String stream, Integer start)
                     => outer.match(stream,start) else start;
            string => outer.string + "?";
        }
        return rule;
    }

    shared actual CharacterRule or(CharacterRule other) {
        object rule extends CharacterRule() {
            shared actual Integer? match(String stream, Integer start)
                    => outer.match(stream,start) else other.match(stream, start);
            string => "(" + outer.string + "|" + other.string + ")";
        }
        return rule;
    }

}

class Lit(String characters) 
        extends CharacterRule() {
    match(String stream, Integer start) 
            => stream[start:characters.size]==characters then start+characters.size;
    string => characters;
}

class Class({Character*} characters) 
        extends CharacterRule() {
    match(String stream, Integer start) 
            => if (exists c = stream[start], c in characters) 
               then start+1 else null;
    string => "[" + String(characters) + "]";
}

Finally, the following grammar would match numbers:

value digit = Class('0'..'9');
value dot = Lit(".");
value e = Lit("e");
value sign = Class { '+', '-' };
value number = ## sign? digit+ (dot digit+)? (e sign? digit+)? ##;
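To make the behavior of this sketch easy to check, here is a rough transliteration into Python (Python only so it can be run directly; all class and function names are my own, not part of the proposal). Since Python has no `##` pattern syntax, the final grammar is written out as explicit combinator calls, which is exactly what the pattern would desugar to.

```python
# Hypothetical Python transliteration of the Ceylon character-matching
# sketch. A rule's match() returns the end position on success, or None.

class Rule:
    def match(self, stream, start):
        raise NotImplementedError
    def matches(self, stream, start=0):
        return self.match(stream, start) is not None
    def before(self, other):
        return Seq(self, other)
    def __or__(self, other):
        return Alt(self, other)
    def zero_or_one(self):
        return Opt(self)
    def zero_or_more(self):
        return Star(self)
    def one_or_more(self):
        # one match, then any number of further matches
        return self.before(self.zero_or_more())

class Lit(Rule):
    def __init__(self, text):
        self.text = text
    def match(self, stream, start):
        end = start + len(self.text)
        return end if stream[start:end] == self.text else None

class Cls(Rule):
    def __init__(self, chars):
        self.chars = set(chars)
    def match(self, stream, start):
        ok = start < len(stream) and stream[start] in self.chars
        return start + 1 if ok else None

class Seq(Rule):
    def __init__(self, a, b):
        self.a, self.b = a, b
    def match(self, stream, start):
        mid = self.a.match(stream, start)
        return None if mid is None else self.b.match(stream, mid)

class Alt(Rule):
    def __init__(self, a, b):
        self.a, self.b = a, b
    def match(self, stream, start):
        m = self.a.match(stream, start)
        return m if m is not None else self.b.match(stream, start)

class Opt(Rule):
    def __init__(self, r):
        self.r = r
    def match(self, stream, start):
        m = self.r.match(stream, start)
        return start if m is None else m

class Star(Rule):
    def __init__(self, r):
        self.r = r
    def match(self, stream, start):
        pos = start
        while True:
            m = self.r.match(stream, pos)
            if m is None or m == pos:   # failure or empty match: stop
                return pos
            pos = m

digit = Cls('0123456789')
dot, e = Lit('.'), Lit('e')
sign = Cls('+-')

# sign? digit+ (dot digit+)? (e sign? digit+)?
number = (sign.zero_or_one()
          .before(digit.one_or_more())
          .before(dot.before(digit.one_or_more()).zero_or_one())
          .before(e.before(sign.zero_or_one())
                   .before(digit.one_or_more()).zero_or_one()))
```
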
tombentley commented 10 years ago

@gavinking thanks, I think that illustrates what you're proposing quite well. And I do quite like what I see.

On a practical point, isn't there a bit of a chicken/egg problem here: Don't we at least need one working implementation of this before we go setting it in stone in the spec?

gavinking commented 10 years ago

Don't we at least need one working implementation of this before we go setting it in stone in the spec?

Well, I mean, since the only thing the Rule API defines is the signatures of |, ?, *, +, and sequencing, which I think are pretty well-understood, I don't see much danger of us getting it wrong.

OTOH, sure, we should write a very simple parser combinator library for the SDK, of course. I like this blog post as a simple overview, by the way: http://qntm.org/combinators.

tombentley commented 10 years ago

A thought occurred to me over the weekend: With a PEG the | operator has a different interpretation than in a normal CFG. Which means that you can't actually read one of these grammars and know what it will match -- you'd need to know which interpretation the parser library used for Rule.or.

We could separate out these two interpretations into different operators, of course. If we did that, perhaps we could also build ambiguity detection into the typechecker.

gavinking commented 10 years ago

With a PEG the | operator has a different interpretation than in a normal CFG.

Well, a slightly different interpretation.

gavinking commented 10 years ago

If we did that, perhaps we could also build ambiguity detection into the typechecker.

What is "ambiguous" still depends on what kind of grammar you have. An unambiguous LL(2) grammar can be an ambiguous LL(1) grammar. And that's just for LL parsers!

tombentley commented 10 years ago

What is "ambiguous" still depends on what kind of grammar you have

Ah, good point.

gavinking commented 9 years ago

Note that if we add support for AST transformers then we could in principle even accommodate parser compilers with this feature!

Indeed, AST transformers would allow some really simple parser implementations using stuff like compile-time translation to a regex with error reporting and everything.

That would make this feature even more attractive, it seems to me.

gavinking commented 9 years ago

perhaps we could also build ambiguity detection into the typechecker.

To be clear, what I'm saying above is that AST transformers would solve this problem.

EricSL commented 9 years ago

An issue I glossed over in the original proposal is whether a pattern can have function calls in it. If it can, then we can write stuff like greedy(rule), repeat(digit,4), charRange('a','b'), group(wildcard, "replacementGroupName") in a pattern. However, we would have to nail down the syntax of that. In a first cut of this, it's not necessary.

As a first cut, at least do it in a way that it will be possible to eventually reach feature parity with OMeta. It's an amazingly powerful language that I also find hard to read. But it should be possible to make more readable. The IronMeta variant looks better to me.

I find your syntax unconvincing as you don't give an example of multiple mutually recursive rules, and somewhat deeper language changes will be needed to give a clean way to do that.

EricSL commented 9 years ago

Actually I think the type system may almost be powerful enough to do this if the rules are all classes instead of objects.

class Calculator extends StringParser {
    // inherits nested classes like Star, Sequence, etc.
    class Number extends RecursiveRule<Plus> {}
    class Expression extends RecursiveRule<Choice<[AddExpression, SubtractExpression, ParenExpression, Number]>> {}
    alias OpenParen = Literal("(").RuleClass
    alias CloseParen = Literal(")").RuleClass
    class ParenExpression extends RecursiveRule<Sequence<[OpenParen, Expression, CloseParen]>> {}
    class BinaryOp(String operator) extends ParameterizedRule<[String]>(operator) {
        alias Operator = Literal(operator).RuleClass
        override RuleClass extends RecursiveRule<Sequence<[Expression, Operator, Expression]>> {}
    }
    alias AddExpression = BinaryOp("+").RuleClass
    alias SubtractExpression = BinaryOp("-").RuleClass
    override Expression parserRoot => Expression()
}

The idea being that instantiating a parser causes the root rule to be instantiated, which in turn instantiates any rules it depends on, and adds them to a hash set unless it finds the same rule is already there in which case it knows it doesn't need to recurse on that rule. Two rules are equal if they have the same class; parameterized rules are equal if they have the same class and the same parameters.
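The instantiation scheme described above can be sketched independently of the type-system question (in Python, with hypothetical names): build the rule graph from a root rule class, keeping one instance per rule identity (class plus parameters) in a registry, so that mutually recursive rules terminate and end up sharing instances.

```python
# Hypothetical sketch: instantiating a root rule instantiates its
# dependencies, deduplicated by (class, parameters) so recursion stops.

class Rule:
    parameters = ()                      # parameterized rules override this

    def dependencies(self):
        return []                        # rule classes this rule refers to

    @classmethod
    def instantiate(cls, registry=None):
        registry = {} if registry is None else registry
        key = (cls, cls.parameters)      # two rules are equal if same class
        if key in registry:              # and same parameters
            return registry[key]         # already built: don't recurse again
        rule = cls()
        registry[key] = rule             # register before visiting deps,
        rule.deps = [d.instantiate(registry)   # so cycles terminate
                     for d in rule.dependencies()]
        return rule

class Number(Rule):
    pass

class Expression(Rule):
    def dependencies(self):
        # refers back to itself indirectly via ParenExpression
        return [ParenExpression, Number]

class ParenExpression(Rule):
    def dependencies(self):
        return [Expression]

root = Expression.instantiate()
```

Following the cycle Expression → ParenExpression → Expression lands back on the same instance, which is the property that makes mutually recursive grammars finite to build.
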

EricSL commented 9 years ago

Okay, the type system isn't as powerful as I hoped. In particular there appears to be no way to instantiate a generic type parameter, and no way to create a class alias that is particular to a class within a particular instance.

So it comes down to language support for lazily instantiating the rules and allowing rules to refer to variables declared after them.