[Data Liberation] EBNF processor

adamziel commented 5 days ago

I started drafting an EBNF notation processor to potentially enable building parsers based on a grammar file, similarly to the MySQL parser (although that one is based on ANTLR grammar). I don't have any specific action for this issue, I only wanted to dump the code somewhere it would be searchable in the future. CC @janjakes

<?php

class EBNF_Processor {
    private $ebnf;
    private $bytes_already_parsed = 0;
    private $last_error;

    private $rule_name;
    private $rules = [];

    private $state = self::STATE_READY;

    const STATE_READY = 0;
    const STATE_FINISHED = 1;

    public function __construct($ebnf) {
        $this->ebnf = $ebnf;
    }

    public function get_last_error() {
        return $this->last_error;
    }

    public function is_finished() {
        return $this->state === self::STATE_FINISHED;
    }

    public function next_rule() {
        if($this->is_finished() || $this->get_last_error()) {
            return false;
        }

        while(true) {
            if(false === $this->skip_comments()) {
                return false;
            }
            if(false === $this->parse_rule_name()) {
                return false;
            }

            die();
        }

        $this->state = self::STATE_FINISHED;
        return false;
    }

    private function parse_branches() {
        $at = $this->bytes_already_parsed;
        if($at >= strlen($this->ebnf)) {
            $this->last_error = 'Unexpected end of input';
            return false;
        }
        switch($this->ebnf[$at]) {
            case '(':
                break;
            case '"':
            case "'":
                break;
        }
    }

    private function skip_whitespace($at) {
        return strspn($this->ebnf, " \t", $at);
    }

    private function skip_comments() {
        while(true) {
            $at = $this->bytes_already_parsed;
            if($at + 2 >= strlen($this->ebnf)) {
                return false;
            }

            if(
                $this->ebnf[$at] === "/" &&
                $this->ebnf[$at + 1] === "/"
            ) {
                // Skip comment
                $this->bytes_already_parsed = strpos($this->ebnf, "\n", $at) + 1;
            } else {
                break;
            }
        }
    }

    private function parse_rule_name() {
        $at = $this->bytes_already_parsed;
        // Skip whitespace at the beginning of the rule
        $at += strspn($this->ebnf, " \t", $at);

        $rule_name_length = strspn($this->ebnf, "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", $at);
        if( 0 === $rule_name_length ) {
            $this->last_error = 'Invalid rule name';
            return false;
        }
        $this->rule_name = substr($this->ebnf, $at, $rule_name_length);
        $at += $rule_name_length;
        $at += $this->skip_whitespace($at);

        // Skip past the '::='
        if( strlen($this->ebnf) <= $at + 3 ) {
            $this->last_error = 'Unexpected end of input';
            return false;
        }
        if( $this->ebnf[$at] !== ':' || $this->ebnf[$at + 1] !== ':' || $this->ebnf[$at + 2] !== '=' ) {
            $this->last_error = 'Expected "::="';
            return false;
        }
        $at += 3;

        $this->bytes_already_parsed = $at;
    }
}

JanJakes commented 4 days ago

@adamziel Nice! It would be great to have that, and I think it's better to use the EBNF syntax than ANTLR because the ANTLR format can misleadingly imply the usage of a specific toolset. Having that, we could then add a custom ENBF-based grammar toolset for analyzing, converting, and compressing the grammars.

I actually think we could switch to EBNF in https://github.com/WordPress/sqlite-database-integration/pull/157 because I see us more likely to maintain our own MySQL grammar rather than trying to synchronize if from MySQL Workbench. That said, we'd need support for parametrization (conditionally enabling or disabling some rules) to make it MySQL version aware. I don't think EBNF supports that out of the box, but it seems this could help us: https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form#Extensibility

adamziel commented 4 days ago

It would be nice to have both ANTLR and EBNF support! I only started exploring this to see how easily we could parse Markdown. I ended up using a parser library for now but I'd like to revisit this eventually and either .g4 or .ebnf seem fine.

JanJakes commented 4 days ago

@adamziel We could do that too! After all, the basic syntax is very similar. In any case, having a custom lean grammar toolset would be great, as the existing tools we've found so far weren't great.

JanJakes commented 4 days ago

@adamziel One more open question here is lexing. Each new parser will either need a manually written lexer (some could be fairly simple), or we'd need to build a more generic one (maybe similar to Phlexy by @nikic).

adamziel commented 4 days ago

We could support character ranges and run the parser directly on the input string

WordPress / wordpress-playground

[Data Liberation] EBNF processor #1981