EoinDavey / tsPEG

PEG Parser Generator for TypeScript
Mozilla Public License 2.0
195 stars 7 forks source link

NPM Build Status

tsPEG : A PEG Parser Generator for TypeScript

tsPEG is a PEG Parser generator for TypeScript. tsPEG takes in an intuitive description of a grammar and outputs a fully featured parser that takes full advantage of the TypeScript type system.

Installation

tspeg can be installed by running

npm install -g tspeg

Features

Demos

The demos directory in the git repo contains several example grammars with explanations. These include 2 variants on a "calculator" which can evaluate expressions like 10 * (2-4) and a fully functional JSON parser.

CLI Usage

The CLI invocation syntax is as follows

tspeg <grammar-file> [output-file]

This generates a TypeScript ES6 module that exports a parse function, as well as classes that represent your AST. If [output-file] is omitted the parser is written to STDOUT.

Run tspeg --help to see usage and available flags.

Flags supported:

Parser Usage

The generated module exports a parse function that accepts an input string, and returns a ParseResult object like this:

class ParseResult {
    ast: START | null;
    errs: SyntaxErr[];
}

export function parse(input: string): ParseResult {
    ...
}

If the errs field is non-empty, then syntax errors were found, otherwise the AST is stored in the ast field. (START will be replaced with the type of your starting grammar rule).

Grammar Syntax

tsPEG grammars are specified with a simple syntax, similar to the classic EBNF syntax.

Grammars are composed of a sequence of rules, each rule defines what text that it should match, using regex literals, names of rules, and powerful operators like | for choice.

Each rule is defined by a name, followed by a := sign and then a "rule description". Rule descriptions are easiest seen by example here:

hello := 'Hello '
helloWorld := hello 'World'
helloChoice := hello 'Mars' | helloWorld

tsPEG starts parsing with the first rule in the grammar so to match the helloChoice rule we should either move it to the start or defined a start rule that points to it like this:

start := helloChoice
hello := 'Hello '
helloWorld := hello 'World'
helloChoice := hello 'Mars' | helloWorld

Putting this grammar into a file called "grammar.peg" and running tspeg grammar.peg parser.ts we get our parser for this simple grammar in the file "parser.ts". We can compile this (target at least es2015) and load it into NodeJS to test it.

Example 1

As you can see we get no errors for 'Hello World', and 'Hello Mars', this means the match was successful. But when we try 'Hello Jupiter' we get do get an error in the error list. It lists the location of the error, and the expected matches at that location, namely it expected to have a regex match for 'Mars' or 'World'. You can see more about errors in the Syntax Error section.

Notably the ast field of the parse result is empty for 'Hello World' and 'Hello Mars', it only contains a kind field that tell us which rule was matched. (See more on kinds). This is because we haven't told tsPEG what elements of the grammar we would like to be returned. When we write a rule definition, we can specify the fields we'd like to save in the AST by assigning them with an = sign. For example, say we want to match a string that's the sum of 2 numbers, e.g. "2+3", "123+456", we'd like to save both side of the sum in our AST, we can do this by writing a grammar like the following. Note that we had to write '\+' to escape the '+' symbol as the '+' symbol has special meaning in regex.

sum := left=num '\+' right=num
num := '[0-9]+'

Running tsPEG on this input and testing the parser in node we can see that the parser result has saved the left and right numbers of our sum. Note that we didn't have to assign the result of the [0-9]+ rule in the num rule. This is because rules that are just references to other rules are saved implicitly.

Sum example

Now that we have saved the numbers of our sum, it's easy to write a program to compute the result of adding the numbers. Each rule that assigns results to the AST is exported as a class or interface with the same name as the rule, so we can just import the sum interface from the parser to write our function. For this grammar we get an interface like:

interface sum {
    left: string;
    right: string;
}

Allowing us to write our calculator function like:

import { sum } from "./parser";

export function add(ast: sum): number {
    return parseInt(ast.left) + parseInt(ast.right);
}

Calling this file "calc.ts" we can use it to calculate our sums:

Calc screenshot

We can also use computed properties to calculate the result of the sum during the parsing process.

Operators

tsPEG supports a set of powerful operators to build complex grammars. The only operator we have seen so far is the | operator, which allows us to make choices between two rule expressions, however there are many more.

Sub-rules

Inline sub-rules can be specified in tsPEG. These use the { and } brackets for grouping, and allow you to write smaller rules inline in a larger rule. For example

rule := 'start' {some optional part}? 'finish'

We want the part in the middle to be optional, so we wrap it in { and } and apply the ? operator. The {} brackets create a sub-rule. Note that if you want to assign the subrules values to the tree you do need to assign a name to it e.g. rule := 'start' middle={some optional part}? 'finish'.

Syntax Errors

A SyntaxErr object is composed of two fields, a pos field with the position of the error, and expmatches which contains a list of expected matches. It also supports a toString method to return a readable text error message. (In production though it is recommended to use a more ad-hoc custom method).

Matches can be either a RegexMatch, or an EOFMatch. These can be disambiguated by checking the kind field (not the same as the AST kind fields). Each type contains a field "negated" denoting whether the match was expected to be successful or unsuccessful (e.g. if the ! operator was used).

RegexMatch objects are for attempts at matching regex literals, these are the most common. EOFMatch objects are for attempts at matching the EOF ($) match (See the section on EOF matching).

The definitions for each are below:

See details of the PosInfo object in the Position Tracking section..

interface RegexMatch {
    kind: "RegexMatch";
    negated: boolean;
    literal: string;
}

type EOFMatch = { kind: "EOF"; negated: boolean };

type MatchAttempt = RegexMatch | EOFMatch;

interface PosInfo {
    overallPos: number;
    line: number;
    offset: number;
}

class SyntaxErr {
    public pos: PosInfo;
    public expmatches: MatchAttempt[];
    public toString(): string {
        // ...
    }
}

Comments

C-style comments can be added to grammar spec files. Anything on a line after a // is a comment, and is ignored by the generator.

Regex Modifiers

You can add one of 4 regex modifiers to any regex literal to change their behaviour. These modifiers are:

To add a regex modifier simply type it immediately after the regex literal: e.g. 'a'i matches both 'a' and 'A'. You can specify multiple modifiers e.g. '^select$'im enables case insensitivity and multiline mode.

Computed Properties

As well as assigning parsing results to variables and storing them on the AST, tsPEG also allows you to create computed properties, which are fields on the AST that are computed when the parser is run.

Computed properties are added to a rule by appending a new expression after the rule description like

.<propertyname> = <type> { <code to calculate property> }

Returning to our sum example from earlier we had a grammar to match strings like "2+3", "40+200" etc.

sum := left=num '\+' right=num
num := '[0-9]+'

We can add computed properties to this to compute the value of this sum at parse time, instead of writing our own function to do it after. First we add a computed property to num to store the value of the number:

sum := left=num '\+' right=num
num := literal='[0-9]+'
       .value = number { return parseInt(this.literal); }

As you can see we've assigned the text match to the field literal, and added a computed property called value which is the literal field parsed as an integer.

Now we can add a computed property to sum to do the arithmetic, this is very simple

sum := left=num '\+' right=num
       .sum = number { return this.left.value + this.right.value }
num := literal='[0-9]+'
       .value = number { return parseInt(this.literal); }

We use the computed property of the num rule to compute our sum property. Let's see this in action! Let's try and parse the string "240+100".

Computed property

As you can see the AST has a field called sum with the correct value of 340 in it. The left and right also have their computed value properties of 240 and 100.

Header

The introduction of computed properties means that now you might want to import some types or functions into your parser. Luckily you can specify a header in the file that will be inserted directly into the generated parser. Anything at the top of the grammar file between two lines of three dashes --- will be inserted straight into the generated parser, allowing you to write grammars like

---
import { myFunc, myType } from "./mypackage";
---
rule := hello='Hello World'
        .value = myType { return myFunc(this.hello); }

Another possible use for the header is for suppressing linting errors as in #51. I would recommend excluding any tsPEG generated files from linting directly through the linter config (e.g. .eslintignore) but if you choose you can instead disable specific linting rules using the header.

Kind Checking

tsPEG uses discriminated unions to distinguish between types of AST nodes. Each AST node type has a field called kind which contains some value from the ASTKinds enum. This kind field can be used to differentiate between AST results.

Example

Choice := Word | Int
Word   := word='[a-z]+'
Int    := val='[0-9]+'

When writing a function to process the parse tree for this grammar, you can check if kind === ASTKinds.Word or kind === ASTKinds.Int to see which rule was matched.

import { Choice, Parser, ASTKinds } from  "./parser";

function makeChoice(c: Choice) {
    if(c.kind === ASTKinds.Word) {
        console.log('Matched word ', c.word)
    } else {
        console.log('Matched int ', c.val)
    }
}

Kind names

The names of the ASTKinds enum entries vary.

Position tracking

tsPEG supports a special match expression "@" for storing the positions of matches. Using @ you can store the current position of the parser.

The @ expression returns a PosInfo object:

export interface PosInfo {
    // overallPos is the index of the input string
    readonly overallPos: number;

    // line is the line of the input
    readonly line: number;

    // offset is the number of characters from the
    // start of the line
    readonly offset: number;
}

Example

Returning to our addition of two numbers from earlier, which matches "12+56", "123+456" etc.

sum := left=num '\+' right=num
num := '[0-9]+'

If we want to store the location of the "+" operator we can use the @ match like this:

sum := left=num pluspos=@ '\+' right=num
num := '[0-9]+'

This stores the position of the parser into the AST before the '+' symbol, the generated interface for the AST looks like:

interface sum {
    left: string
    pluspos: PosInfo
    right: string
}

EOF Matching

tsPEG supports another special match expression $ that matches the end of the input (traditionally referred to as EOF for End-Of-File).

The $ match will succeed only if we are at the end of the input. This can be used to ensure that all of the input is consumed.

Example

Consider this grammar:

hello := 'Hello World'

This grammar will successfully match the string "Hello World" as expected, but it will also match the string "Hello World, Goodbye Moon". It will match the "Hello World" at the start of sentence and not try to go any further.

No EOF Symbol

To fix this issue we can use the $ symbol to require that the parser only succeeds if the EOF has been reached.

hello := 'Hello World' $

If the EOF is not matched by the parser you will get an EOFMatch object in your expected matches in the returned SyntaxErr.

With EOF Symbol

Memoisation

If your generated parser speed is slow, it might be due to excessive backtracking being required to parse your input.

Memoisation can be enabled to solve this problem. By passing the flag --enable-memo to tsPEG, your generated parser will cache partial parsing results as it goes, which can drastically improve parse times, at the expense of increasing the parsers memory usage.