lelwel

Resilient LL(1) parser generator for Rust, dual-licensed under MIT / Apache-2.0.


Introduction

Lelwel (Language for Extended LL(1) parsing With Error resilience and Lossless syntax trees) generates recursive descent parsers for Rust from LL(1) grammars, with extensions for direct left recursion, operator precedence, semantic predicates (which also enable arbitrary lookahead), and semantic actions (which make it possible to handle semantic context sensitivity, e.g. the type / variable name ambiguity in C).

The parser creates a homogeneous, lossless, concrete syntax tree (CST) that can be used to construct an abstract syntax tree (AST). Certain patterns are detected to avoid CST nodes for rules that only forward to other rules. Bindings can be defined in regexes to rename the CST node for certain parses.
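The idea of a homogeneous, lossless CST can be illustrated with a minimal sketch. These types are purely illustrative and are not lelwel's actual API: every node carries a single kind tag, and the children preserve all source text (including punctuation and whitespace), so the original input can be reconstructed exactly.

```rust
// Illustrative sketch of a homogeneous, lossless CST (not lelwel's real types).
#[derive(Debug)]
enum Child {
    Token(String),
    Node(Node),
}

#[derive(Debug)]
struct Node {
    kind: &'static str,
    children: Vec<Child>,
}

impl Node {
    /// Lossless means concatenating all tokens reproduces the input text.
    fn text(&self) -> String {
        self.children
            .iter()
            .map(|c| match c {
                Child::Token(t) => t.clone(),
                Child::Node(n) => n.text(),
            })
            .collect()
    }
}

fn main() {
    // Hypothetical CST for the input "let x = 1;" under a stmt_let rule.
    let cst = Node {
        kind: "stmt_let",
        children: vec![
            Child::Token("let ".into()),
            Child::Token("x ".into()),
            Child::Token("= ".into()),
            Child::Node(Node {
                kind: "expr_literal",
                children: vec![Child::Token("1".into())],
            }),
            Child::Token(";".into()),
        ],
    };
    assert_eq!(cst.text(), "let x = 1;");
    println!("{}", cst.text());
}
```

Because the tree is homogeneous (one node type for all rules), generic tooling such as tree printers and syntax highlighters can traverse it without knowing the grammar.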

The error recovery and tree construction are inspired by Alex Kladov's (matklad) Resilient LL Parsing Tutorial. Lelwel uses a (to my knowledge) novel heuristic to automatically calculate recovery sets: it uses the follow sets of a rule's dominators in the directed graph induced by the grammar.
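The dominator part of that heuristic can be sketched generically. This is not lelwel's implementation; it is the textbook iterative data-flow computation of dominator sets on a small hand-made "grammar graph" (rules as nodes, rule references as edges). A rule D dominates rule R if every path from the start rule to R passes through D, so the follow sets of R's dominators are tokens that some enclosing context can always continue with.

```rust
use std::collections::BTreeSet;

/// Classic iterative dominator computation:
/// dom(start) = {start}; dom(n) = {n} ∪ ⋂ dom(p) over all predecessors p.
fn dominators(
    num_nodes: usize,
    edges: &[(usize, usize)],
    start: usize,
) -> Vec<BTreeSet<usize>> {
    let mut preds = vec![Vec::new(); num_nodes];
    for &(a, b) in edges {
        preds[b].push(a);
    }
    let all: BTreeSet<usize> = (0..num_nodes).collect();
    let mut dom = vec![all; num_nodes];
    dom[start] = BTreeSet::from([start]);
    let mut changed = true;
    while changed {
        changed = false;
        for n in 0..num_nodes {
            if n == start {
                continue;
            }
            // Intersect the dominator sets of all predecessors.
            let mut new: Option<BTreeSet<usize>> = None;
            for &p in &preds[n] {
                new = Some(match new {
                    None => dom[p].clone(),
                    Some(s) => s.intersection(&dom[p]).cloned().collect(),
                });
            }
            let mut new = new.unwrap_or_default();
            new.insert(n); // every node dominates itself
            if new != dom[n] {
                dom[n] = new;
                changed = true;
            }
        }
    }
    dom
}

fn main() {
    // Toy grammar graph: 0=file 1=fn 2=block 3=stmt 4=expr,
    // with nesting edge stmt -> block.
    let edges = [(0, 1), (1, 2), (2, 3), (3, 4), (3, 2)];
    let dom = dominators(5, &edges, 0);
    // Every path from `file` to `expr` passes through fn, block, and stmt,
    // so all of them dominate expr.
    assert_eq!(dom[4], BTreeSet::from([0, 1, 2, 3, 4]));
    println!("dominators of expr: {:?}", dom[4]);
}
```

In this toy graph, the follow sets of `fn`, `block`, and `stmt` would all contribute to the recovery set of `expr`, since any of their follow tokens marks a point where an enclosing rule can resume.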

Lelwel is written as a library. It is used by the CLI tool llw, the language server lelwel-ls, and can be included as a build dependency in order to be called from a build.rs file. There is a plugin for Neovim that uses the language server.

By default the generated parser uses Logos for lexing and Codespan for diagnostics; however, this is not mandatory.

Why Yet Another Parser Generator?

Why LL(1) and not a more general CFL or PEG parser?

Grammar Examples

The parser for lelwel grammar files (*.llw) is itself generated by lelwel. There are also examples for C without a preprocessor (actually resolves ambiguity with semantic context information, unlike examples for ANTLR4 and Tree-sitter), Lua, arithmetic expressions, JSON, and Oberon-0.

You can try out examples in the Lelwel Playground.

The following example shows a grammar for the toy language "L" introduced by the Resilient LL Parsing Tutorial.

token Fn='fn' Let='let' Return='return' True='true' False='false';
token Arrow='->' LPar='(' RPar=')' Comma=',' Colon=':' LBrace='{' RBrace='}'
      Semi=';' Asn='=' Plus='+' Minus='-' Star='*' Slash='/';
token Name='<name>' Int='<int>';
token Whitespace;

skip Whitespace;

start file;

file: fn*;
fn: 'fn' Name param_list ['->' type_expr] block;
param_list: '(' [param (?1 ',' param)* [',']] ')';
param: Name ':' type_expr;
type_expr: Name;
block: '{' stmt* '}';
stmt:
  stmt_expr
| stmt_let
| block
| stmt_return
;
stmt_expr: expr ';';
stmt_let: 'let' Name '=' expr ';';
stmt_return: 'return' [expr] ';';
expr: expr_bin;
expr_bin:
  expr_bin ('*' | '/') expr_bin
| expr_bin ('+' | '-') expr_bin
| expr_call
;
expr_call:
  expr_call arg_list
| expr_literal
| expr_name
| expr_paren
;
arg_list: '(' [expr (?1 ',' expr)* [',']] ')';
expr_literal: Int | 'true' | 'false';
expr_name: Name;
expr_paren: '(' expr ')';

Quickstart

  1. Write a grammar file and place it in the src directory of your crate. Optionally you can install the CLI or language server to validate your grammar file: cargo install --features=cli,lsp lelwel.
  2. Add the following to your Cargo.toml and build.rs files.

    # Cargo.toml
    [dependencies]
    logos = "0.14.0"
    codespan-reporting = "0.11.1"

    [build-dependencies]
    lelwel = "0.6.2"

    // build.rs
    fn main() {
        lelwel::build("src/your_grammar.llw");
    }
  3. Start a build. This creates a parser.rs file next to your grammar file. The parser.rs file is meant to be edited manually to implement the lexer; it includes the actual generated parser (generated.rs), which is written to the Cargo OUT_DIR. If you change the grammar after parser.rs has been generated, you may need to manually update the Token enum or the Parser impl for semantic predicates and actions.
  4. Use the parser module with the following minimal main.rs file for printing the CST and diagnostics.

    mod parser;
    
    use codespan_reporting::files::SimpleFile;
    use codespan_reporting::term::termcolor::{ColorChoice, StandardStream};
    use codespan_reporting::term::{self, Config};
    use logos::Logos;
    use parser::*;
    
    fn main() -> std::io::Result<()> {
       let args: Vec<String> = std::env::args().collect();
       if args.len() != 2 {
           std::process::exit(1);
       }
       let source = std::fs::read_to_string(&args[1])?;
       let mut diags = vec![];
       let (tokens, ranges) = tokenize(Token::lexer(&source), &mut diags);
       let cst = Parser::parse(&source, tokens, ranges, &mut diags);
       println!("{cst}");
       let writer = StandardStream::stderr(ColorChoice::Auto);
       let config = Config::default();
       let file = SimpleFile::new(&args[1], &source);
       for diag in diags.iter() {
           term::emit(&mut writer.lock(), &config, &file, diag).unwrap();
       }
       Ok(())
    }

Grammar Specification

Lelwel grammars are based on the formalism of context-free grammars (CFGs) and more specifically LL(1) grammars. There are certain extensions to the classical grammar syntax, such as constructs similar to those of EBNF.

A grammar file consists of top-level definitions, which are independent of their order.

Token List

A token list definition introduces a list of tokens (terminals) to the grammar. It starts with the token keyword, ends with a ;, and contains a list of token names and corresponding token symbols.

A token name must start with a capital letter. The token symbol is optional and delimited by single quotation marks. It is used in error messages and by the generator of the parser.rs file. In a regex a token can be referenced by its name or its symbol.

[!TIP] If the token symbol string starts with < and ends with >, the token is interpreted as a class of tokens for which the symbol is only a description. This influences how error messages and lexer rules are generated by default in parser.rs.

Example

token MyKeyword='my_keyword' Int='<integer literal>' True='true' False='false';

Rule

A grammar rule must start with a lower case letter. A regular expression is used to specify the right hand side of the rule.

The following special regex patterns for rules are recognized.

Regular Expressions

Regular expressions are built from the following syntactic constructs.

Start

A start definition specifies the start rule of the grammar. There must be exactly one start definition in a grammar. The start rule must not be referenced in a regex.

Example

start translation_unit;

Skip

A skip definition specifies a list of tokens that are ignored by the parser. These tokens are, however, still part of the syntax tree.

Example

skip Whitespace Comment;

Right

A right definition specifies a list of tokens that are treated as right-associative operators in operator precedence rules.

Example

right '^' '=';

License

Lelwel, its examples, and its generated code are licensed under either of

  - Apache License, Version 2.0
  - MIT License

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.