egraphs-good / egglog

egraphs + datalog!
https://egraphs-good.github.io/egglog/
MIT License
458 stars 54 forks source link

Reimplement parser #450

Closed Alex-Fischman closed 3 weeks ago

Alex-Fischman commented 1 month ago

This PR implements a fully-Rust parser, which adds better error messages. This has the additional benefits of reducing dependencies and reducing variance in small benchmarks.

codspeed-hq[bot] commented 1 month ago

CodSpeed Performance Report

Merging #450 will not alter performance

Comparing Alex-Fischman:fast-parser (4c28061) with main (1f4603a)

Summary

✅ 8 untouched benchmarks

yihozhang commented 1 month ago

Thanks for this heroic effort! This is amazing! Two thoughts:

Alex-Fischman commented 1 month ago
  • Parser combinator is known to be asymptotically slower than bottom-up parsers like what lalrpop provides. However here we see speedups for several small to medium programs. Have you compared the two parsers with larger programs and see their performance?

These two commits speed up the slowest part of the combinator approach, which are the large multi-way choices. It fixes this by branching on the name of the command/action/schedule, which lets us skip attempting to parse them.

Alex-Fischman commented 1 month ago

This parser still doesn't have great error messages, but they are better in some cases. I'm marking as ready for review to get more eyes. It's at the point where I'd like to merge and incrementally improve error messages as bad cases come up (which we can do now because it's all in Rust).

thaliaarchi commented 4 weeks ago

It seems to me that this would be vastly simplified by parsing to a general S-expression structure, then resolving commands according to their identifiers. Parsing and semantic analysis would be decoupled and both would become more consistent.

For parsing: The grammar for a single S-expression would be something like sexp ::= "(" WS* IDENT (WS+ (":" IDENT WS+)? sexp)* ")" | literal (but with proper handling of whitespace; e.g., required between ident and integer literal, but not ident and paren). The code duplication for similar patterns is then eliminated and combinators wouldn't be necessary.

For semantic analysis: Each command would register its signature (e.g., arity, types, optional arguments, or whatever needed), then they'd be resolved by identifier. Since they'd be parsed to a single structure with spans, shared across all commands, then error reporting has far fewer cases. This would allow more invalid programs to be parsed, where, for example, the only problem is in arities or names of optional arguments, and the correct signature could be reported in the error.

If there is no lookahead, each top-level command could be parsed independently with allocations recycled, so the AST for the entire program does not need to be generated.

Alex-Fischman commented 4 weeks ago

I agree that parsing to S-expressions first does almost certainly improve the error reporting somewhat, although I'm not sure how much of a win it actually is; currently error reporting is pretty simple.

I'm not motivated enough to write two parsers for the same language in one week, but if someone does want to take a shot at it you can use my toy parser from a year ago as a reference/baseline/starting point. It parses a subset of egglog but the basic ideas are there. (Slice is just Span.)

I do think that merging this PR makes that job slightly easier, since this PR consolidates all of the parsing in one place and does some related cleanup (like removing Quote).

Alex-Fischman commented 4 weeks ago

An additional related Zach thought to put on people's radar: an egglog fuzzer would be nice, but it's the sort of thing that's perpetually low priority until it gets done.