Switch to Chumsky for parsing

hardbyte commented 1 year ago

Benefits of using chumsky for parsing:

Easier to read and modify than LALRPOP grammar
Much better error reporting and syntax assistance

High level plan:

[x] Migrate the parser in hardbyte/common-expression-language to use the same AST as this repository.
Implement parser for anything missing in https://github.com/hardbyte/common-expression-language
- [x] recursive members
- [x] &&, ||, in and ternary operations
- [x] Field Inits - https://github.com/hardbyte/common-expression-language/pull/8
- [x] byte strings - https://github.com/hardbyte/common-expression-language/pull/7
- [x] primitive type functions. e.g. '1'.double() or 10.double()
[ ] Resolve any remaining differences between the existing LALRPOP parser and chumsky one.
- [x] Switch this repo's integer implementations to 64bit. https://github.com/clarkmcc/cel-rust/pull/16
- [ ] Remove the DoubleMinus and DoubleNot. Should be e.g. Unary(Negative, Unary(Negative, member)). https://github.com/clarkmcc/cel-rust/pull/17
- [ ] others ?
[x] a small refactor PR in this repository to hide larlpop internals and expose a parser function and parser errors. https://github.com/clarkmcc/cel-rust/pull/12
[x] more parser tests to increase confidence that a replaced parser doesn't break anything. https://github.com/clarkmcc/cel-rust/pull/13
[ ] PR to add the chumsky parser as an option perhaps not exposed by default or an optional opt-in?
[ ] Improvements to error reporting from parser - consider https://github.com/zesterer/ariadne
[ ] Compare and decide whether to make chumsky parser the default/only parser

Do you want to keep both parsers? If so, how should the API work to pick between them? Assume it wouldn't be too tricky to add unsigned ints and un-escaped strings to the current lalrpop version?

clarkmcc commented 1 year ago

Do you want to keep both parsers?

I'd like to benchmark both and then decide from there. If we need to support both we can cross that bridge. I like the idea of only supporting one parser if possible so it might be worth a performance audit of the chumsky parser if it's significantly different then LALRPOP.

hardbyte commented 1 year ago

Yeah that sounds like a good idea. If it isn't too hard to process textproto files from rust we could try benchmarking/testing with the test data.

clarkmcc commented 1 year ago

Good idea. I think we can use this function to parse those files. I'll need to do some digging to get the full schema of those types, but that shouldn't be difficult to find.

hardbyte commented 1 year ago

I'll make another PR to remove the DoubleMinus and DoubleNot, and then I'm down to 2 failing tests with the chumsky parser. 🤞🏼

inikolaev commented 1 year ago

Somewhat related to testing topic mentioned here I tried to reuse Gerkin files from here to run some conformance tests and it works pretty well using the cucumber-rs crate.

I had to make some changes expected output in the feature files, but it more or less works.

There are some issues with recognising hex and unsigned integers, so I had to adjust the grammar to be able to parse them correctly:

-    r"-?[0-9]+" => Atom::Int(<>.parse().unwrap()),
-    r"-?0[xX]([0-9a-fA-F]+)" => Atom::Int(i64::from_str_radix(<>, 16).unwrap()),
-    r"-?[0-9]+ [uU]" => Atom::UInt(<>.parse().unwrap()),
-    r"-?0[xX]([0-9a-fA-F]+) [uU]" => Atom::UInt(u64::from_str_radix(<>, 16).unwrap()),
+    r"[-+]?[0-9]+" => Atom::Int(<>.parse().unwrap()),
+    r"([-+])?0[xX]([0-9a-fA-F]+)" => {
+        // We cannot extract capture groups, see https://github.com/lalrpop/lalrpop/issues/575
+        let re = Regex::new(r"([-+])?0[xX]([0-9a-fA-F]+)").unwrap();
+        let captures = re.captures(<>).unwrap();
+
+        let sign = match captures.get(1) {
+            Some(v) => v.as_str(),
+            _ => "",
+        };
+
+        Atom::Int(
+            i64::from_str_radix(
+                format!("{}{}", sign, captures.get(2).unwrap().as_str()).as_str(), 16
+            ).unwrap()
+        )
+    },
+    r"([0-9]+)[uU]" => {
+        // We cannot extract capture groups, see https://github.com/lalrpop/lalrpop/issues/575
+        let re = Regex::new(r"([0-9]+)[uU]").unwrap();
+        let captures = re.captures(<>).unwrap();
+
+        Atom::UInt(captures.get(1).unwrap().as_str().parse().unwrap())
+    },
+    r"0[xX]([0-9a-fA-F]+)[uU]" => {
+        // We cannot extract capture groups, see https://github.com/lalrpop/lalrpop/issues/575
+        let re = Regex::new(r"0[xX]([0-9a-fA-F]+)[uU]").unwrap();
+        let captures = re.captures(<>).unwrap();
+
+        Atom::UInt(u64::from_str_radix(captures.get(1).unwrap().as_str(), 16).unwrap())
+    },

But there are other issues as well related to parsing unicode codes like \U0001f431, but maybe there's an existing library which can handle those.

The code for the test itself is pretty simple, but it does not support all features yet:

use cel_interpreter::{Context, Program};
use cucumber::{given, then, when, World};

// `World` is your shared, likely mutable state.
// Cucumber constructs it via `Default::default()` for each scenario.
#[derive(Debug, Default, World)]
pub struct CelWorld {
    expression: String,
}

#[when(expr = "CEL expression \"{word}\" is evaluated")]
fn expression_evaluated_with_double_quotes(world: &mut CelWorld, expression: String) {
    world.expression = expression;
}

#[when(expr = "CEL expression '{word}\' is evaluated")]
fn expression_evaluated_with_single_quotes(world: &mut CelWorld, expression: String) {
    world.expression = expression;
}

#[then(regex = "value is (.*)")]
fn evaluation_result_is(world: &mut CelWorld, expected_result: String) {
    let program = Program::compile(&world.expression).unwrap();
    let mut context = Context::default();
    let result = program.execute(&context);

    assert_eq!(expected_result, format!("{:?}", result));
}

// This runs before everything else, so you can setup things here.
fn main() {
    // You may choose any executor you like (`tokio`, `async-std`, etc.).
    // You may even have an `async` main, it doesn't matter. The point is that
    // Cucumber is composable. :)
    futures::executor::block_on(CelWorld::run("tests/features"));
}

and the output looks like this

Feature: basic
  Scenario: self_eval_int_zero
   ✔  When CEL expression "0" is evaluated
   ✔  Then value is Ok(Int(0))
  Scenario: self_eval_uint_zero
   ✔  When CEL expression "0u" is evaluated
   ✔  Then value is Ok(UInt(0))
  Scenario: self_eval_float_zero
   ✔  When CEL expression "0.0" is evaluated
   ✔  Then value is Ok(Float(0.0))
  Scenario: self_eval_float_zerowithexp
   ✔  When CEL expression "0e+0" is evaluated
   ✔  Then value is Ok(Float(0.0))
  Scenario: self_eval_string_empty
   ✔  When CEL expression "''" is evaluated
   ✔  Then value is Ok(String(""))
  Scenario: self_eval_string_empty_quotes
   ✔  When CEL expression '""' is evaluated
   ✔  Then value is Ok(String(""))
  Scenario: self_eval_bytes_empty
   ✔  When CEL expression 'b""' is evaluated
   ✔  Then value is Ok(Bytes([]))
  Scenario: self_eval_bool_false
   ✔  When CEL expression "false" is evaluated
   ✔  Then value is Ok(Bool(false))
  Scenario: self_eval_null
   ✔  When CEL expression "null" is evaluated
   ✔  Then value is Ok(Null)
  Scenario: self_eval_empty_list
   ✔  When CEL expression "[]" is evaluated
   ✔  Then value is Ok(List([]))
  Scenario: self_eval_empty_map
   ✔  When CEL expression "{}" is evaluated
   ✔  Then value is Ok(Map(Map { map: {} }))
  Scenario: self_eval_int_nonzero
   ✔  When CEL expression "42" is evaluated
   ✔  Then value is Ok(Int(42))
  Scenario: self_eval_uint_nonzero
   ✔  When CEL expression "123456789u" is evaluated
   ✔  Then value is Ok(UInt(123456789))
  Scenario: self_eval_int_negative_min
   ✔  When CEL expression "-9223372036854775808" is evaluated
   ✔  Then value is Ok(Int(-9223372036854775808))
  Scenario: self_eval_float_negative_exp
   ✔  When CEL expression "-2.3e+1" is evaluated
   ✔  Then value is Ok(Float(-23.0))
  Scenario: self_eval_string_excl
   ✔  When CEL expression '"!"' is evaluated
   ✔  Then value is Ok(String("!"))
  Scenario: self_eval_string_escape
   ✔  When CEL expression "'\''" is evaluated
   ✘  Then value is Ok(String("'"))
      Step failed:
      Defined: tests/features/basic.feature:125:5
      Matched: interpreter/tests/cel.rs:21:1
      Step panicked. Captured output: assertion failed: `(left == right)`
        left: `"Ok(String(\"'\"))"`,
       right: `"Ok(String(\"\\\\'\"))"`
  Scenario: self_eval_bytes_escape
   ✔  When CEL expression "b'ÿ'" is evaluated
   ✔  Then value is Ok(Bytes([195, 191]))
  Scenario: self_eval_bytes_invalid_utf8
   ✔  When CEL expression "b'\000\xff'" is evaluated
   ✘  Then value is BytesType(source=b'\x00\xff')
      Step failed:
      Defined: tests/features/basic.feature:139:5
      Matched: interpreter/tests/cel.rs:21:1
      Step panicked. Captured output: assertion failed: `(left == right)`
        left: `"BytesType(source=b'\\x00\\xff')"`,
       right: `"Ok(Bytes([92, 48, 48, 48, 92, 120, 102, 102]))"`
  Scenario: self_eval_list_singleitem
   ✔  When CEL expression "[-1]" is evaluated
   ✔  Then value is Ok(List([Int(-1)]))
  Scenario: self_eval_map_singleitem
   ✔  When CEL expression '{"k":"v"}' is evaluated
   ✔  Then value is Ok(Map(Map { map: {String("k"): String("v")} }))
  Scenario: self_eval_bool_true
   ✔  When CEL expression "true" is evaluated
   ✔  Then value is Ok(Bool(true))
  Scenario: self_eval_int_hex
   ✔  When CEL expression "0x55555555" is evaluated
   ✔  Then value is Ok(Int(1431655765))
  Scenario: self_eval_int_hex_negative
   ✔  When CEL expression "-0x55555555" is evaluated
   ✔  Then value is Ok(Int(-1431655765))
  Scenario: self_eval_uint_hex
   ✔  When CEL expression "0x55555555u" is evaluated
   ✔  Then value is Ok(UInt(1431655765))
  Scenario: self_eval_unicode_escape_four
   ✔  When CEL expression '"\u270c"' is evaluated
   ✘  Then value is Ok(String("✌"))
      Step failed:
      Defined: tests/features/basic.feature:188:5
      Matched: interpreter/tests/cel.rs:21:1
      Step panicked. Captured output: assertion failed: `(left == right)`
        left: `"Ok(String(\"✌\"))"`,
       right: `"Ok(String(\"\\\\u270c\"))"`
  Scenario: self_eval_unicode_escape_eight
   ✔  When CEL expression '"\U0001f431"' is evaluated
   ✘  Then value is Ok(String("🐱"))
      Step failed:
      Defined: tests/features/basic.feature:195:5
      Matched: interpreter/tests/cel.rs:21:1
      Step panicked. Captured output: assertion failed: `(left == right)`
        left: `"Ok(String(\"🐱\"))"`,
       right: `"Ok(String(\"\\\\U0001f431\"))"`
  Scenario: self_eval_ascii_escape_seq
   ✔  When CEL expression '"\a\b\f\n\r\t\v\"\'\\"' is evaluated
   ✘  Then value is String(\x07\x08\x0c\n\r\t\x0b"\'\\)
      Step failed:
      Defined: tests/features/basic.feature:202:5
      Matched: interpreter/tests/cel.rs:21:1
      Step panicked. Captured output: assertion failed: `(left == right)`
        left: `"String(\\x07\\x08\\x0c\\n\\r\\t\\x0b\"\\'\\\\)"`,
       right: `"Ok(String(\"\\\\a\\\\b\\\\f\\\\n\\\\r\\\\t\\\\v\\\\\\\"\\\\'\\\\\\\\\"))"`
  Scenario: self_eval_bound_lookup
   ?  Given type_env parameter "x" is TypeType(value='INT64')
      Step skipped: tests/features/basic.feature:211:4
  Scenario: self_eval_unbound_lookup
   ✔  When CEL expression "x" is evaluated
   ?  Then eval_error is "undeclared reference to 'x' (in container '')"
      Step skipped: tests/features/basic.feature:225:5
  Scenario: unbound_is_runtime_error
   ?  When CEL expression "x || true" is evaluated
      Step skipped: tests/features/basic.feature:230:5
  Scenario: binop
   ?  When CEL expression "1 + 1" is evaluated
      Step skipped: tests/features/basic.feature:240:5
  Scenario: unbound
   ✔  When CEL expression "f_unknown(17)" is evaluated
   ?  Then eval_error is 'unbound function'
      Step skipped: tests/features/basic.feature:249:5
  Scenario: unbound_is_runtime_error
   ?  When CEL expression "f_unknown(17) || true" is evaluated
      Step skipped: tests/features/basic.feature:254:5
  Scenario: false
   ?  Given type_env parameter "false" is TypeType(value='BOOL')
      Step skipped: tests/features/basic.feature:265:4
  Scenario: true
   ?  Given type_env parameter "true" is TypeType(value='BOOL')
      Step skipped: tests/features/basic.feature:278:4
  Scenario: null
   ?  Given type_env parameter "null" is TypeType(value='BOOL')
      Step skipped: tests/features/basic.feature:291:4
[Summary]
1 feature
37 scenarios (23 passed, 9 skipped, 5 failed)
67 steps (53 passed, 9 skipped, 5 failed)

clarkmcc commented 1 year ago

@inikolaev this is fantastic! Definitely makes sense to incorporate into the project. I took a look at the proto-based tests but read an announcement from the maintainer about deprecating the protobuf dependency so I stopped pursuing that. This looks great however.

clarkmcc / cel-rust

Switch to Chumsky for parsing #14