aws / event-ruler

Event Ruler is a Java library that allows matching many thousands of Events per second to any number of expressive and sophisticated rules.
Apache License 2.0

Support events in AVRO and other formats by supporting jackson-dataformats-binary #36

Open aymkhalil opened 2 years ago

aymkhalil commented 2 years ago

What is your idea?

Support pattern matching on AVRO events. AVRO support is a reasonable next step because:

  1. It has wide adoption in streaming systems like Pulsar and Kafka. Any streaming system could use not only the performance characteristics of ruler (which are much needed) but also the semantics of pattern matching.
  2. It would serve as a good reference implementation and stepping stone for other binary formats (or any events with a formal schema, for that matter).

Would you be willing to make the change?

Maybe


timbray commented 2 years ago

This is a great idea. The thing to watch out for is the performance of turning Avro messages into field name/value pairs; with Jackson, this is the most expensive part of the matching task. Note the comments on https://www.tbray.org/ongoing/When/202x/2021/12/03/Filtering-Lessons where people say that it should be possible to do this very efficiently. Ideally we would have a Jackson-compatible tokenizer, which would allow for a lot of code re-use. Hey, check this out: https://github.com/FasterXML/jackson-dataformats-binary. This could be the basis for killing a lot of birds with one stone.

aymkhalil commented 2 years ago

Thanks @timbray for the implementation hints! It's promising that binary format support could be added without compromising performance.

baldawar commented 2 years ago

I like the idea of supporting jackson-dataformats-binary. Updated the title to highlight this. Code-change wise, I think this'll mean a new method in Event.java that creates a JsonParser [1] for the right format. We then bubble up the interface to the various places clients can use (GenericMachine, Ruler). Then we also need to add test cases + benchmarks.

timbray commented 2 years ago

Just need to be sure that this supports the nextToken() API, not just ObjectMapper. Also need to watch out for schemas: most of these binary formats can't be parsed without access to a schema, so we will need to cache schemas where possible. CBOR doesn't need a schema. Avro relies on a 4-byte field in the Kafka wire format header that identifies the schema. Which is to say, this feature is going to need some design thinking, even if the implementation isn't that hard.
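
The schema-caching point could look roughly like the sketch below. It assumes the Confluent-style framing (one zero "magic" byte followed by a 4-byte big-endian schema ID) is what the 4-byte header field refers to. `SchemaCache` and its `loader` are hypothetical names for illustration, not part of Ruler or Jackson, and the schema type is left generic.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntFunction;

/** Hypothetical helper: caches schemas keyed by the ID carried in the message header. */
final class SchemaCache<S> {
    // Assumption: 1 magic byte (0) + 4-byte big-endian schema ID, then the payload.
    private static final byte MAGIC_BYTE = 0;
    static final int HEADER_LENGTH = 5;

    private final Map<Integer, S> schemasById = new ConcurrentHashMap<>();
    private final IntFunction<S> loader; // e.g. a (slow) schema-registry lookup

    SchemaCache(IntFunction<S> loader) {
        this.loader = loader;
    }

    /** Reads the schema ID from the header without touching the payload. */
    static int schemaId(byte[] message) {
        if (message.length < HEADER_LENGTH || message[0] != MAGIC_BYTE) {
            throw new IllegalArgumentException("not a schema-framed message");
        }
        // ByteBuffer is big-endian by default.
        return ByteBuffer.wrap(message, 1, 4).getInt();
    }

    /** Returns the cached schema, invoking the loader only on the first miss per ID. */
    S schemaFor(byte[] message) {
        return schemasById.computeIfAbsent(schemaId(message), id -> loader.apply(id));
    }
}
```

With something like this, the per-event cost is a 5-byte header read plus a map lookup; the payload starting at `HEADER_LENGTH` would then be handed to whatever parser understands the schema.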

baldawar commented 2 years ago

https://github.com/FasterXML/jackson-dataformats-binary seems to be extending JsonParser, so it "should" support nextToken(), but it's definitely worth a second look. There are some tests in the implementing package, so I'm hopeful. I hadn't thought beyond this yet.

I like the idea of extending a Flattener interface from Quamina: https://github.com/timbray/quamina/#flattening-and-matching. I'm hoping there's a similar interface we can have that allows for extension. This interface should be in addition to built-in support for Avro and other schemas.
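
A rough idea of what such an extension point might look like in Java, as a sketch only. `Flattener`, `Field`, and `MapFlattener` are hypothetical names and do not exist in Ruler today; Quamina's actual interface is in Go and has a different signature. The example flattens an already-decoded nested map just to show the contract: an event in, a list of path/value pairs out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** One flattened field: a dotted path plus the value rendered as text. */
final class Field {
    final String path;
    final String value;

    Field(String path, String value) {
        this.path = path;
        this.value = value;
    }
}

/** Hypothetical extension point: turns an event of type E into path/value pairs. */
interface Flattener<E> {
    List<Field> flatten(E event);
}

/** Toy implementation for events that are already decoded into nested maps. */
final class MapFlattener implements Flattener<Map<String, Object>> {
    @Override
    public List<Field> flatten(Map<String, Object> event) {
        List<Field> out = new ArrayList<>();
        walk("", event, out);
        return out;
    }

    // Depth-first walk; nested maps extend the path, everything else is a leaf.
    private static void walk(String prefix, Map<?, ?> node, List<Field> out) {
        for (Map.Entry<?, ?> e : node.entrySet()) {
            String path = prefix.isEmpty() ? String.valueOf(e.getKey()) : prefix + "." + e.getKey();
            Object v = e.getValue();
            if (v instanceof Map) {
                walk(path, (Map<?, ?>) v, out);
            } else {
                out.add(new Field(path, String.valueOf(v)));
            }
        }
    }
}
```

A format-specific flattener (Avro, protobuf, CBOR) would implement the same interface, while the matching side would only ever see `Field` lists.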

baldawar commented 2 years ago

Also 👋 Ayman, missed ya.

timbray commented 2 years ago

Just had a look at the Jackson Avro code. I was worried that it would implement nextToken() by deserializing into an object then traversing that, but it looks pretty efficient actually. Also, it looks pretty complicated, wouldn't want to implement one of these from scratch.

Also, hey there Ayman.

aymkhalil commented 2 years ago

It seems nextToken() has a huge code-reusability advantage. I was wondering, though, if it limits the ability to 1) consult the schema to see whether a pattern field exists in the event and 2) look that field up in constant time. The O(1) lookup part should be doable in protobuf; I'm not sure about Avro. This may be beneficial when Pattern fields << Event fields, OR it's just premature optimization ¯\_(ツ)_/¯
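
To make the two questions concrete, here is a minimal sketch of a schema-side index, assuming the schema can be reduced to an ordered list of field names. `SchemaIndex` is a hypothetical name, and nothing here reflects how Ruler, Avro, or protobuf actually expose this.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-schema index: field name to position, built once per schema. */
final class SchemaIndex {
    private final Map<String, Integer> positions = new HashMap<>();

    SchemaIndex(List<String> fieldNames) {
        for (int i = 0; i < fieldNames.size(); i++) {
            positions.put(fieldNames.get(i), i);
        }
    }

    /** Expected O(1) lookup: position of the field in the schema, or -1 if absent. */
    int positionOf(String fieldName) {
        return positions.getOrDefault(fieldName, -1);
    }

    /**
     * True if every field the pattern names is declared by the schema.
     * Caveat: this ignores patterns that match on absence (e.g. "exists": false),
     * so it is only a valid pre-filter for patterns that require their fields.
     */
    boolean declaresAll(Set<String> patternFields) {
        return positions.keySet().containsAll(patternFields);
    }
}
```

Whether this pays off depends on the format: it only helps if the decoder can jump to a field by position instead of tokenizing everything before it.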

fwiw I was actually doing research on event matching options for Pulsar. I stumbled upon a few options like JMS selectors, JSTL, and even fully fledged SQL "WHERE" conditions! At the time, I also wanted to experiment with Ruler, as it would beat the other options performance-wise at least (but the Java version was not OSS, and the non-AWS community is very schema-full). SOOO, a big thank you to everyone behind this initiative!

Also, 👋 Rishi, 👋 Tim! Hope all is well!