Open aymkhalil opened 2 years ago
This is a great idea. The thing to watch out for is the performance of turning Avro messages into field name/value pairs; in the case of Jackson, this is the most expensive part of the matching task. Note the comments on https://www.tbray.org/ongoing/When/202x/2021/12/03/Filtering-Lessons where people say that it should be possible to do this very efficiently. Ideally it would be nice to have a Jackson-compatible tokenizer, which would allow for a lot of code re-use. Hey, check this out: https://github.com/FasterXML/jackson-dataformats-binary - this could be the basis for killing a lot of birds with one stone.
Thanks @timbray for the implementation hints! It is promising to see the possibility to add binary format support without compromising performance.
I like the idea of supporting jackson-dataformats-binary. Updated the title to highlight this. Code-change wise, I think this'll mean a new method in Event.java that creates a JsonParser [1] for the right format. We then bubble that interface up to the various entry points clients can use (GenericMachine, Ruler). Then we also need to add test cases + benchmarks.
Just need to be sure that this supports the nextToken() API, not just ObjectMapper. Also need to watch out for schemas, most of these binary formats can't be parsed without accessing schemas. Will need to cache schemas where possible. CBOR doesn't need a schema. Avro relies on a 4-byte field in the Kafka wire format header that identifies the schema. Which is to say, this feature is going to need some design thinking, even if the implementation isn't that hard.
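To make the schema-caching point concrete, here is a minimal sketch of looking up a schema by the ID carried in the message header. It assumes a framing of one magic byte (0) followed by a 4-byte big-endian schema ID, and it assumes the schema can be fetched as text through some loader function. `SchemaIdCache` and the loader are hypothetical names for illustration, not part of Ruler or Jackson.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntFunction;

// Hypothetical helper, not part of Ruler: caches schemas by the ID in the message header.
final class SchemaIdCache {
    private static final byte MAGIC_BYTE = 0x0;  // assumed framing marker
    private static final int HEADER_LENGTH = 5;  // 1 magic byte + 4-byte schema ID

    private final Map<Integer, String> schemasById = new ConcurrentHashMap<>();
    private final IntFunction<String> schemaLoader; // assumed: e.g. a registry lookup

    SchemaIdCache(IntFunction<String> schemaLoader) {
        this.schemaLoader = schemaLoader;
    }

    // Reads the 4-byte big-endian schema ID that follows the magic byte.
    static int schemaId(byte[] message) {
        if (message.length < HEADER_LENGTH || message[0] != MAGIC_BYTE) {
            throw new IllegalArgumentException("not a schema-framed message");
        }
        return ByteBuffer.wrap(message, 1, 4).getInt();
    }

    // Returns the cached schema, calling the loader only on the first sighting of an ID.
    String schemaFor(byte[] message) {
        return schemasById.computeIfAbsent(schemaId(message), id -> schemaLoader.apply(id));
    }
}
```

With something like this in front of the parser, the expensive schema fetch happens once per schema ID rather than once per event; the open design question is still where the loader gets its schemas from.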
https://github.com/FasterXML/jackson-dataformats-binary seems to be extending JsonParser, so it "should" be supporting nextToken(), but it's definitely worth a second look. There are some tests in the package that use it, so I'm hopeful. I hadn't thought beyond this yet.
I find the idea of extending a Flattener interface, as in Quamina (https://github.com/timbray/quamina/#flattening-and-matching), appealing. I'm hoping there's a similar interface we can have that allows for extension. This interface should be in addition to built-in support for Avro and other schema-based formats.
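As a rough illustration of what such an extension point could look like in Java, here is a sketch of a flattener that turns an event into path/value pairs. `Flattener`, `FlatField`, and `MapFlattener` are hypothetical names made up for this example; nothing like them exists in Ruler today, and Quamina's actual Go interface differs in its details.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical types for illustration only; not part of Ruler.
final class FlatField {
    final String path;   // dotted path, e.g. "detail.state"
    final String value;  // leaf value rendered as text

    FlatField(String path, String value) {
        this.path = path;
        this.value = value;
    }
}

// The extension point: one implementation per event format (JSON, Avro, ...).
interface Flattener<E> {
    List<FlatField> flatten(E event);
}

// Toy implementation for already-decoded events held as nested maps.
final class MapFlattener implements Flattener<Map<String, Object>> {
    @Override
    public List<FlatField> flatten(Map<String, Object> event) {
        List<FlatField> out = new ArrayList<>();
        walk("", event, out);
        return out;
    }

    // Depth-first walk that emits one FlatField per leaf.
    private static void walk(String prefix, Map<?, ?> node, List<FlatField> out) {
        for (Map.Entry<?, ?> entry : node.entrySet()) {
            String key = String.valueOf(entry.getKey());
            String path = prefix.isEmpty() ? key : prefix + "." + key;
            Object value = entry.getValue();
            if (value instanceof Map) {
                walk(path, (Map<?, ?>) value, out);
            } else {
                out.add(new FlatField(path, String.valueOf(value)));
            }
        }
    }
}
```

A format-specific flattener (say, one backed by a Jackson binary parser) would implement the same interface, so the matching side would not need to know which wire format the event arrived in.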
Also 👋 Ayman, missed ya.
Just had a look at the Jackson Avro code. I was worried that it would implement nextToken() by deserializing into an object then traversing that, but it looks pretty efficient actually. Also, it looks pretty complicated, wouldn't want to implement one of these from scratch.
Also, hey there Ayman.
Seems nextToken() has a huge code reusability advantage. I was wondering, though, whether it limits the ability to 1) consult the schema to check if the pattern field exists in the event and 2) look it up in constant time. The O(1) lookup part should be doable in protobuf - not sure about Avro. This may be beneficial when Pattern fields are << Event fields OR just premature optimization ¯\_(ツ)_/¯
fwiw I was actually doing research on event matching options for Pulsar - I stumbled upon a few options like JMS selectors, JSTL, and even fully fledged SQL "WHERE" conditions! At the time, I also wanted to experiment with Ruler, as it would beat the other options performance-wise at least (but the Java version was not OSS, and the non-AWS community is very schema-full). SOOO, a big thank you to everyone behind this initiative!
Also, 👋 Rishi, 👋 Tim! Hope all is well!
What is your idea?
Support pattern matching on AVRO events. AVRO support is a reasonable next step because:
Would you be willing to make the change?
Maybe
Additional context
Message/streaming systems are lacking a killer pattern/expression/filter language - they could use a de facto "ruler pattern" language, just like SQL is for DBs.
Users who choose to define schemas for their events expect all interactions to respect the schema. Having pattern matching respect data types and fields as defined by the "active or a previous" schema seems like a natural fit for those use cases.