ArroyoSystems / arroyo

Distributed stream processing engine in Rust
https://arroyo.dev
Apache License 2.0

Support Avro to RecordBatch conversion directly #768

Open · zhuliquan opened this issue 3 weeks ago

zhuliquan commented 3 weeks ago

Arroyo is a very good library. While using it we ran into some performance issues: profiling showed large-scale decoding operations, as shown below.

[image: profiling screenshot showing the decoding hotspots]

I analyzed the code at https://github.com/zhuliquan/arroyo/blob/776965ae9d6ee818595197288d5cca379c564368/crates/arroyo-formats/src/de.rs#L338-L355 and found that Avro data consumed from Kafka is first converted to an Avro Value, then to a JSON Value, then serialized to bytes, and finally decoded into a RecordBatch. My question is: why not convert from Avro to RecordBatch directly? arrow-rs also supports the Avro format (https://github.com/apache/arrow-rs/tree/master/arrow-avro).
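To make the cost concrete, here is a simplified sketch of that indirect path using the apache_avro and arrow-json crates. This is my own illustration, not Arroyo's actual code; the two-column schema and the `avro_to_json` / `rows_to_batch` helpers are made up for the example:

```rust
use std::sync::Arc;

use apache_avro::types::Value as AvroValue;
use arrow_array::RecordBatch;
use arrow_json::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};
use serde_json::{json, Value as JsonValue};

// Hand-rolled Avro -> JSON conversion covering a few variants; Arroyo's real
// converter handles the full Avro type system.
fn avro_to_json(v: AvroValue) -> JsonValue {
    match v {
        AvroValue::Null => JsonValue::Null,
        AvroValue::Boolean(b) => json!(b),
        AvroValue::Long(n) => json!(n),
        AvroValue::String(s) => json!(s),
        AvroValue::Record(fields) => JsonValue::Object(
            fields.into_iter().map(|(k, v)| (k, avro_to_json(v))).collect(),
        ),
        _ => JsonValue::Null, // remaining variants elided for brevity
    }
}

fn rows_to_batch(rows: Vec<AvroValue>) -> RecordBatch {
    // Made-up two-column schema for the example.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, true),
    ]));
    let mut decoder = ReaderBuilder::new(schema).build_decoder().unwrap();
    for row in rows {
        // The extra hop: re-serialize each row to JSON bytes just so the
        // arrow-json decoder can parse them back into columns.
        let bytes = serde_json::to_vec(&avro_to_json(row)).unwrap();
        decoder.decode(&bytes).unwrap();
    }
    decoder.flush().unwrap().expect("at least one row")
}
```

Every row pays for an Avro Value allocation, a JSON Value allocation, a JSON serialization, and a JSON parse before any columnar data exists.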

mwylde commented 3 weeks ago

The answer has two parts:

  1. When we built Avro support into Arroyo, the arrow-rs Avro implementation was not complete enough to use, so we took a bit of a shortcut with the avro-to-json approach.
  2. It's not straightforward to support all Avro features as SQL data types (for example, arbitrary unions). So today, for any field with an unsupported data type, we use a "raw json" encoding: we re-encode those columns as JSON and make them available for querying with JSON functions (see the sketch after this list). This allows us to support any Avro schema.
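For concreteness, here is a minimal sketch of what that fallback means mechanically. It is my illustration rather than the actual Arroyo code, and it assumes apache_avro's `Value` representation, where a union carries the index of the resolved branch plus the boxed value:

```rust
use std::sync::Arc;

use apache_avro::types::Value as AvroValue;
use arrow_array::{ArrayRef, StringArray};

// An Avro union such as ["int", "string"] has no SQL equivalent, so each
// resolved value is re-encoded as JSON text in a Utf8 column.
fn union_to_raw_json(values: &[AvroValue]) -> ArrayRef {
    let cells: Vec<String> = values
        .iter()
        .map(|v| match v {
            // At runtime a union carries exactly one resolved branch.
            AvroValue::Union(_, inner) => match inner.as_ref() {
                AvroValue::Int(n) => n.to_string(),
                AvroValue::String(s) => serde_json::to_string(s).unwrap(),
                other => format!("{other:?}"), // remaining branches elided
            },
            other => format!("{other:?}"),
        })
        .collect();
    Arc::new(StringArray::from(cells))
}
```

The resulting string column can then be queried with SQL JSON functions, which is what lets any Avro schema round-trip into a table.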

Assuming we can find a pathway to support (2) with the arrow-rs implementation (and it's reasonably complete and fast), we can move to that. The approach might look like what we already do for JSON in our arrow-rs fork: https://github.com/ArroyoSystems/arrow-rs/blob/52.1.0/json/arrow-json/src/reader/json_array.rs
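If arrow-avro matures to that point, the direct path could look roughly like the sketch below. The `ReaderBuilder` shape shown here is an assumption modeled on arrow-rs's JSON and CSV readers; arrow-avro is still evolving, so names and signatures may differ:

```rust
use std::fs::File;
use std::io::BufReader;

use arrow_avro::reader::ReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // One hop: Avro container bytes -> RecordBatch, with no JSON round trip.
    // NOTE: arrow-avro's API is still evolving; this builder/reader shape is
    // an assumption, not a stable interface.
    let file = BufReader::new(File::open("events.avro")?);
    let reader = ReaderBuilder::new().build(file)?;
    for batch in reader {
        println!("decoded {} rows", batch?.num_rows());
    }
    Ok(())
}
```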