zhuliquan opened this issue 3 weeks ago
The answer has two parts:
Assuming we can find a pathway to support (2) with the arrow-rs implementation (and it's reasonably complete and fast), we can move to that. The approach might look like what we already do for JSON in our arrow-rs fork: https://github.com/ArroyoSystems/arrow-rs/blob/52.1.0/json/arrow-json/src/reader/json_array.rs
Arroyo is a very good library, but we ran into some performance issues when using it and found that it performs large-scale decoding work, as shown below. I analyzed the code at https://github.com/zhuliquan/arroyo/blob/776965ae9d6ee818595197288d5cca379c564368/crates/arroyo-formats/src/de.rs#L338-L355 and found that the Avro data consumed from Kafka is first converted to an Avro `Value`, then to a `JsonValue`, then serialized to bytes, and finally decoded into a `RecordBatch`.

I actually have a question here: why not convert directly from Avro to `RecordBatch`? arrow-rs also supports the Avro format (https://github.com/apache/arrow-rs/tree/master/arrow-avro).
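To make the overhead concrete, here is a minimal sketch of that indirect path (my own illustration, not Arroyo's actual code; the function name and schema arguments are assumptions), using the `apache_avro`, `serde_json`, and `arrow-json` crates:

```rust
use std::io::Cursor;
use std::sync::Arc;

use apache_avro::{from_avro_datum, Schema as AvroSchema};
use arrow_array::RecordBatch;
use arrow_json::ReaderBuilder;
use arrow_schema::Schema;
use serde_json::Value as JsonValue;

/// Decode one Avro datum into a RecordBatch by way of JSON,
/// mirroring the indirect pipeline described above.
fn avro_to_batch_via_json(
    avro_schema: &AvroSchema,
    arrow_schema: Arc<Schema>,
    datum: &[u8],
) -> Result<RecordBatch, Box<dyn std::error::Error>> {
    // Step 1: Avro bytes -> apache_avro::types::Value.
    let avro_value = from_avro_datum(avro_schema, &mut Cursor::new(datum), None)?;

    // Step 2: Avro Value -> serde_json::Value.
    let json_value: JsonValue = avro_value.try_into()?;

    // Step 3: JSON Value -> newline-delimited JSON bytes.
    let mut json_bytes = serde_json::to_vec(&json_value)?;
    json_bytes.push(b'\n');

    // Step 4: JSON bytes -> RecordBatch via the arrow-json reader.
    let mut reader = ReaderBuilder::new(arrow_schema).build(Cursor::new(json_bytes))?;
    Ok(reader.next().transpose()?.ok_or("no batch decoded")?)
}
```

Every record pays for an extra JSON serialization (step 3) and a second parse (step 4), which is where the decoding cost shows up.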
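By contrast, a direct conversion would build Arrow arrays straight from the decoded Avro `Value`s, skipping the JSON round trip. The sketch below hand-rolls this for an assumed `{id: long, name: string}` record schema; in practice this is the kind of work the arrow-avro crate linked above could take over once it is complete enough:

```rust
use std::sync::Arc;

use apache_avro::types::Value as AvroValue;
use arrow_array::{
    builder::{Int64Builder, StringBuilder},
    ArrayRef, RecordBatch,
};
use arrow_schema::{DataType, Field, Schema};

/// Build a RecordBatch directly from decoded Avro values for an
/// assumed `{id: long, name: string}` record schema (toy example).
fn avro_values_to_batch(
    rows: &[AvroValue],
) -> Result<RecordBatch, Box<dyn std::error::Error>> {
    let mut ids = Int64Builder::new();
    let mut names = StringBuilder::new();

    for row in rows {
        // Each decoded datum is expected to be a two-field record;
        // non-record rows are skipped for brevity.
        if let AvroValue::Record(fields) = row {
            for (field_name, value) in fields {
                match (field_name.as_str(), value) {
                    ("id", AvroValue::Long(v)) => ids.append_value(*v),
                    ("name", AvroValue::String(s)) => names.append_value(s),
                    _ => return Err(format!("unexpected field {field_name}").into()),
                }
            }
        }
    }

    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    let columns: Vec<ArrayRef> = vec![Arc::new(ids.finish()), Arc::new(names.finish())];
    Ok(RecordBatch::try_new(schema, columns)?)
}
```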