support avro to record batch directly

ArroyoSystems / arroyo

Distributed stream processing engine in Rust

Apache License 2.0

3.8k stars 220 forks source link

The answer is two parts:

When we built the avro support into Arroyo, the arrow-rs avro implementation was not complete enough to use so we took a bit of a shortcut with the avro-to-json approach
It's not straightforward to support all avro features as SQL data types (for example, arbitrary unions), so today for any fields that have an unsupported data type, we use "raw json" encoding, where we re-encode those columns as JSON and make them available for querying with json functions. This allows us to support any avro schema.

Assuming we can find a pathway to support (2) with the arrow-rs implementation (and it's reasonably complete/fast) we can move to that. The approach might look like what we already do for JSON in our arrow-rs fork: https://github.com/ArroyoSystems/arrow-rs/blob/52.1.0/json/arrow-json/src/reader/json_array.rs

ArroyoSystems / arroyo