chmp / serde_arrow

Convert sequences of Rust objects to Arrow tables
https://docs.rs/serde_arrow/
MIT License

Crossovers #66

Closed v1gnesh closed 1 year ago

v1gnesh commented 1 year ago

Hi,

Firstly, thank you for building this in the open & sharing!

I see that this can be used to convert serde-derivable structures to the arrow layout.

There are crates such as deku and binrw for parsing binary content into Rust data types. Additionally, there is https://github.com/simd-lite/simd-json-derive for deriving JSON from Rust data types.

Would I be able to convert a bunch of structs "created" by them, and then use serde_arrow's derive on top of that, to convert it finally to the arrow layout?

chmp commented 1 year ago

Hey,

Thanks for the kind words!

The crates you mention are really cool indeed. At first glance they do not seem to offer this split between data and format as serde does. So I do not see an obvious way to convert from deku / binrw data to arrow directly.

If you're talking about deku / binrw -> Rust -> arrow, then sure: you can use serde_arrow as is, you just need to specify the schema of your objects, either by tracing a couple of examples using serialize_into_fields or by building the schema yourself. Then you use deku / binrw to construct the Rust objects and serde_arrow to build the corresponding arrow arrays from these objects.
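A minimal, std-only sketch of the deku / binrw -> Rust -> arrow data flow. Everything in it is a hypothetical stand-in: LogRecord and its 5-byte layout are invented for illustration, parse_records plays the role that deku / binrw would take over, and to_columns only mimics the row-to-column step that serde_arrow performs. No real crate API is used here.

```rust
/// Hypothetical record type; with deku / binrw the parsing below would be derived.
#[derive(Debug, Clone, PartialEq)]
struct LogRecord {
    timestamp: u32,
    level: u8,
}

/// Hand-written stand-in for a deku / binrw parser. Assumed layout: each
/// record is 5 bytes, a little-endian u32 timestamp followed by a u8 level.
/// Trailing bytes that do not form a full record are ignored.
fn parse_records(bytes: &[u8]) -> Vec<LogRecord> {
    bytes
        .chunks_exact(5)
        .map(|chunk| LogRecord {
            timestamp: u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]),
            level: chunk[4],
        })
        .collect()
}

/// Column-oriented view of the records: one vector per field.
#[derive(Debug, Default, PartialEq)]
struct LogColumns {
    timestamp: Vec<u32>,
    level: Vec<u8>,
}

/// Stand-in for the serde_arrow step: turn a slice of rows into columns.
fn to_columns(records: &[LogRecord]) -> LogColumns {
    let mut columns = LogColumns::default();
    for record in records {
        columns.timestamp.push(record.timestamp);
        columns.level.push(record.level);
    }
    columns
}
```

In a real setup the struct would additionally derive serde's Serialize, and the slice of parsed records would be handed to serde_arrow instead of to_columns.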

v1gnesh commented 1 year ago

Thank you, yeah I mean this option -- deku / binrw -> Rust -> arrow. If you have time, could you share an example of how I'd go about doing this? I'm pretty noob-ish with programming in general. My use case has a whole bunch of nested struct types of binary log data.

Will post about the first method in those 2 projects and see what they think..

chmp commented 1 year ago

With serde_arrow, you have to ensure all your types implement Serialize / Deserialize, e.g., by using serde's derive macros. Then you can simply follow the example in the readme:

  1. Trace the fields (i.e., determine the schema of your arrays): let fields = serialize_into_fields(&items, TracingOptions::default())?;
  2. Construct the arrays: let arrays = serialize_into_arrays(&fields, &items)?;

Important for step 1: if you have enums or lists, you must make sure that every list has at least one entry in the examples you trace and that all relevant enum variants are encountered.
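To illustrate why the traced examples matter, here is a toy, std-only model of tracing. It is not serde_arrow's implementation; Event, Kind, Traced and trace are hypothetical names. The point is only that a tracer can learn nothing about a list's item type from empty lists, and cannot know about enum variants that never show up in the samples.

```rust
use std::collections::BTreeSet;

/// Hypothetical enum field of the sample type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Kind {
    Start,
    Stop,
    Error,
}

/// Hypothetical sample type with an enum field and a list field.
struct Event {
    kind: Kind,
    tags: Vec<String>,
}

/// What the toy tracer learned from the samples it was shown.
#[derive(Debug, PartialEq)]
struct Traced {
    /// Enum variants that appeared in at least one sample.
    seen_kinds: BTreeSet<Kind>,
    /// True once some sample had a non-empty `tags` list, i.e. once the
    /// item type of the list could actually be observed.
    tags_item_known: bool,
}

/// Walk over the samples and record which parts of the schema were observed.
fn trace(samples: &[Event]) -> Traced {
    let mut traced = Traced {
        seen_kinds: BTreeSet::new(),
        tags_item_known: false,
    };
    for event in samples {
        traced.seen_kinds.insert(event.kind);
        if !event.tags.is_empty() {
            traced.tags_item_known = true;
        }
    }
    traced
}
```

Tracing a single Event with Kind::Start and an empty tags list leaves tags_item_known false and records only one variant, so a schema built from it would be incomplete for data that later contains tags or other variants.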

If you control the whole code base, arrow2-convert might also be an option. You can easily convert from arrow2 to arrow.

chmp commented 1 year ago

Closing this issue, as there is no change necessary in serde_arrow as far as I can tell.