chmp / serde_arrow

Convert sequences of Rust objects to Arrow tables
https://docs.rs/serde_arrow/
MIT License

Deserialize from RecordBatch instead of &[Field] #137

Closed xmakro closed 7 months ago

xmakro commented 8 months ago

In the serde_arrow examples, the fields are derived from the Rust type and then used for the conversion, as follows:

use serde_arrow::schema::{TracingOptions, SerdeArrowSchema};
let fields = SerdeArrowSchema::from_type::<Record>(TracingOptions::new().allow_null_fields(true))
    .unwrap()
    .to_arrow_fields()
    .unwrap();
let arrays = serde_arrow::to_arrow(&fields, &records)?;

However, for many files we cannot rely on the order of the fields in the struct matching the order of the fields in the record batch. Therefore, we need to use the fields from the file. serde_arrow::from_arrow expects a &[Field], but the fields in the arrow Schema are stored as Vec<Arc<Field>>, so we have to copy the fields out before we can use them:

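// collect_vec comes from the itertools crate
use itertools::Itertools;

// Clone each Arc<Field> into an owned Field, as from_arrow expects &[Field]
let fields = batch
    .schema()
    .fields
    .iter()
    .map(|x| (**x).clone())
    .collect_vec();
let records = serde_arrow::from_arrow::<Vec<Record>, _>(&fields, batch.columns())?;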

This is inconvenient. It would be better if the API of from_arrow either accepted a RecordBatch or took the fields as impl Iterator<Item = impl AsRef<Field>>.

chmp commented 8 months ago

You raise a valid point: more often than not, the order of the fields in the struct is unrelated to the order of the fields in the record batch. At the moment I don't have a good idea of how to handle this situation. I guess there would need to be some way to align the order of the fields between the record batch and the fields contained in the schema.
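One way such an alignment step could look is sketched below, assuming fields are matched purely by name. `align_by_name` is a hypothetical helper written for illustration, not part of serde_arrow or arrow, and it works on plain name lists rather than real `Field` values.

```rust
use std::collections::HashMap;

/// For each name in `wanted` (e.g. the struct's field order), return the
/// position of the column with that name in `available` (e.g. the record
/// batch's field order). Returns `None` if any wanted name is missing.
/// If `available` contains duplicate names, the last occurrence is used.
fn align_by_name(wanted: &[&str], available: &[&str]) -> Option<Vec<usize>> {
    // Map each available column name to its position.
    let index_of: HashMap<&str, usize> = available
        .iter()
        .enumerate()
        .map(|(idx, name)| (*name, idx))
        .collect();

    // Look up every wanted name; collecting into Option short-circuits
    // to None on the first missing name.
    wanted
        .iter()
        .map(|name| index_of.get(*name).copied())
        .collect()
}
```

The resulting index list could then be used to pick fields and columns in the order the struct expects, while columns the struct does not mention are simply ignored.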

TBH, using a RecordBatch as the sole input is probably not a good idea, as serde_arrow uses the fields to carry additional metadata (e.g., to interpret Utf8 arrays as datetimes).

chmp commented 8 months ago

While thinking about this issue, I thought: why not use Vec::<Field>::from_type? As it turns out, the schema serialization formats of arrow and serde_arrow are different. To fix this, I started to implement the corresponding deserialization logic in #138. This will allow the following code (see here):

let fields = Vec::<Field>::from_value(&batch.schema())?;
let items: Vec<Record> = serde_arrow::from_arrow(&fields, batch.columns())?;

At the moment, only a limited subset of types is supported, but I will add the remaining types as well.

I still plan to add from_record_batch and to_record_batch functions to simplify arrow interop, but I am still thinking about the best interface.

chmp commented 8 months ago

Changing the signature to use AsRef would require changing all APIs to accept Arc<Field> instead of Field, because Field does not implement AsRef<Field>. However, this also seems to be the more idiomatic solution, as the arrow API uses Arc<Field> throughout.

Changing to Arc<Field> requires rewriting all macro tests (#122), as they currently use std::slice::from_ref, which cannot easily be replaced when working with Arc<Field>.
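For comparison, here is a minimal sketch of a different bound, `std::borrow::Borrow<Field>`, which is satisfied by both `Field` (through the blanket `impl Borrow<T> for T` in std) and `Arc<Field>`. This is an alternative to the AsRef approach discussed above, not what serde_arrow does; the `Field` struct is a stand-in for arrow's type and `field_names` is a hypothetical function.

```rust
use std::borrow::Borrow;
use std::sync::Arc;

// Stand-in for arrow's `Field`; only the name matters for this sketch.
#[derive(Debug, Clone)]
struct Field {
    name: String,
}

// Accepts any slice whose items can be borrowed as `Field`,
// e.g. `&[Field]` or `&[Arc<Field>]`.
fn field_names<F: Borrow<Field>>(fields: &[F]) -> Vec<String> {
    fields
        .iter()
        .map(|f| {
            // The annotation selects `Borrow<Field>` explicitly.
            let field: &Field = f.borrow();
            field.name.clone()
        })
        .collect()
}
```

With this bound, callers holding owned fields and callers holding the `Arc<Field>` values from an arrow schema could use the same function without cloning the fields first.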

chmp commented 7 months ago

Added a from_record_batch function on main.

xmakro commented 7 months ago

Looks great, thank you very much!

chmp commented 7 months ago

Released as serde_arrow=0.11.0