Closed. xmakro closed this issue 7 months ago.
You raise a valid point: more often than not, the order of the fields in the struct is not related to the order of the fields in the record batch. At the moment I don't have a good idea how to handle this situation. I guess there would need to be some way to align the field order of the record batch with the fields contained in the schema.
TBH, using a RecordBatch as the sole input is probably not a good idea, as serde_arrow uses the fields to include additional metadata (e.g., to interpret Utf8 arrays as datetimes).
While thinking about this issue, I wondered: why not use Vec::<Field>::from_type? As it turns out, the schema serialization formats of arrow and serde_arrow are different. To fix this, I started to implement the corresponding deserialization logic in #138. This will allow the following code (see here):
```rust
let fields = Vec::<Field>::from_value(&batch.schema())?;
let items: Vec<Record> = serde_arrow::from_arrow(&fields, batch.columns())?;
```
At the moment, only a limited subset of types is supported, but I will add the remaining types as well.
I still plan to add from_record_batch and to_record_batch functions to simplify arrow interop, but I am still thinking about the best interface.
Changing the signature to use AsRef would require changing all APIs to accept Arc<Field> instead of Field, since Field does not implement AsRef<Field>. However, this also seems to be the more idiomatic solution, as the arrow API uses Arc<Field> throughout.
Changing to Arc<Field> requires rewriting all macro tests (#122), as they currently use std::slice::from_ref, which cannot easily be replaced with Arc<Field>.
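To illustrate the trait-bound constraint, here is a standalone sketch (not code from serde_arrow itself): an AsRef<Field>-based parameter accepts Arc<Field> but rejects plain Field, while std::slice::from_ref only produces a &[Field] of plain fields.

```rust
use std::sync::Arc;

use arrow::datatypes::{DataType, Field};

// A parameter in the style of the proposed signature: any slice of
// values that can be viewed as a Field.
fn accepts_fields<F: AsRef<Field>>(fields: &[F]) {
    for f in fields {
        let _field: &Field = f.as_ref();
    }
}

fn main() {
    let field = Field::new("a", DataType::Int32, false);

    // Arc<Field> implements AsRef<Field>, so this compiles.
    accepts_fields(&[Arc::new(field.clone())]);

    // A plain Field does not implement AsRef<Field>, so this would not:
    // accepts_fields(std::slice::from_ref(&field));

    // std::slice::from_ref turns a &Field into a one-element &[Field],
    // which is how the existing macro tests pass a single field around.
    let single: &[Field] = std::slice::from_ref(&field);
    assert_eq!(single.len(), 1);
}
```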
Added a from_record_batch function on main.
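For reference, a minimal sketch of how the new helper is presumably used; the exact signature is assumed from the discussion above rather than taken from the released API.

```rust
use arrow::record_batch::RecordBatch;
use serde::Deserialize;

#[derive(Deserialize)]
struct Record {
    a: i32,
    b: f64,
}

// Hedged sketch: from_record_batch is assumed to bundle the field
// handling and the call to from_arrow into a single call.
fn read(batch: &RecordBatch) -> Result<Vec<Record>, serde_arrow::Error> {
    let items: Vec<Record> = serde_arrow::from_record_batch(batch)?;
    Ok(items)
}
```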
Looks great, thank you very much!
Released as serde_arrow=0.11.0
In the examples of serde_arrow, data from arrow is deserialized by passing a list of fields together with the record batch's columns to serde_arrow::from_arrow. However, in many files we cannot rely on the order of the fields in the struct being the same as the order of the fields in the record batch. Therefore, we need to use the fields in the file.
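For concreteness, a minimal sketch of that pattern; the Record struct and the hard-coded field list are illustrative assumptions, and the code only works when the batch's columns appear in exactly this order.

```rust
use arrow::datatypes::{DataType, Field};
use arrow::record_batch::RecordBatch;
use serde::Deserialize;

#[derive(Deserialize)]
struct Record {
    a: i32,
    b: f64,
}

fn read(batch: &RecordBatch) -> Result<Vec<Record>, serde_arrow::Error> {
    // The fields are listed in the struct's order; from_arrow pairs
    // them positionally with batch.columns().
    let fields = vec![
        Field::new("a", DataType::Int32, false),
        Field::new("b", DataType::Float64, false),
    ];
    let items: Vec<Record> = serde_arrow::from_arrow(&fields, batch.columns())?;
    Ok(items)
}
```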
serde_arrow::from_arrow expects a &[Field]. The fields in the arrow Schema are stored as Vec<Arc<Field>>, so we have to copy out the fields before we can use them, which is inconvenient.
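A sketch of the copy-out step this forces, reusing the imports and the Record struct from the sketch above:

```rust
fn read_any_order(batch: &RecordBatch) -> Result<Vec<Record>, serde_arrow::Error> {
    // Clone each Field out of the Arc-wrapped schema so the result can
    // be passed to serde_arrow::from_arrow as a &[Field].
    let fields: Vec<Field> = batch
        .schema()
        .fields()
        .iter()
        .map(|field| field.as_ref().clone())
        .collect();
    let items: Vec<Record> = serde_arrow::from_arrow(&fields, batch.columns())?;
    Ok(items)
}
```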
It would be better if we could change the API of from_arrow to either accept a RecordBatch or change the type of fields to impl Iterator<Item = AsRef<Field>>.