Open raj-nimble opened 1 month ago
Hi @raj-nimble
Thanks. I like the idea a lot!
ATM, I don't really have the time for serde arrow. I will have a proper look next week!
First feedback: it's great, that you target the 0.12 release. I would like to limit the current main to bug fixes for the 0.11 release.
Re builders we can also figure it out together, once I'm back at my laptop. I imagine the deserialization logic will be a bit tricky. Maybe the impl of untagged enums in serde could be an inspiration (I like use the rust playground to expand the generated deserialize impl).
Hi @chmp that's great, I'm happy you like it. I will continue to iterate slowly on my own this week and I look forward to discussing it more with you next week.
Hi @chmp . Thanks so much for the comments! I haven't made any progress since last week as I was working on something else, but with your comments now I can hopefully circle back to this starting next week and I'll update you as soon as I have made some progress. Thanks!
Probably, I would implement this as serialization only for now.
Agreed, it's best to implement the feature as serialization only for now. Can add a separate issue to track adding deserialization.
Hi @chmp ,
I'm struggling to find an elegant way to do the serialization.
I started out attempting to modify StructBuilder to handle it but the issue becomes that the value being serialized is the actual Enum and it was difficult to attempt to detect that in ArrayBuilder, and deconstruct it into it's fields to call serialize on that.
If not detected, and we do something like trying to return the field ArrayBuilder, due to the serialization it ends up calling serialize_struct_field
on the various primitive builders and it didn't seem like the best idea to go through and implement that serialization method on all the primitive builders (although maybe that really is the best way?).
I considered redoing the schema creation to continue to use UnionBuilder and then add logic there to serialize it as a struct instead but I think that still ran into problems with arrow writer and seemed more convoluted.
I then considered creating an entirely new builder, something like an EnumStructBuilder
. I think that would move the complexity into OuterSequenceBuilder::new
, into the build_builder
and build_struct
inner methods but I'm not quite sure yet. There was some small complexity in how dispatch!
would choose to use this new builder which made it less elegant.
Do you have any time to take a closer look and give some guidance on exactly how this might best be done? My 2 current options seem to be to make a completely new builder, or use a small modification on StructBuilder which then adds struct field handling to all the primitive builders.
FYI, I rebased onto the latest from develop-0.12.
Thanks, Raj
@raj-nimble good questions, what's the best option.
One option would probably be to copy the code of UnionBuilder
into a separate file and adapt it:
into_array()
. Note that this would require to call serialize_none
on the unused buildersIn pseudo code:
// probably you need to do something different here, as you plan to use `Struct` as the type with a custom strategy
pub fn from_union_builder(builder: UnionBuilder) -> Result<FlattenedUnionBuilder> {
let builders = Vec::new();
for (child_builder, child_meta) in builder.fields {
// ensure the child builders are nullable here (otherwise `serialize_none` will fail)
let ArrayBuilder::Struct(child_builder) = child_builder else { fail!("Children must be structs") };
// modify the fields to include the variant prefix, the variant name will be in child_meta.name
}
// construct the builder
Ok(FlattenedUnionBuilder { ... })
}
pub fn into_array(self) -> Result<Array> {
let mut fields = Vec::new();
// note: the meta data for the variant is most likely not used, I would simply drop it
for (_, builder) in self.fields.into_iter().enumerate() {
let ArrayBuilder::Struct(builder) = builder else { fail!("..."); };
// concatenate the fields of the different variants
for (field, meta) in builder.fields {
fields.push((idx.try_into()?, builder.into_array()?, meta));
}
}
Ok(Array::Struct(StructArray {
// note: you will most likely need to keep track of the number of elements being added, simply add self.len += 1 or similar in the different `serialize_*` methods
len: self.len,
validity: self.seq.validity,
fields,
}))
}
pub fn serialize_variant(&mut self, variant_index: u32) -> Result<&mut ArrayBuilder> {
self.len += 1;
let variant_index = variant_index as usize;
// call push_none for any variant that was not selected
for (idx, builder) in self.fields.iter_mut().enumerate() {
if idx != variant_idx {
builder.serialize_none()?;
}
}
let Some(variant_builder) = self.fields.get_mut(variant_index) else {
fail!("Could not find variant {variant_index} in Union");
};
Ok(variant_builder)
}
This MR isn't complete, still need to add the correct functionality to array builder / sequence builder. This is just as an example of the direction I was going, was hoping for your thoughts?
With this change,
from_samples
outputs Fields with the desired flattened structure.