chmp / serde_arrow

Convert sequences of Rust objects to Arrow tables
MIT License
60 stars 17 forks source link

Feature/221 serialize enums as flattened structs #222

Open raj-nimble opened 1 month ago

raj-nimble commented 1 month ago

This MR isn't complete, still need to add the correct functionality to array builder / sequence builder. This is just as an example of the direction I was going, was hoping for your thoughts?

With this change, from_samples outputs Fields with the desired flattened structure.

chmp commented 1 month ago

Hi @raj-nimble

Thanks. I like the idea a lot!

ATM, I don't really have the time for serde arrow. I will have a proper look next week!

First feedback: it's great, that you target the 0.12 release. I would like to limit the current main to bug fixes for the 0.11 release.

Re builders we can also figure it out together, once I'm back at my laptop. I imagine the deserialization logic will be a bit tricky. Maybe the impl of untagged enums in serde could be an inspiration (I like use the rust playground to expand the generated deserialize impl).

raj-nimble commented 1 month ago

Hi @chmp that's great, I'm happy you like it. I will continue to iterate slowly on my own this week and I look forward to discussing it more with you next week.

raj-nimble commented 3 weeks ago

Hi @chmp . Thanks so much for the comments! I haven't made any progress since last week as I was working on something else, but with your comments now I can hopefully circle back to this starting next week and I'll update you as soon as I have made some progress. Thanks!

raj-nimble commented 2 weeks ago

Probably, I would implement this as serialization only for now.

Agreed, it's best to implement the feature as serialization only for now. Can add a separate issue to track adding deserialization.

raj-nimble commented 3 days ago

Hi @chmp ,

I'm struggling to find an elegant way to do the serialization. I started out attempting to modify StructBuilder to handle it but the issue becomes that the value being serialized is the actual Enum and it was difficult to attempt to detect that in ArrayBuilder, and deconstruct it into it's fields to call serialize on that. If not detected, and we do something like trying to return the field ArrayBuilder, due to the serialization it ends up calling serialize_struct_field on the various primitive builders and it didn't seem like the best idea to go through and implement that serialization method on all the primitive builders (although maybe that really is the best way?).

I considered redoing the schema creation to continue to use UnionBuilder and then add logic there to serialize it as a struct instead but I think that still ran into problems with arrow writer and seemed more convoluted.

I then considered creating an entirely new builder, something like an EnumStructBuilder. I think that would move the complexity into OuterSequenceBuilder::new, into the build_builder and build_struct inner methods but I'm not quite sure yet. There was some small complexity in how dispatch! would choose to use this new builder which made it less elegant.

Do you have any time to take a closer look and give some guidance on exactly how this might best be done? My 2 current options seem to be to make a completely new builder, or use a small modification on StructBuilder which then adds struct field handling to all the primitive builders.

FYI, I rebased onto the latest from develop-0.12.

Thanks, Raj

chmp commented 3 days ago

@raj-nimble good questions, what's the best option.

One option would probably be to copy the code of UnionBuilder into a separate file and adapt it:

In pseudo code:

    // probably you need to do something different here, as you plan to use `Struct` as the type with a custom strategy
    pub fn from_union_builder(builder: UnionBuilder) -> Result<FlattenedUnionBuilder> {
        let builders = Vec::new();
        for (child_builder, child_meta) in builder.fields {
           // ensure the child builders are nullable here (otherwise `serialize_none` will fail)
           let ArrayBuilder::Struct(child_builder) = child_builder else { fail!("Children must be structs") };
           // modify the fields to include the variant prefix, the variant name will be in child_meta.name
        }

        // construct the builder
       Ok(FlattenedUnionBuilder { ... })
    }

    pub fn into_array(self) -> Result<Array> {
        let mut fields = Vec::new();

        // note: the meta data for the variant is most likely not used, I would simply drop it
        for (_, builder) in self.fields.into_iter().enumerate() {
            let ArrayBuilder::Struct(builder) = builder else { fail!("..."); };

            // concatenate the fields of the different variants
            for (field, meta) in builder.fields {
                fields.push((idx.try_into()?, builder.into_array()?, meta));
             }
        }

        Ok(Array::Struct(StructArray {
            // note: you will most likely need to keep track of the number of elements being added, simply add self.len += 1 or similar in the different `serialize_*` methods
            len: self.len,
            validity: self.seq.validity,
            fields,
        }))
    }

    pub fn serialize_variant(&mut self, variant_index: u32) -> Result<&mut ArrayBuilder> {
        self.len += 1;

        let variant_index = variant_index as usize;

        // call push_none for any variant that was not selected
        for (idx, builder) in self.fields.iter_mut().enumerate() {
           if idx != variant_idx {
               builder.serialize_none()?;
            }
         }

        let Some(variant_builder) = self.fields.get_mut(variant_index) else {
            fail!("Could not find variant {variant_index} in Union");
        };

        Ok(variant_builder)
    }