chmp / serde_arrow

Convert sequences of Rust objects to Arrow tables
https://docs.rs/serde_arrow/
MIT License

feature request: have a TracingOption to represent enums as flattened structs instead of unions #221

Open raj-nimble opened 3 months ago

raj-nimble commented 3 months ago

Hi Chris,

I'd like to propose a new feature: flattening Rust enums into a struct (or map?) with a wide schema. The fields of the non-selected variants would be None/null, and deserialization would check which fields are set to work out the variant.

This would be a workaround for the fact that Parquet does not support unions (and support doesn't appear to be coming any time soon). Ideally it would be generic enough to be useful to anyone who wants it, and this crate seems like the natural home for it.

For the interface, I would hope it would be a simple option set in TracingOptions, e.g. TracingOptions::new().flatten_enums_into_structs(true).

If you agree with the feature but don't think you'll have time to work on it yourself, I would like to try to help implement it, although obviously you would implement it much faster than I would. I noticed from your branch activity that you appear to be actively working on version 0.12 of the crate. Maybe you are already thinking about this, or would prefer to delay a feature like this until the next version? Either way, I would love to discuss the possibilities. Please let me know your thoughts.

Thanks, Raj

raj-nimble commented 3 months ago

As an example, imagine we had the following Rust enum:

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
enum RecordEnum {
    Inside { room: String },
    Outside { street: String, zipcode: u16 },
}

I think given the option, we could map that to an equivalent flattened struct like the following in terms of the Arrow Field:

#[derive(Serialize, Deserialize)]
struct RecordEnumStruct {
    inside_room: Option<String>,
    outside_street: Option<String>,
    outside_zipcode: Option<u16>,
}

Each field name is prefixed with its lowercased variant name to prevent collisions between variants that share a field name. I have an outer record like this, holding both types:

#[derive(Serialize, Deserialize)]
struct Record {
    a: RecordEnum,
    b: RecordEnumStruct,
}
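The naming scheme used for the flattened struct above could be sketched roughly as follows. This is only an illustration using the standard library: the helper names flattened_field_name and flattened_field_names are hypothetical and are not part of serde_arrow, and the lowercase-plus-underscore convention is simply the one used in this example.

```rust
/// Hypothetical helper (not part of serde_arrow): build the flattened column
/// name for one enum field, e.g. ("Outside", "zipcode") -> "outside_zipcode".
fn flattened_field_name(variant: &str, field: &str) -> String {
    format!("{}_{}", variant.to_lowercase(), field)
}

/// Hypothetical helper: list the flattened column names for a set of
/// variants, each given as (variant name, field names), in declaration order.
fn flattened_field_names(variants: &[(&str, &[&str])]) -> Vec<String> {
    let mut names = Vec::new();
    for (variant, fields) in variants {
        for field in fields.iter() {
            names.push(flattened_field_name(variant, field));
        }
    }
    names
}
```

For RecordEnum this would yield inside_room, outside_street and outside_zipcode. Prefixing with the variant name keeps two variants that both declare, say, a name field from colliding in the wide struct.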

Represented as Arrow fields, instead of this:

    Field {
        name: "a",
        data_type: Union(
            [
                (
                    0,
                    Field {
                        name: "Inside",
                        data_type: Struct(
                            [
                                Field {
                                    name: "room",
                                    data_type: LargeUtf8,
                                    nullable: false,
                                    dict_id: 0,
                                    dict_is_ordered: false,
                                    metadata: {},
                                },
                            ],
                        ),
                        nullable: false,
                        dict_id: 0,
                        dict_is_ordered: false,
                        metadata: {},
                    },
                ),
                (
                    1,
                    Field {
                        name: "Outside",
                        data_type: Struct(
                            [
                                Field {
                                    name: "street",
                                    data_type: LargeUtf8,
                                    nullable: false,
                                    dict_id: 0,
                                    dict_is_ordered: false,
                                    metadata: {},
                                },
                                Field {
                                    name: "zipcode",
                                    data_type: UInt16,
                                    nullable: false,
                                    dict_id: 0,
                                    dict_is_ordered: false,
                                    metadata: {},
                                },
                            ],
                        ),
                        nullable: false,
                        dict_id: 0,
                        dict_is_ordered: false,
                        metadata: {},
                    },
                ),
            ],
            Dense,
        ),
        nullable: false,
        dict_id: 0,
        dict_is_ordered: false,
        metadata: {},
    }

We would auto-convert to this:

    Field {
        name: "b",
        data_type: Struct(
            [
                Field {
                    name: "inside_room",
                    data_type: LargeUtf8,
                    nullable: true,
                    dict_id: 0,
                    dict_is_ordered: false,
                    metadata: {},
                },
                Field {
                    name: "outside_street",
                    data_type: LargeUtf8,
                    nullable: true,
                    dict_id: 0,
                    dict_is_ordered: false,
                    metadata: {},
                },
                Field {
                    name: "outside_zipcode",
                    data_type: UInt16,
                    nullable: true,
                    dict_id: 0,
                    dict_is_ordered: false,
                    metadata: {},
                },
            ],
        ),
        nullable: false,
        dict_id: 0,
        dict_is_ordered: false,
        metadata: {},
    }

With the first field (a, the Union-typed one) disabled, I can now write Parquet files just fine. If we could do this conversion for the user automatically, I think it would be quite useful.
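To make the idea of "checking the fields for each variant" concrete, here is a minimal hand-written sketch of the round trip between the enum and its flattened form. It is not how serde_arrow does or would implement the option. The serde derives are left out so the snippet only needs the standard library, and the From/TryFrom impls and the String error type are assumptions chosen for illustration.

```rust
use std::convert::TryFrom;

#[derive(Debug, Clone, PartialEq)]
enum RecordEnum {
    Inside { room: String },
    Outside { street: String, zipcode: u16 },
}

#[derive(Debug, Clone, PartialEq, Default)]
struct RecordEnumStruct {
    inside_room: Option<String>,
    outside_street: Option<String>,
    outside_zipcode: Option<u16>,
}

// Flatten: set the fields of the selected variant, leave all others as None.
impl From<RecordEnum> for RecordEnumStruct {
    fn from(value: RecordEnum) -> Self {
        match value {
            RecordEnum::Inside { room } => RecordEnumStruct {
                inside_room: Some(room),
                ..Default::default()
            },
            RecordEnum::Outside { street, zipcode } => RecordEnumStruct {
                outside_street: Some(street),
                outside_zipcode: Some(zipcode),
                ..Default::default()
            },
        }
    }
}

// Unflatten: accept a row only if exactly one variant's fields are all set.
impl TryFrom<RecordEnumStruct> for RecordEnum {
    type Error = String;

    fn try_from(value: RecordEnumStruct) -> Result<Self, Self::Error> {
        match (value.inside_room, value.outside_street, value.outside_zipcode) {
            (Some(room), None, None) => Ok(RecordEnum::Inside { room }),
            (None, Some(street), Some(zipcode)) => {
                Ok(RecordEnum::Outside { street, zipcode })
            }
            _ => Err("fields do not match exactly one variant".to_string()),
        }
    }
}
```

One thing this sketch makes visible: a variant without fields (or with only optional fields) would flatten to a row of all nulls, so a generic implementation would probably need some extra way to tell such variants apart.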

raj-nimble commented 3 months ago

Initial draft PR: https://github.com/chmp/serde_arrow/pull/222