chmp / serde_arrow

Convert sequences of Rust objects to Arrow tables
MIT License

Feature/211 bool8 changelog #214

Closed: chmp closed this 1 month ago

v1gnesh commented 4 weeks ago

Hey, a qq - how do I use this via the derive attr, i.e., what do I define my field as?

EDIT: From the PRs, I see it's like this:

use serde::Deserialize;
use serde_arrow::{
    schema::{ext::Bool8Field, SchemaLike, SerdeArrowSchema, TracingOptions},
    Result,
};

fn main() -> Result<()> {
    #[derive(Deserialize)]
    struct Record {
        int_field: i32,
        nested: Nested,
    }

    #[derive(Deserialize)]
    struct Nested {
        bool_field: bool,
    }

    // Override the traced type of the nested bool field with a Bool8 field.
    let tracing_options = TracingOptions::default()
        .overwrite("nested.bool_field", Bool8Field::new("bool_field"))?;

    let schema = SerdeArrowSchema::from_type::<Record>(tracing_options)?;
    Ok(())
}

For this to be usable "transparently", I think arrow-rs needs to add a DataType impl for Bool8, right? Only then will serde_arrow be able to support something like the following?

use serde::Serialize;

// `Bool8` is the hypothetical type being asked about.
#[derive(Debug, Serialize)]
struct Ye {
    a: Bool8,
}

chmp commented 4 weeks ago

The issue is the Serde data model: there is, afaik, no way to attach additional metadata to a type.

You could write your own serialize/deserialize logic that goes via int8 (which is what your Bool8 type would do), but you would still need to annotate the field with the Bool8 metadata for it to be interpreted correctly by other tools (e.g., pyarrow).
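
To make the two halves of this concrete, here is a minimal std-only sketch. It is not serde_arrow's API: the `Bool8` wrapper and the `bool8_field_metadata` helper are hypothetical names chosen for illustration. The metadata keys and the `arrow.bool8` name follow the Arrow convention for canonical extension types. The conversion through int8 is the easy part; the field metadata is the piece a plain Serde impl cannot carry along.

```rust
use std::collections::HashMap;

/// Hypothetical wrapper (not part of serde_arrow): a boolean stored as one int8.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Bool8(pub i8);

impl From<bool> for Bool8 {
    fn from(value: bool) -> Self {
        // 0 encodes false; 1 is the preferred encoding of true.
        Bool8(value as i8)
    }
}

impl From<Bool8> for bool {
    fn from(value: Bool8) -> Self {
        // Any non-zero byte is read back as true.
        value.0 != 0
    }
}

/// Field-level metadata that marks an int8 column as Bool8 for other Arrow tools.
pub fn bool8_field_metadata() -> HashMap<String, String> {
    HashMap::from([
        ("ARROW:extension:name".to_string(), "arrow.bool8".to_string()),
        // The Bool8 extension has no parameters, so its metadata is empty.
        ("ARROW:extension:metadata".to_string(), String::new()),
    ])
}
```

Without the metadata, a reader such as pyarrow would see the column as plain int8.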

I do have an idea for addressing this issue overall, but it will take some time to implement. In the meantime, manually setting the type is the easiest way.

One idea for an optional serde_arrow feature could also be to add a tracing option to convert all bool fields to Bool8.
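
A rough sketch of what such a tracing option could do, assuming a simplified schema model. `DataType`, `Field`, and `bools_to_bool8` are made up for illustration and are not serde_arrow types. The idea is to walk the traced schema and swap every bool field, at any nesting depth, for int8 storage tagged with the Bool8 extension name.

```rust
/// Hypothetical, simplified schema model; serde_arrow's real types differ.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Bool,
    Int8,
    Int32,
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    /// Arrow extension name attached to the field, if any.
    pub extension: Option<String>,
}

impl Field {
    pub fn new(name: &str, data_type: DataType) -> Self {
        Field { name: name.to_string(), data_type, extension: None }
    }
}

/// Rewrite every bool field to int8 storage tagged as Bool8, recursing into structs.
pub fn bools_to_bool8(field: Field) -> Field {
    match field.data_type {
        DataType::Bool => Field {
            name: field.name,
            data_type: DataType::Int8,
            extension: Some("arrow.bool8".to_string()),
        },
        DataType::Struct(children) => Field {
            name: field.name,
            data_type: DataType::Struct(children.into_iter().map(bools_to_bool8).collect()),
            extension: field.extension,
        },
        // All other types pass through unchanged.
        other => Field { name: field.name, data_type: other, extension: field.extension },
    }
}
```

Applied to a schema shaped like the `Record` example earlier in the thread, only `nested.bool_field` would change.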

v1gnesh commented 4 weeks ago

> One idea for an optional serde_arrow feature could also be to add a tracing option to convert all bool fields to Bool8.

Or at least for a newtype over [u8; T], where every item in the array should be treated as Bool8.

pub struct Bool8Wrap<const T: usize>(pub [u8; T]);

used as:

pub struct BigOne {
    // The following 5 bytes need to be 5 Bool8Arrays
    a: Bool8Wrap<5>, 
}
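
As a minimal sketch of how such a wrapper might be unpacked, assuming Bool8 semantics for each byte: `to_bools` and `column_names` are hypothetical helpers, and the `a.0`, `a.1`, ... naming scheme is only one possible choice. Bool8 storage is int8, while the wrapper holds u8; the non-zero check reads the same either way.

```rust
/// Newtype from the comment above: T bytes, each meant to become its own Bool8 value.
pub struct Bool8Wrap<const T: usize>(pub [u8; T]);

impl<const T: usize> Bool8Wrap<T> {
    /// Decode each byte with Bool8 semantics: 0 is false, any non-zero byte is true.
    pub fn to_bools(&self) -> [bool; T] {
        self.0.map(|byte| byte != 0)
    }

    /// Hypothetical naming scheme for the T output columns: "a.0", "a.1", and so on.
    pub fn column_names(prefix: &str) -> Vec<String> {
        (0..T).map(|i| format!("{}.{}", prefix, i)).collect()
    }
}
```

For `a: Bool8Wrap<5>`, this would yield five boolean values and five column names derived from the field name.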

chmp commented 3 weeks ago

@v1gnesh Thanks for sticking with it. The newtype idea is very promising. I haven't thought it through completely, but I can see the outlines of a solution. I started a discussion here