chmp / serde_arrow

Convert sequences of Rust objects to Arrow tables
MIT License
60 stars 17 forks source link

Cannot deserialize Uuid #203

Closed raj-nimble closed 1 month ago

raj-nimble commented 1 month ago

I am having an issue using serde_arrow to extract Fields from structs containing Uuid.

When I try to create fields from a record containing a Uuid type, the program panics with the following error:

thread 'main' panicked at src/main.rs:35:10:
failed to get schema: <Error: serde::de::Error: UUID parsing failed: invalid length: expected length 32 for simple format, found 0
Backtrace:
   0: serde_arrow::internal::error::Error::custom
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_arrow-0.11.6/src/internal/error.rs:30:24
   1: <serde_arrow::internal::error::Error as serde::de::Error>::custom
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_arrow-0.11.6/src/internal/error.rs:129:9
   2: uuid::external::serde_support::<impl serde::de::Deserialize for uuid::Uuid>::deserialize::de_error
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/uuid-1.10.0/src/external/serde_support.rs:60:13
   3: core::ops::function::FnOnce::call_once
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
   4: core::result::Result<T,E>::map_err
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/result.rs:829:27
   5: <uuid::external::serde_support::<impl serde::de::Deserialize for uuid::Uuid>::deserialize::UuidVisitor as serde::de::Visitor>::v
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/uuid-1.10.0/src/external/serde_support.rs:74:21
   6: serde::de::Visitor::visit_borrowed_str
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.204/src/de/mod.rs:1515:9
   7: <serde_arrow::internal::schema::from_type::TraceAny as serde::de::Deserializer>::deserialize_str
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_arrow-0.11.6/src/internal/schema/from_type/mod.
   8: uuid::external::serde_support::<impl serde::de::Deserialize for uuid::Uuid>::deserialize
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/uuid-1.10.0/src/external/serde_support.rs:109:13
   9: <core::marker::PhantomData<T> as serde::de::DeserializeSeed>::deserialize
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.204/src/de/mod.rs:801:9
  10: <serde_arrow::internal::schema::from_type::TraceStruct as serde::de::MapAccess>::next_value_seed
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_arrow-0.11.6/src/internal/schema/from_type/mod.
  11: serde::de::MapAccess::next_value
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.204/src/de/mod.rs:1874:9
  12: <purr::_::<impl serde::de::Deserialize for purr::Record>::deserialize::__Visitor as serde::de::Visitor>::visit_map
             at ./src/main.rs:7:21
  13: <serde_arrow::internal::schema::from_type::TraceAny as serde::de::Deserializer>::deserialize_struct
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_arrow-0.11.6/src/internal/schema/from_type/mod.
  14: purr::_::<impl serde::de::Deserialize for purr::Record>::deserialize
             at ./src/main.rs:7:21
  15: serde_arrow::internal::schema::from_type::<impl serde_arrow::internal::schema::tracer::Tracer>::from_type
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_arrow-0.11.6/src/internal/schema/from_type/mod.
  16: <serde_arrow::internal::schema::SerdeArrowSchema as serde_arrow::internal::schema::SchemaLike>::from_type
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_arrow-0.11.6/src/internal/schema/mod.rs:304:9
  17: serde_arrow::arrow_impl::schema::<impl serde_arrow::internal::schema::SchemaLike for alloc::vec::Vec<alloc::sync::Arc<arrow_sche
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_arrow-0.11.6/src/arrow_impl/schema.rs:140:12
  18: purr::main
             at ./src/main.rs:34:18
  19: core::ops::function::FnOnce::call_once
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
  20: std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:155:18
  21: std::rt::lang_start::{{closure}}
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:166:18
  22: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:284:13
  23: std::panicking::try::do_call
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:552:40
  24: std::panicking::try
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:516:19
  25: std::panic::catch_unwind
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panic.rs:146:14
  26: std::rt::lang_start_internal::{{closure}}
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:148:48
  27: std::panicking::try::do_call
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:552:40
  28: std::panicking::try
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:516:19
  29: std::panic::catch_unwind
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panic.rs:146:14
  30: std::rt::lang_start_internal
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:148:20
  31: std::rt::lang_start
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:165:17
  32: main
  33: __libc_start_main
             at /build/glibc-LcI20x/glibc-2.31/csu/../csu/libc-start.c:308:16
  34: _start
>
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panicking.rs:72:14
   2: core::result::unwrap_failed
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/result.rs:1654:5
   3: core::result::Result<T,E>::expect
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/result.rs:1034:23
   4: purr::main
             at ./src/main.rs:34:18
   5: core::ops::function::FnOnce::call_once
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5

Here is a reproducible example that is extended from the example given in the serde_arrow crates.io page.

use arrow::datatypes::FieldRef;
use parquet::arrow::ArrowWriter;
use serde::{Deserialize, Serialize};
use serde_arrow::schema::{SchemaLike, TracingOptions};
use uuid::Uuid;

#[derive(Serialize, Deserialize)]
struct Record {
    a: f32,
    b: i32,
    c: Uuid,
}

fn main() {
    let records = vec![
        Record {
            a: 1.0,
            b: 1,
            c: Uuid::new_v4(),
        },
        Record {
            a: 2.0,
            b: 2,
            c: Uuid::new_v4(),
        },
        Record {
            a: 3.0,
            b: 3,
            c: Uuid::new_v4(),
        },
    ];

    // Determine Arrow schema
    let fields = Vec::<FieldRef>::from_type::<Record>(TracingOptions::default())
        .expect("failed to get schema");

    // Build a record batch
    let batch =
        serde_arrow::to_record_batch(&fields, &records).expect("failed to build record batch");

    let file = std::fs::File::create("example.pq").expect("failed to create parquet file");
    let mut writer =
        ArrowWriter::try_new(file, batch.schema(), None).expect("failed to create writer");

    writer.write(&batch).expect("failed to write batch");
    writer.close().expect("failed to close writer");
}

The Cargo.toml dependencies I used for this were

[dependencies]
arrow = "52.2.0"
parquet = "52.2.0"
serde = { version = "1.0.204", features = ["derive"] }
serde_arrow = { version = "0.11.6", features = ["arrow-52"] }
uuid = { version = "1.10.0", features = ["serde", "v4"] }

My rust version info

$ rustup --version
rustup 1.27.1 (54dd3d00f 2024-04-24)
info: This is the version for the rustup toolchain manager, not the rustc compiler.
info: The currently active `rustc` version is `rustc 1.78.0 (9b00956e5 2024-04-29)`
chmp commented 1 month ago

Thanks a lot for the detailed report. That's super helpful. I will have a look and figure out what's going on!

raj-nimble commented 1 month ago

From a brief look through, I think what might be going on is, trying to get the FieldRef by deserializing from_type figures out what the serde internal data model is for that type then calls deserialize on it to get the default representation of it for serde_arrow to use, but the internal data model for Uuid is String, and so it's deserializing a default string which is actually not valid for a Uuid. Actually, it might be deserializing as a borrowed str, so it's not even calling Default::default() in the deserializer, it's actually just passing in an empty string literal.

So I think I can work around this by getting the field types using from_samples, I hope. I'll report back if I find more.

raj-nimble commented 1 month ago

FYI, using from_samples does work. It's not a perfect solution but it does mean I'm not blocked for the time being. It would be great if from_type eventually worked.

One drawback (this may in fact be totally separate and not specific to using from_samples) is that the resulting parquet file shows a uuid as a str.

>>> import polars as pl
>>> pl.read_parquet("example.pq")
shape: (3, 3)
┌─────┬─────┬─────────────────────────────────┐
│ a   ┆ b   ┆ c                               │
│ --- ┆ --- ┆ ---                             │
│ f32 ┆ i32 ┆ str                             │
╞═════╪═════╪═════════════════════════════════╡
│ 1.0 ┆ 1   ┆ 60ea4638-3956-44a5-a258-845c70… │
│ 2.0 ┆ 2   ┆ c986c97b-a9a5-4df3-887d-8ac5e5… │
│ 3.0 ┆ 3   ┆ 5fe544b6-ba04-4e34-a41e-ec5826… │
└─────┴─────┴─────────────────────────────────┘

While not a major issue at all, it would be nice if somehow it deserialized into a python UUID object. I'm mentioning this here but I realize it likely is outside the scope of this crate.

chmp commented 1 month ago

@raj-nimble Thanks for the investigation. From what I understand, your explanation is spot on. I am afraid though there is not much that can be done about it. from_type only sees the type without further information, so it knows the object is of type "String", but cannot know anything about the content.

Re. serializing the type as UUID: it seems there is a canonical extension type for UUID. It would definitely make sense to add some functionality to simplify dealing with UUIDs. It is currently not supported by pyarrow, though. Oh. And polars does not have support for extension types currently, so resulting files would not readable from polars, as far as I understand.

Finally, thanks to your report, I also realized that the "human_readable" flag is inconsistently set throughout the crate. I think, it should be false throughout, as I would expect this to result in smaller data files throughout. In your case (UUIDs), this would result the data to be serialized as Bytes. Probably less readable than it is now, but much smaller file sizes (and it would be compatible with the UUID extension type).

I think the following changes would make sense:

chmp commented 1 month ago

I added a warning on using from_type for Uuid to the docs with #206. I am afraid due to the way Serde works, there is not much else that I can do here.

Thanks a lot for bringing this issue to my attention!

raj-nimble commented 1 month ago

Thanks @chmp for the follow up.