Closed raj-nimble closed 1 month ago
Thanks a lot for the detailed report. That's super helpful. I will have a look and figure out what's going on!
From a brief look through, I think what's going on is this: to get the FieldRef, from_type works out the serde internal data model for the type, then calls deserialize on it to get a default representation for serde_arrow to use. The internal data model for Uuid is String, though, so it deserializes a default string, which is not a valid Uuid. Actually, it might be deserializing as a borrowed str, so it's not even calling Default::default() in the deserializer; it's just passing in an empty string literal.
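To illustrate why an empty string is a problem, here is a small analogy in Python using only the standard-library uuid module. It is not serde_arrow code and says nothing about how from_type is actually implemented; it only shows that an empty string cannot be parsed as a UUID, which is the failure I suspect the placeholder value runs into. The helper name is_valid_uuid and the sample value are made up for this sketch.

```python
import uuid


def is_valid_uuid(text: str) -> bool:
    """Return True if `text` parses as a UUID (hypothetical helper for this sketch)."""
    try:
        uuid.UUID(text)
    except ValueError:
        return False
    return True


# An empty placeholder string, like the one I suspect is being passed in.
empty_ok = is_valid_uuid("")
# A well-formed UUID string (arbitrary example value, not from my data).
sample_ok = is_valid_uuid("12345678-1234-5678-1234-567812345678")
print(empty_ok, sample_ok)
```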
So I think I can work around this by getting the field types using from_samples, I hope. I'll report back if I find more.
FYI, using from_samples does work. It's not a perfect solution, but it does mean I'm not blocked for the time being. It would be great if from_type eventually worked.
One drawback (this may in fact be totally separate and not specific to using from_samples) is that the resulting parquet file shows a uuid as a str.
>>> import polars as pl
>>> pl.read_parquet("example.pq")
shape: (3, 3)
┌─────┬─────┬─────────────────────────────────┐
│ a   ┆ b   ┆ c                               │
│ --- ┆ --- ┆ ---                             │
│ f32 ┆ i32 ┆ str                             │
╞═════╪═════╪═════════════════════════════════╡
│ 1.0 ┆ 1   ┆ 60ea4638-3956-44a5-a258-845c70… │
│ 2.0 ┆ 2   ┆ c986c97b-a9a5-4df3-887d-8ac5e5… │
│ 3.0 ┆ 3   ┆ 5fe544b6-ba04-4e34-a41e-ec5826… │
└─────┴─────┴─────────────────────────────────┘
While not a major issue at all, it would be nice if the column somehow deserialized into Python UUID objects. I'm mentioning this here, but I realize it is likely outside the scope of this crate.
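In the meantime, the strings can be converted on the Python side after reading. A minimal sketch using only the standard library, assuming the values come back as canonical UUID strings. The sample values below are made up (the real ones are truncated in the output above), and to_uuid_objects is a hypothetical helper name.

```python
import uuid


def to_uuid_objects(values):
    """Convert an iterable of UUID strings into uuid.UUID objects."""
    return [uuid.UUID(v) for v in values]


# Made-up stand-ins for the strings in column "c".
column_c = [
    "12345678-1234-5678-1234-567812345678",
    "00000000-0000-0000-0000-000000000001",
]
parsed = to_uuid_objects(column_c)
print(parsed)
```

With polars, the input list would presumably come from the string column itself; that step is left out here to keep the sketch free of third-party dependencies.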
@raj-nimble Thanks for the investigation. From what I understand, your explanation is spot on. I am afraid though there is not much that can be done about it. from_type only sees the type without further information, so it knows the object is of type "String", but cannot know anything about the content.
Re. serializing the type as UUID: it seems there is a canonical extension type for UUID. It would definitely make sense to add some functionality to simplify dealing with UUIDs. It is currently not supported by pyarrow, though. Oh, and polars does not have support for extension types currently, so the resulting files would not be readable from polars, as far as I understand.
Finally, thanks to your report, I also realized that the "human_readable" flag is inconsistently set throughout the crate. I think it should be false throughout, as I would expect this to result in smaller data files. In your case (UUIDs), this would result in the data being serialized as Bytes. Probably less readable than it is now, but with much smaller file sizes (and it would be compatible with the UUID extension type).
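As a rough illustration of the size difference, here is a sketch in Python (standard library only, not serde_arrow code). The hyphenated string form of a UUID takes 36 bytes as UTF-8, while the raw form is 16 bytes, and the bytes convert back to the same UUID. The sample value is arbitrary.

```python
import uuid

u = uuid.UUID("12345678-1234-5678-1234-567812345678")  # arbitrary sample value

as_text = str(u)    # human-readable form
as_bytes = u.bytes  # compact binary form

text_len = len(as_text.encode("utf-8"))
bytes_len = len(as_bytes)
restored = uuid.UUID(bytes=as_bytes)
print(text_len, bytes_len, restored == u)  # prints: 36 16 True
```

These are per-value sizes only; the effect on an actual file will depend on the encoding and compression in use.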
I think the following changes would make sense:
from_type in its docs
I added a warning on using from_type for Uuid to the docs with #206. I am afraid due to the way Serde works, there is not much else that I can do here.
Thanks a lot for bringing this issue to my attention!
Thanks @chmp for the follow up.
I am having an issue using serde_arrow to extract Fields from structs containing Uuid. When I try to create fields from a record containing a Uuid type, the program panics with the following error:
Here is a reproducible example that is extended from the example given in the serde_arrow crates.io page.
The Cargo.toml dependencies I used for this were
My rust version info