Inconsistency between serialize and deserialize in Serde

chmp commented 2 hours ago

jiff uses inconsistent Serde types in serialization / deserialization. jiff types

generally serialize to string (serializer.collect_str()) (example)
but request bytes in deserialization (deserializer.deserialize_bytes(visitor)) (example)

It's not a huge issue, but I wanted to give a heads up: this inconsistency tripped me up in a couple of places, while adding support for jiff to my crate. I am now making my own impl more robust, by adding a deserialize_bytes impl to every deserialize_str impl :)

BurntSushi commented 2 hours ago

Sorry, I'm not understanding the problem. Does this lead to an actual issue in your code? Like it makes sense that these would be different at a conceptual level. Serialization is guaranteed to produce valid UTF-8, so it serializes a string. But deserialization can be done on a &[u8], which may not be valid UTF-8.

chmp commented 2 hours ago

Not really an issue: I simply need to add deserialize_bytes impls that hand the bytes of the string to to the visitor.

I was merely surprised: so far I only encountered types that use deserialize_str in their Deserialize impl, when using collect_str in their Serialize impl. I always interpreted the deserialize_* methods used to indicate the "natural" type of the value and I expected that to be the type serialized to (string in this case).

BurntSushi commented 1 hour ago

Do you agree that it makes sense given that parsing works on &[u8] and therefore, deserialization accepting either &[u8] or &str is maximally flexible? Like, if the Deserialize impl didn't accept &[u8], then you couldn't deserialize data types from bytes. It would have to be string.

Now if you're layering stuff on top of this where only strings make sense, then it should be fine if you just use deserialize_str, right? It won't be maximally flexible, but maybe it doesn't need to be. It depends on your use case.

(It's possible I have a misunderstanding here.)

chmp commented 1 hour ago

Oh sorry. I think I did not explain myself clearly. I completely agree that accepting both &[u8] and &str is a good idea.

The only thing that would change would be:

impl<'de> serde::Deserialize<'de> for Zoned {
    fn deserialize<D: serde::Deserializer<'de>>(deserializer: D) -> Result<Zoned, D::Error> {
        use serde::de;

        struct ZonedVisitor;
        // ...

-        deserializer.deserialize_bytes(ZonedVisitor)
+        deserializer.deserialize_str(ZonedVisitor)
    }
}

E.g., in serde_arrow I use the deserializer.deserialize_* call to detect the "natural" / "preferred" type of a value. serde_reflect is doing the same. The deserializer is still free to hand your visitor a &[u8].

As I said most types I encountered were symmetric in the Serde type they serialized to and the type they requested from the deserializer. Not sure whether this is a general good practice in the implementation of the Serde traits. Hence, I wanted to give you a quick heads up.

BurntSushi commented 1 hour ago

You're saying that if I used deserialize_str instead of deserialize_bytes, then it could still accept bytes? Maybe I have a misunderstanding of Serde itself. What is the actual behavioral difference between deserialize_str and deserialize_bytes?

chmp commented 1 hour ago

Puh. I have to admit the Serde API with its asymmetry between Serialize and Deserialize is still somewhat confusing. Re. this issue:

The deserialize_* methods are hints to the deserializer for how to interpret the next piece of input [...] Note that a Deserializer will not necessarily follow the type hint, so the call to deserialize_i32 does not necessarily mean the Deserializer will call I32Visitor::visit_i32

(source).

So in this case, the fact that you are calling deserialize_bytes does not all guarantee that you get &[u8], in principle the deserializer could also hand you back a i64.

I always interpreted the call to deserialize_* to indicate the "preferred" type.

BurntSushi commented 1 hour ago

Interesting. I'll leave this open for now since it seems deserving of a deeper dive on my part. The visitors in this crate do also accept both strings and bytes.

I agree that a string is, I guess, the "preferred" type in this context.

chmp commented 1 hour ago

I guess, this change would also be a visible, hence breaking, one. So not sure it's worth the effort.

BurntSushi commented 1 hour ago

Well I guess that gets to my question: what is the actual behavior difference?

In either case, my plan with Jiff is to do 1-2 breaking changes leading up to Summer 2025, and then publish Jiff 1.0. So if it's really a breaking change, that's fine, it can be included in jiff 0.2.

BurntSushi / jiff

Inconsistency between serialize and deserialize in Serde #138