Ten0 / serde_avro_fast

An idiomatic implementation of serde/avro (de)serialization
GNU Lesser General Public License v3.0

"failed schema validation" error when interacting with Pub/Sub #5

Closed realsama closed 9 months ago

realsama commented 10 months ago

I am encountering an issue with the serialization of Avro schemas containing nullable fields. Specifically, I have a schema with a nullable field with the schema defined as follows:

{
    "name": "created_on",
    "type": ["null", "string"],
    "default": null
}

In my Rust struct, I'm using Option to represent this nullable field. However, after serialization, serde_avro_fast does not seem to return Avro's preferred format of created_on: {string: "value"}, resulting in a Message failed schema validation error when interacting with Pub/Sub.

Expected Behavior

I expect serde_avro_fast library to serialize nullable fields in the Avro schema according to the Avro specification, producing a format like created_on: {string: "value"}.

Thanks

Ten0 commented 10 months ago

Hi! Thanks for opening this issue.

I've added a round trip test for this to the test suite and it seems to pass, so it seems that I might have misunderstood what you're expecting: https://github.com/Ten0/serde_avro_fast/blob/b4c3fe0042d33e7ec424cfe1695213a22811964f/tests/round_trips.rs#L69-L72

Trying to understand it:

Thanks! 🙂

realsama commented 10 months ago

@Ten0 Thanks for replying.

My initial struct looks like the one below.

#[derive(Debug, Serialize, Deserialize)]
pub struct GenericEvent<'a> {
    pub action: EventAction,
    #[serde(borrow)]
    pub entity: GenericData<'a>,
    pub created_on: Option<&'a str>,
    pub updated_on: Option<&'a str>,
}

Serializing this and pushing it to pubsub triggers a schema validation error. If I instead explicitly model the "field: {string: "value"}" semantics, I can publish to pubsub directly by using "serde_json::to_vec".

#[derive(Debug, Serialize, Deserialize, Copy, Clone)]
pub struct StringField<'a> {
    pub string: &'a str,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct GenericEvent<'a> {
    pub action: EventAction,
    #[serde(borrow)]
    pub entity: GenericData<'a>,
    pub created_on: Option<StringField<'a>>,
    pub updated_on: Option<StringField<'a>>,
}

My premature conclusion was that serde_avro_fast is not deserializing correctly. Based on your comment, is my schema faulty? Thanks

Ten0 commented 10 months ago

after serialization, serde_avro_fast does not seem to return Avro's preferred format of created_on: {string: "value"}

I don't understand what this means. Avro is a binary format. There is no such thing, after serialization, as created_on: {string: "value"}. Serializing a String or &str or Option<&str> that contains "value" or Some("value") into the following schema: {"name": "null_or_string","type": ["null", "string"], "default": null} is expected to produce the following bytes (serializing by hand from memory so please don't be too harsh if I get it wrong):

02 0a 76 61 6c 75 65

where the first byte is the union discriminant, the second is the string length, and the rest is the ASCII encoding of "value". I'm pretty sure that's what you obtain when calling a serialization function from this library, e.g. to_datum_vec.
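To double-check those bytes by hand, here is a minimal std-only sketch of the relevant part of Avro's binary encoding: zigzag varints for the union branch index and the string length, followed by the UTF-8 bytes. The names write_long and encode_nullable_string are hypothetical helpers made up for this illustration; they are not part of serde_avro_fast and only cover the ["null", "string"] union.

```rust
/// Zigzag-encodes a signed integer and writes it as a base-128 varint,
/// which is how Avro's binary encoding stores `int` and `long` values.
fn write_long(value: i64, out: &mut Vec<u8>) {
    let mut n = ((value << 1) ^ (value >> 63)) as u64;
    while n >= 0x80 {
        // Low 7 bits, with the continuation bit set.
        out.push((n as u8 & 0x7f) | 0x80);
        n >>= 7;
    }
    out.push(n as u8);
}

/// Hypothetical helper (not a serde_avro_fast API): hand-encodes a value
/// for the union `["null", "string"]`.
fn encode_nullable_string(value: Option<&str>) -> Vec<u8> {
    let mut out = Vec::new();
    match value {
        // Branch 0 is "null": only the discriminant is written.
        None => write_long(0, &mut out),
        // Branch 1 is "string": discriminant, byte length, then UTF-8 bytes.
        Some(s) => {
            write_long(1, &mut out);
            write_long(s.len() as i64, &mut out);
            out.extend_from_slice(s.as_bytes());
        }
    }
    out
}
```

Calling encode_nullable_string(Some("value")) yields the seven bytes listed above, and encode_nullable_string(None) yields the single byte 00.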

My initial struct looks like the one below.

What data do you fill it with that isn't serialized as it should be? What serialized value do you obtain with to_datum_vec and what would you instead expect to obtain? (Also I can't see what EventAction is or what GenericData is)

My premature conclusion was that serde_avro_fast is not deserializing correctly

I thought this was about serialization? ("after serialization, serde_avro_fast does not seem to return Avro's preferred format") If it is about deserialization, what schema with what datum would you expect to deserialize into what structs (with e.g. from_datum_slice) and doesn't, or how are the resulting values in the struct counter-intuitive?

pub created_on: Option<StringField<'a>>,

IIRC this in particular can't be serialized to or deserialized from a ["null", "string"] union using this library, but that would be expected behavior since StringField is a Rust struct, so ~an Avro record or map, and not an Avro union (~Rust enum). If you want to (de)serialize into ["null", "string"] using this library, I would recommend using Option<&'a str> (or even just &'a str would work if in your app you're always setting this field). Alternatively enum MyEnum<'a> { String(&'a str) } or struct String<'a>(&'a str), or Option of those things, or enum MyEnum<'a> { Null, String(&'a str) } would also work.

Trying to serialize and pushing this to pubsub triggers a schema validation error

I seem to be missing the schema and struct instances here, as well as the code you're using to serialize. Can you please provide a full example with full structs and schema and data to put in the structs and function calls that do something unexpected?

There's this thread that is pubsub-related where somebody seems to have a similar issue to yours: https://www.googlecloudcommunity.com/gc/Data-Analytics/Not-supporting-AVRO-schema-with-default-null-value/m-p/607765/highlight/true

Somebody there seems to think that "In Avro, if a field is declared as a union of types, the data for that field needs to be serialized as a JSON object where the key is the type and the value is the actual value". But that is incorrect according to the specification of Unions: https://avro.apache.org/docs/current/specification/#unions-1

They seem to be doing the same thing as you're trying to do though, but I'm struggling to understand what it is. Something like deserializing avro via this library then pushing it as json (via serde_json) to another service that will try and re-serialize as avro based on the JSON? (And what they call "in avro" is actually "in pubsub's own json-to-avro convertor"?) Then why wouldn't you just serialize it as avro on your end using this library and push it as such, instead of pushing json and letting it have a hard time serializing json to avro because of these weird {"string": "value"} restrictions? (Or even, if you've received something in avro format, and you try to send it back in the same avro format, why would you deserialize it at all?)

Can you please provide me with additional details as to what precisely you're trying to achieve?

NB: If you were to define a Rust enum like this:

#[derive(Serialize, Deserialize)]
enum MyEnum {
    #[serde(rename = "string", alias = "String")]
    String(String),
}

that would both deserialize from an avro ["null", "string"] when (de)serializing using this library, and serialize as {"string": "value"} if you were to put it through serde_json. But I have no clue why you would want to do that. And also that would be about serde_json's serialization and pubsub's json-to-avro convertor, not about serde_avro_fast's serialization or deserialization (which does succeed at deserializing ["null", "string"] into the intuitive dedicated Option<&str>, and does succeed at serializing the intuitive Option<&str> as a ["null", "string"] union according to the specification).
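For illustration only, the wrapped JSON shape discussed in this thread can be sketched without serde at all. This assumes, as reported above, that pubsub's JSON path wants a non-null union value wrapped as {"string": ...}. The helpers json_string and wrapped_union_json are hypothetical names made up for this sketch, and in real code serde_json would handle the escaping.

```rust
/// Hypothetical helper: renders `s` as a JSON string literal, escaping only
/// the characters that JSON requires to be escaped.
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Hypothetical helper: renders a nullable string in the wrapped JSON shape,
/// i.e. `null` for the null branch and `{"string":"..."}` for the string branch.
fn wrapped_union_json(value: Option<&str>) -> String {
    match value {
        None => "null".to_string(),
        Some(s) => format!("{{\"string\":{}}}", json_string(s)),
    }
}
```

Here wrapped_union_json(Some("value")) produces {"string":"value"}, which is a JSON text, not the Avro binary datum that to_datum_vec produces for the same value.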

realsama commented 9 months ago

@Ten0 Thank you for your response. Apparently, this issue is due to Pubsub's treatment of Unions. Thank you!