flipp-oss / deimos

Framework to work with Kafka, Avro and ActiveRecord

Schema classes do not permit schema evolution #193

Closed iMacTia closed 1 year ago

iMacTia commented 1 year ago

We recently bumped into an issue in our application that was caused by the evolution of one of our Avro schemas. The schema had previously been used in our Rails app to generate a schema class (we use the use_schema_classes option) and was successfully being consumed.

Earlier this week, the producer of the messages using this schema decided to evolve the schema. They followed the documentation and made a "backward compatible" change by adding an optional field. This action is permitted under the schema evolution settings and as explained in this guide.

As soon as the schema was evolved and new messages were produced with the new schema, our app began to raise errors. The problem is that the consumed payload now contains the new (optional) field, which in theory should be ignored by consumers that have not been updated yet. However, since Deimos schema classes use a keyword-arguments initializer, this causes an unknown keyword: :new_field_name (ArgumentError) when the deserialized payload is used to initialize the schema class.
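To illustrate the failure mode (with a hypothetical, simplified class rather than the actual generated code):

```ruby
# Simplified stand-in for a generated schema class with a strict
# keyword-arguments initializer.
class MyRecord
  attr_accessor :id, :name

  def initialize(id:, name:)
    @id = id
    @name = name
  end
end

# After the producer evolves the schema, the deserialized payload
# carries the new optional field...
payload = { id: 1, name: 'foo', new_field_name: 'bar' }

# ...so splatting it into the initializer raises:
#   unknown keyword: :new_field_name (ArgumentError)
MyRecord.new(**payload)
```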

After discussing this internally, we agreed this looks like a bug in the way schema classes are implemented: they are too rigid and inflexible to allow schemas to evolve in a backward-compatible way.

Before submitting a PR, I wanted to report this issue and ask if anyone else has had this issue already. Are we doing something wrong, or is there something we're missing?

Solution Proposal

There are different ways this could be solved.

  1. The code initializing the schema class (Utils::SchemaClass) could filter the payload down to the schema's known fields before calling the schema class initializer.
  2. The schema class initializers could be generated with an extra **_kwargs argument that would swallow any extra fields passed to the initializer, thus allowing these to be safely ignored in this instance.

Although my preference would be to go with solution 1 (more explicit behaviour), it has two downsides: i) it will only work if users use the Utils::SchemaClass helper; and ii) it would require the helper to somehow fetch the list of expected fields from the schema class.

For these reasons, solution 2 might actually be easier to implement, but I'm not sure if this might have undesired side-effects.
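To make the two options concrete, here is a rough sketch of each (using the hypothetical MyRecord class from above, not the actual Deimos internals):

```ruby
# Option 1: filter the payload down to the fields the schema class
# knows about before calling its initializer.
known_fields = %i[id name]
MyRecord.new(**payload.slice(*known_fields))

# Option 2: generate initializers that swallow unknown keywords,
# so extra fields from an evolved writer schema are silently ignored.
class MyRecord
  attr_accessor :id, :name

  def initialize(id:, name:, **_kwargs)
    @id = id
    @name = name
  end
end

MyRecord.new(**payload) # no longer raises
```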

dorner commented 1 year ago

Hmm... if your reader schema lives in your repo and you run the schema class generator on that schema, you should be able to re-run the generator whenever you update the reader schema, and the generated class will reflect whatever is in that schema. When you add the field, you re-run the generator and the new field shows up on the class.

Is that not working? If so I might be misunderstanding your workflow.

iMacTia commented 1 year ago

Oh yes, if we were to download the new version of the schema and re-generate the schema class, everything would work 🙌! (in fact, this is what we did once we spotted the error 😄)

BUT, we work in a complex system with hundreds of services, so a message might be consumed by many services, and we can't rely on them all being updated at the exact same time the schema evolves.

However, you might be right actually, reading the Order of upgrading clients section, it says clearly that for backwards compatibility:

there is no assurance that consumers using older schemas can read data produced using the new schema. Therefore, upgrade all consumers before you start producing new events.

So in our case, we might have done it backwards: we should have updated the consumer first, then the producer, and we would have avoided this issue.

I'll bring this back to the team and we'll discuss it further, thank you for the super-quick and eye-opening reply 🙏

iMacTia commented 1 year ago

Closing for now, will reopen if we have any follow up

dorner commented 1 year ago

there is no assurance that consumers using older schemas can read data produced using the new schema. Therefore, upgrade all consumers before you start producing new events.

This actually depends on your compatibility settings. Adding a new field with a default is fully compatible both ways and shouldn't break readers. This is more about workflow. We have the reader schemas actually live inside the repo, so any change has to happen in the repo. If you have to explicitly change your reader schema, you change the schema class at the same time without issue.

If you rely entirely on the writer schema, it's dangerous because you're now getting more (or less) information without realizing it. If that is your workflow, then you're right, the schema classes don't entirely work because they don't have backwards compatibility built in.

iMacTia commented 1 year ago

We have the reader schemas actually live inside the repo, so any change has to happen in the repo. If you have to explicitly change your reader schema, you change the schema class at the same time without issue.

I wish it was that simple for us 😄

Our "source of truth" for schemas is a cloud schema registry. We update that via source control using a Gradle pipeline. That means that when another team evolves one of the schemas we consume, our application's (and all the other applications') reader schema is not updated automatically. We have to manually pull the new version of the schema and re-generate the schema classes.

Our understanding was that we could do this AFTER the new schema started being produced, relying on the old schema version, but it turns out backwards compatibility doesn't work like that and we should have updated our reader schema before the new schema was used to produce messages.

At least, that's what we got from the Confluent documentation I shared above, so that's where we should adjust. Although the change I suggested above could have made this work, it seems like we really need to change our workflow to make sure schemas evolved in a backward-compatible way are not used to produce messages until all consumers have updated their reader schemas.

dorner commented 1 year ago

Backwards compatibility should work like that. If you add a new field, and that field has a default, reader schemas that don't have that field should silently ignore it. Similarly, if you delete a field that has a default, reader schemas that do have the field should continue to see the field - it'll just be given the default value in all cases.
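To illustrate that resolution behaviour with the avro gem directly (record and field names here are made up for the example):

```ruby
require 'avro'
require 'stringio'

# Evolved writer schema: adds an optional "note" field with a default.
writer_schema = Avro::Schema.parse(<<~JSON)
  {"type": "record", "name": "Order", "fields": [
    {"name": "id", "type": "int"},
    {"name": "note", "type": "string", "default": ""}
  ]}
JSON

# Old reader schema: does not know about "note".
reader_schema = Avro::Schema.parse(<<~JSON)
  {"type": "record", "name": "Order", "fields": [
    {"name": "id", "type": "int"}
  ]}
JSON

# Encode a record with the new writer schema.
buffer = StringIO.new
Avro::IO::DatumWriter.new(writer_schema)
                     .write({ 'id' => 1, 'note' => 'hello' }, Avro::IO::BinaryEncoder.new(buffer))

# Decode with the old reader schema: resolution drops the unknown field.
buffer.rewind
datum = Avro::IO::DatumReader.new(writer_schema, reader_schema)
                             .read(Avro::IO::BinaryDecoder.new(buffer))
datum # => { "id" => 1 }  ("note" is silently ignored)
```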

I'm a little confused here though - does your reader schema live in the repo or not? If not, how do you retrieve it and how is it confirmed to be different than the writer schema?

iMacTia commented 1 year ago

It does live in our repo, but it's just a copy. The original lives in the schema registry and we need to update our copy after each schema evolution. The issue we have happens if a message is produced with the new schema before our local copy is updated.

If you add a new field, and that field has a default, reader schemas that don't have that field should silently ignore it

So you think this issue might still be relevant?

dorner commented 1 year ago

It might be. The question really is "why are you getting this new field if you have NOT yet updated your reader schema?" The resolved payload should not have the new field if the reader schema hasn't been updated.

iMacTia commented 1 year ago

The resolved payload should not have the new field if the reader schema hasn't been updated.

Aha, I see what you mean now. I think I know why. We are not using Deimos consumers directly, we wanted to send multiple messages (from multiple topics) to the same consumer class. So we're using Phobos directly to consume and AvroTurf::Messaging to deserialize.

I suspect AvroTurf::Messaging is connecting directly to the schema registry to deserialize a message, picking whatever version of the schema was used to serialize the message. This might differ from the local copy, which we used to generate the schema class.

Apologies, I should have mentioned this sooner. So basically we have a local copy of the reader schema that we use to generate the schema class, but not to deserialize the message. For that, we request the schema directly from the schema registry at runtime.

Appreciate this might be quite a custom setup. Based on what you said it seems like this issue wouldn't arise if we used Deimos consumers with the local version of the schema 👍
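For context, a rough sketch of that setup (the handler, schema class, and registry URL are placeholders, not the actual code):

```ruby
require 'avro_turf/messaging'

class OrdersHandler
  include Phobos::Handler

  # Registry-backed decoder: it fetches whatever writer schema the
  # producer used, looked up by the schema id embedded in the message.
  AVRO = AvroTurf::Messaging.new(registry_url: 'http://registry.example.com')

  def consume(payload, _metadata)
    decoded = AVRO.decode(payload)

    # The generated schema class was built from the (now stale) local
    # copy, so the extra field raises ArgumentError here.
    Schemas::Order.new(**decoded.transform_keys(&:to_sym))
  end
end
```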

dorner commented 1 year ago

Ah OK. I'd definitely recommend using AvroTurf instead of AvroTurf::Messaging to deserialize the message to ensure it's using your local copy.
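A minimal sketch of what pinning the reader schema to the local copy could look like. One assumption to flag: plain AvroTurf#decode expects Avro data-file encoded payloads; if the messages use the Confluent wire format, AvroTurf::Messaging#decode also accepts a local schema_name to use as the reader schema, which achieves the same effect. Paths and schema names below are placeholders:

```ruby
require 'avro_turf'
require 'avro_turf/messaging'

# `payload` is the raw message value from the consumer.

# Plain AvroTurf: never talks to the registry, always resolves against
# the schema checked into the repo (expects data-file encoded payloads).
avro = AvroTurf.new(schemas_path: 'app/schemas')
decoded = avro.decode(payload, schema_name: 'order')

# Confluent wire format: keep AvroTurf::Messaging to look up the writer
# schema by id, but pin the *reader* schema to the local copy.
messaging = AvroTurf::Messaging.new(
  registry_url: 'http://registry.example.com',
  schemas_path: 'app/schemas'
)
decoded = messaging.decode(payload, schema_name: 'order')
```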

iMacTia commented 1 year ago

I'd definitely recommend using AvroTurf instead of AvroTurf::Messaging to deserialize the message to ensure it's using your local copy.

That is a really interesting suggestion, I'm gonna look into that. Thank you 🙏!