MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com

storage/sources: provide context on upstream schema versioning #26975

Open morsapaes opened 4 months ago

morsapaes commented 4 months ago

Feature request

The schema of a source is set on creation. As the upstream schema changes, Materialize either tolerates these changes or puts the source in an errored state, but doesn't explicitly flag that a newer schema version is available. To help identify (sub)sources that have newer schemas available, we should provide users with context on upstream schema versioning.

Once this is available in the system catalog, we can use it to annotate (sub)sources in the Console, as a first step. cc @ggnall @parkerhendo

morsapaes commented 3 months ago

Noting down a few thoughts after discussing with the team:

From @benesch:

I'm not entirely certain about the implementation here. It may be simpler/better/easier to periodically ping the upstream system to get the table's current schema and diff it against the schema in Materialize (sketched at the end of this comment).

From @guswynn:

For PostgreSQL and MySQL, we might want to wait until sources and subsources get untangled on the statistics side before implementing this. Gus will get an estimate for the Kafka bit, which is the most pressing user need in the short term.
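For concreteness, here's a minimal sketch of the poll-and-diff idea from @benesch. Every name in it is hypothetical (none of these types or methods exist in the Materialize codebase); it only illustrates the shape of the loop:

```rust
use std::time::Duration;

/// Hypothetical stand-in for however a source type describes an upstream schema.
#[derive(PartialEq)]
struct UpstreamSchema(String);

/// Hypothetical handle to the upstream system and the source statistics sink.
trait SourceClient {
    /// Ask the upstream system for the table's current schema.
    fn fetch_upstream_schema(&self) -> Result<UpstreamSchema, String>;
    /// Best-effort write of the schema_outdated statistic.
    fn report_schema_outdated(&self, outdated: bool);
}

/// Periodically re-fetch the upstream schema and diff it against the schema
/// Materialize captured when the source was created.
fn poll_upstream_schema(client: &dyn SourceClient, created_with: &UpstreamSchema) {
    loop {
        // A failed fetch is fine: the flag only needs to be eventually
        // consistent, so we simply try again on the next tick.
        if let Ok(current) = client.fetch_upstream_schema() {
            client.report_schema_outdated(current != *created_with);
        }
        std::thread::sleep(Duration::from_secs(300));
    }
}
```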

benesch commented 2 months ago

Linking in a discussion that's currently occurring on Slack: https://materializeinc.slack.com/archives/C075QPD99PV/p1718079421446359

A design that seems appealingly lightweight is adding a schema_outdated: boolean column to the mz_source_statistics table.

My main concern is that this boolean should be self-healing (i.e., eventually consistent). It's okay if the boolean is somewhat stale. The use case is advising the user when they need to update their sources; it doesn't need to be transactional or anything. But it should eventually reflect the true state of whether a new upstream schema is available.

So, we should be careful that the solution here doesn't lose knowledge of a new schema due to an ill-timed crash. The source stats machinery is all best effort, so it's possible for us to, e.g., see a Relation message, send a stats update that attempts to set schema_outdated: true, commit the Relation message's LSN, and then crash before the stats update is written. On reboot we'll still be reporting schema_outdated: false, but we won't see that Relation message again, so we might never set schema_outdated: true.

I think the answer here is pretty straightforward: each source type should have a check_schema_outdated method that knows how to compare Materialize's state to the upstream's state. We can run that check in the background once on source startup, then again every few minutes or so, and, if we want to get fancy, we can trigger checks when we see events that indicate the existence of a new schema (e.g., a Relation message from a Postgres source, or a Kafka message with a schema ID we haven't seen before). We could even split the method into fetch_latest_schema and check_schemas_match methods, so that you don't need to call fetch_latest_schema in the case of receiving a Relation message from Postgres, which I think fully describes the new schema on its own, though I'm not sure it's worth the trouble.
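As a rough illustration, here's a minimal sketch of that split. The trait and method names are hypothetical, not existing Materialize APIs; it only shows how the startup/periodic path would compose the two methods while an event-triggered path could skip the fetch:

```rust
/// Hypothetical sketch of a per-source-type schema version check.
trait SchemaVersionCheck {
    /// However this source type represents an upstream schema.
    type Schema;

    /// Ask the upstream system for the table's current schema (a Postgres
    /// catalog query, a schema registry lookup for Kafka, etc.).
    fn fetch_latest_schema(&self) -> Result<Self::Schema, Box<dyn std::error::Error>>;

    /// Compare an upstream schema against the schema Materialize captured
    /// when the source was created.
    fn check_schemas_match(&self, upstream: &Self::Schema) -> bool;

    /// Composition used for the startup check and the periodic background
    /// check. Event-triggered paths that already carry the new schema (e.g.
    /// a Postgres Relation message) can call check_schemas_match directly
    /// and skip the fetch.
    fn check_schema_outdated(&self) -> Result<bool, Box<dyn std::error::Error>> {
        let upstream = self.fetch_latest_schema()?;
        Ok(!self.check_schemas_match(&upstream))
    }
}
```

Because check_schema_outdated recomputes the answer from scratch, rerunning it on startup also covers the crash scenario above: a lost stats update gets repaired by the next check.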

One other consideration: Kafka sources have two schemas, one for the key and one for the value. I think it's fine (good and simplifying, even!) to roll those schemas up into a single boolean, but we should double-check.
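If we do roll them up, the check itself stays trivial. A hedged sketch (the schema-ID plumbing is hypothetical; the IDs would come from Schema Registry lookups against the topic's key and value subjects):

```rust
/// Collapse Kafka's key and value schema checks into the single
/// schema_outdated flag. Hypothetical sketch, not an existing API.
fn kafka_schema_outdated(
    known_key_id: Option<i32>,
    latest_key_id: Option<i32>,
    known_value_id: Option<i32>,
    latest_value_id: Option<i32>,
) -> bool {
    // Flag the source if either side has moved on; a side the source
    // doesn't use (None on both ends) stays quiet.
    known_key_id != latest_key_id || known_value_id != latest_value_id
}
```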

bosconi commented 2 months ago

@rjobanp this would be a good one to pick up first when you return to schema-change-related work.