estuary/connectors

Connectors for capturing data from external data sources

SQL CDC: handle dropped and renamed tables #1884

Open · jgraettinger opened 2 weeks ago

jgraettinger commented 2 weeks ago

Currently, when a table being captured by a SQL CDC connector is dropped or renamed, we immediately error out and require someone to manually increment the backfill counter or disable the binding before we'll allow progress again. But in practice, 99% of the time a dropped table is immediately recreated, and a renamed table has some staging table renamed into its old name. So it's pretty dumb that we require the user to go push that button manually when we could just do it automatically for them.

We should instead automatically reset a table back to "Pending" when we see a drop or rename event which causes it to no longer exist, and then restart the capture task (probably at the end of the current stream-to-fence cycle, so that if multiple tables are dropped/renamed we handle them all in a single restart). Once the task comes back up, the capture will catch up on replication and then start a new backfill of the table, as though it were a newly added binding.
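
To make the intended shape concrete, here's a minimal Go sketch of the reset-and-coalesce behavior, assuming hypothetical names throughout (`TableMode`, `captureState`, `handleDropOrRename`, and so on are invented for illustration and don't correspond to the actual connector code):

```go
package main

import "fmt"

// TableMode stands in for the per-binding state a CDC connector tracks.
// Hypothetical names: the real connector's state machine differs.
type TableMode string

const (
	ModeActive  TableMode = "Active"  // streaming replication events
	ModePending TableMode = "Pending" // awaiting a fresh backfill
)

type captureState struct {
	Tables       map[string]TableMode
	needsRestart bool // restart at the end of the current stream-to-fence cycle
}

// handleDropOrRename resets a table to Pending when a DROP TABLE or
// RENAME TABLE event makes it disappear, rather than erroring out.
func (s *captureState) handleDropOrRename(table string) {
	if _, ok := s.Tables[table]; !ok {
		return // not a table we're capturing
	}
	s.Tables[table] = ModePending
	// Set a flag instead of restarting immediately, so multiple
	// drops/renames within one cycle coalesce into a single restart.
	s.needsRestart = true
}

// endOfFenceCycle runs once the current stream-to-fence cycle completes;
// a single restart then picks up every Pending table at once.
func (s *captureState) endOfFenceCycle() {
	if s.needsRestart {
		fmt.Println("restarting capture; Pending tables will re-backfill")
		s.needsRestart = false
	}
}

func main() {
	s := &captureState{Tables: map[string]TableMode{
		"public.foo": ModeActive,
		"public.bar": ModeActive,
	}}
	s.handleDropOrRename("public.foo")
	s.handleDropOrRename("public.bar")
	s.endOfFenceCycle() // one restart covers both resets
}
```

The coalescing flag is the important bit: N tables dropped in one cycle should cost one restart, not N.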

If that's all we do, then a table which is dropped and not recreated will still produce a "table ___ is a configured binding of this capture, but doesn't exist or isn't visible with current permissions" prerequisite validation error. I think this is probably still correct, because a state in which a configured binding of the capture is silently doing nothing seems pretty counterintuitive to the user, but I can see arguments either way.

One possible compromise would be to keep checking table visibility as a prerequisite when validating a task spec, but remove that check from the running capture and instead verify table existence/visibility only when we go to actually start the backfill. This would mean that a missing table causes regular task failures (which would show up in the UI) but doesn't block ongoing replication. We could even go a step further and change the table-activation logic so that we don't error out until/unless replication is caught up and all ongoing backfills are complete, if we really want to minimize the impact of a dropped table here.
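
As a rough illustration of that compromise, here's a hedged Go sketch in which the existence/visibility check moves out of spec validation and runs only when a backfill actually starts. Everything here (`catalog`, `fakeCatalog`, `startBackfill`) is invented scaffolding, not the connector's real API; a real implementation would query `information_schema` through the database driver:

```go
package main

import "fmt"

// catalog abstracts "is this table visible right now?" so the sketch
// stays driver-free; a real connector would query information_schema.
type catalog interface {
	TableVisible(schema, table string) (bool, error)
}

// fakeCatalog is a stand-in keyed by "schema.table".
type fakeCatalog map[string]bool

func (c fakeCatalog) TableVisible(schema, table string) (bool, error) {
	return c[schema+"."+table], nil
}

// startBackfill checks existence only at the moment the backfill begins,
// rather than as a spec-validation prerequisite. A missing table thus
// produces an ordinary task failure (visible in the UI) without blocking
// ongoing replication of the other bindings.
func startBackfill(c catalog, schema, table string) error {
	ok, err := c.TableVisible(schema, table)
	if err != nil {
		return err
	}
	if !ok {
		return fmt.Errorf("table %s.%s doesn't exist or isn't visible with current permissions", schema, table)
	}
	// ...run the actual backfill...
	return nil
}

func main() {
	c := fakeCatalog{"public.foo": true}
	fmt.Println(startBackfill(c, "public", "foo")) // <nil>
	fmt.Println(startBackfill(c, "public", "bar")) // task failure; replication unaffected
}
```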

willdonnelly commented 2 weeks ago

One thing which occurs to me: once we have a concept of marking tables as no longer usable, we could fairly easily plug that into some of the MySQL "metadata incorrect" sanity-checking paths, so that the occasional metadata inconsistencies we see (which we currently believe are caused by explicitly unlogged DDL alteration queries) result in the offending table being automatically reset and re-backfilled.
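
In the same hypothetical terms as the sketches above, the wiring could be as small as having the MySQL row-metadata sanity check call the reset path instead of hard-failing; `resetTable` and `checkRowMetadata` are invented names for illustration:

```go
package main

import "fmt"

// resetTable marks a binding for automatic re-backfill, standing in for
// the "reset to Pending and restart" mechanism proposed in this issue.
func resetTable(pending map[string]bool, table string) {
	pending[table] = true
	fmt.Printf("table %s reset; it will re-backfill after the next restart\n", table)
}

// checkRowMetadata sketches one MySQL sanity check: if a replicated row
// doesn't match the cached column metadata (e.g. after an unlogged DDL
// alteration), reset the offending table rather than failing the task.
func checkRowMetadata(pending map[string]bool, table string, gotCols, wantCols int) {
	if gotCols != wantCols {
		resetTable(pending, table)
	}
}

func main() {
	pending := map[string]bool{}
	// A row event arrives with 5 columns for a table we believe has 4:
	checkRowMetadata(pending, "db.widgets", 5, 4)
}
```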