estuary / connectors

Connectors for capturing data from external data sources

source-postgres: Additional replication slot validation, take two #1679

Closed willdonnelly closed 4 days ago

willdonnelly commented 1 week ago

Description:

This is a revert of a revert, plus a fix for the bug which required the revert. The original PR, previously merged as https://github.com/estuary/connectors/pull/1653, just added some checking around Postgres replication slot validity so that we can give the user friendlier error messages when something goes wrong.

However, the original change failed because one of the replication slot info columns we request is wal_status, which turns out to have been added only in Postgres 13. We definitely do want that information when it's available: when wal_status = lost we want to give the user a friendly error message telling them to drop the slot and backfill everything. The error they currently see is significantly less friendly: ERROR: cannot read from logical replication slot "flow_slot": This slot has been invalidated because it exceeded the maximum reserved size. That message is entirely accurate, but it doesn't give a non-expert any hints about how to fix the situation.
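To make the intended behavior concrete, here is a minimal sketch of turning a wal_status value into a friendlier message. It is written in Python purely for illustration and is not the connector's actual code; the function name friendly_slot_error and the message wording are assumptions.

```python
from typing import Optional


def friendly_slot_error(slot_name: str, wal_status: Optional[str]) -> Optional[str]:
    """Return a user-facing error if the slot is unusable, else None.

    Hypothetical helper: the name and message text are illustrative only.
    """
    # On servers older than Postgres 13 the wal_status column does not exist,
    # so the caller is assumed to pass None (or some default) in that case.
    if wal_status == "lost":
        return (
            f'replication slot "{slot_name}" has been invalidated '
            "(wal_status = lost): drop the slot and backfill all tables to recover"
        )
    # Any other status ("reserved", "extended", "unreserved", or unknown)
    # is not treated as a fatal condition in this sketch.
    return None
```

Only the lost status is treated as fatal here; whether other statuses deserve a warning is left open.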

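This has been fixed. It's a hacky little kludge, but coalesce(row_to_json(tbl)->>'column_name'::text, 'default') does the trick for a conditional-column-select-with-default expression: the column is never referenced by name, so the query still parses on servers that lack it, and the missing JSON key just yields NULL, which coalesce replaces with the default. I have verified that nothing else obvious breaks when run against Postgres 10 (the earliest we support).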

But once burned, twice shy: I'm probably going to spend some more time testing this before merging again. We've got a couple of Postgres tasks in production that are on a bit of a hair trigger when it comes to downtime turning into permanent failures. So I'll probably also be extra careful with deploying this change, merging it in the morning and then issuing a rolling restart of all Postgres captures so I can make sure there aren't any new failures.

