estuary / connectors

Connectors for capturing data from external data sources

captures: strategy for account for historical schema decisions of an active capture #1994

Open willdonnelly opened 2 weeks ago

willdonnelly commented 2 weeks ago

Currently we serialize Postgres date columns as an RFC3339 date-time in the capture output. In cases where people care at all, they would probably prefer for us to preserve them as purely dates rather than adding a fictitious time component.

In one sense this is a trivial change: modify the discovery type mapping code to emit type: string, format: date instead of format: date-time and the value translation code to serialize these column values as dates instead of RFC3339 timestamps. It'd take about 10 minutes and most of that is double-checking how exactly we receive these values from the Postgres client library.

Unfortunately, there's an unknown number of preexisting captures which currently capture dates as timestamps, and this would be a breaking change to all of them. Thus our options are:

  1. Do nothing and stick with format: date-time for the moment.
  2. Try to establish how many capture bindings actually contain date columns and hope really hard that they're few/small enough that we'd be willing to just re-backfill all of them end to end.
  3. Implement some sort of advanced config toggle for "do the old thing" and then do a whole migration where we set that flag on all preexisting captures (or all of them which have date columns) and then implement the new behavior so that future captures will always have them be format: date.
  4. Generalize the problem and try to figure out whether there's any way we could make breaking datatype changes less painful in the future, possibly by building some sort of generic compatibility-flag mechanism, or even just a script to reliably bulk-edit a bunch of capture specs in production to set a new backwards-compatibility flag.
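Option 3 might look something like the sketch below. The field name `DatesAsTimestamps` and the helper `dateFormat` are purely illustrative placeholders, not the actual connector config:

```go
package main

import "fmt"

// advancedConfig is a hypothetical advanced-config section carrying the
// backwards-compatibility toggle.
type advancedConfig struct {
	// DatesAsTimestamps, when true, preserves the legacy behavior of
	// capturing date columns as RFC3339 date-times. A migration would set
	// this flag on all preexisting captures (or those with date columns).
	DatesAsTimestamps bool `json:"dates_as_timestamps,omitempty"`
}

// dateFormat selects the JSON schema `format` keyword for a Postgres `date`
// column based on the compatibility flag.
func dateFormat(cfg advancedConfig) string {
	if cfg.DatesAsTimestamps {
		return "date-time" // legacy behavior for migrated captures
	}
	return "date" // new default for captures created going forward
}

func main() {
	fmt.Println(dateFormat(advancedConfig{DatesAsTimestamps: true})) // date-time
	fmt.Println(dateFormat(advancedConfig{}))                        // date
}
```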
willdonnelly commented 1 week ago

@jgraettinger I'm intrigued by the idea you mentioned during sprint planning, of using the capture creation timestamp as a sort of implicit "feature version" for should-be-trivial-but-we-worry-about-backwards-compatibility changes like this. Is that something we currently have access to, or something that would have to be added to the protocol+runtime?

jgraettinger commented 1 week ago

It's available in a very ham-fisted way. Current CaptureSpec's shard_template.id's encode the publication ID which first created the task as the last path component, and that publication ID embeds a timestamp and can be ordered lexicographically. But, we also still have captures which pre-date the control-plane change which introduced this convention, and those shard_template ID's don't have it.
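If one were to lean on that convention anyway, the check might look like this sketch. The helper names and the example ID are hypothetical, and as noted above, a real implementation would have to detect and special-case captures predating the convention:

```go
package main

import (
	"fmt"
	"strings"
)

// publicationID extracts the last path component of a shard template ID,
// which (for tasks created after the control-plane change) encodes the
// publication ID that first created the task.
func publicationID(shardTemplateID string) string {
	parts := strings.Split(shardTemplateID, "/")
	return parts[len(parts)-1]
}

// createdAfter reports whether the task's embedded publication ID sorts
// lexicographically after a cutoff publication ID. Publication IDs embed a
// timestamp, so lexicographic order tracks creation order.
func createdAfter(shardTemplateID, cutoffPubID string) bool {
	return publicationID(shardTemplateID) > cutoffPubID
}

func main() {
	// Illustrative ID shape only; not a real shard template ID.
	id := "capture/acmeCo/source-postgres/0011223344556677"
	fmt.Println(publicationID(id))
}
```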

Having it available as a structured field would require new control plane and protocol plumbing.

williamhbaker commented 1 week ago

A general task-creation timestamp for gating certain behaviors would be broadly useful as well, I think, beyond this one particular case. For example, we might eventually want to change how we quote/capitalize Snowflake columns (see this discussion). There are also a couple of things about our BigQuery materialization I wish we could do differently - we've historically had to transform column names, but now we could get away with not doing that, and there's the issue with creating array/object columns as TEXT instead of JSON.

Materializations are a little different since it is at least possible to imagine a mechanism where we introspect existing tables & columns and try to infer what conventions to use for the materialization, but I think the simplicity of "tasks created after this date do this thing" is appealing.