Open willdonnelly opened 2 weeks ago
@jgraettinger I'm intrigued by the idea you mentioned during sprint planning, of using the capture creation timestamp as a sort of implicit "feature version" for should-be-trivial-but-we-worry-about-backwards-compatibility changes like this. Is that something we currently have access to, or something that would have to be added to the protocol+runtime?
It's available in a very ham-fisted way. Current CaptureSpec's shard_template.id
's encode the publication ID which first created the task as the last path component, and that publication ID embeds a timestamp and can be ordered lexicographically. But, we also still have captures which pre-date the control-plane change which introduced this convention, and those shard_template ID's don't have it.
Having it available as a structured field would require new control plane and protocol plumbing.
A general task creation timestamp for enabling certain behaviors would be generally useful as well I think, beyond this one particular case. For example, we might want to eventually to change how we quote/capitalize Snowflake columns (see this discussion). There's also a couple of things about our BigQuery materialization I wish we could do differently - we've historically had to transform column names, but now we could get away with not doing that, and there's the issue with creating array/object columns as TEXT instead of JSON.
Materializations are a little different since it is at least possible to imagine a mechanism where we introspect existing tables & columns and try to infer what conventions to use for the materialization, but I think the simplicity of "tasks created after this date do this thing" is appealing.
Currently we serialize Postgres
date
columns as an RFC3339 date-time in the capture output. In cases where people care at all, they would probably prefer for us to preserve them as purely dates rather than adding a fictitious time component.In one sense this is a trivial change: modify the discovery type mapping code to emit
type: string, format: date
instead offormat: date-time
and the value translation code to serialize these column values as dates instead of RFC3339 timestamps. It'd take about 10 minutes and most of that is double-checking how exactly we receive these values from the Postgres client library.Unfortunately, there's an unknown number of preexisting captures which currently capture dates as timestamps, and this would be a breaking change to all of them. Thus our options are:
format: date-time
for the moment.date
columns and hope really hard that they're few/small enough that we'd be willing to just re-backfill all of them end to end.date
columns) and then implement the new behavior so that future captures will always have them beformat: date
.