MeltanoLabs / Singer-Working-Group

Working group for ongoing development and iteration of the Singer Spec, the de-facto protocol for open source data connectors. Please use "Issues" to create discussion items - or use "Discussions" for general questions.
Apache License 2.0

Standardization of `ACTIVATE_VERSION` message #9

Open dmosorast opened 2 years ago

dmosorast commented 2 years ago

This is something that's been frequently buzzing about the community. The "activate version" message has been used by taps in the past to track the "version" of some data. The semantics of this message haven't been formalized, so I'd like to track that conversation here, and provide my historical knowledge on the subject.

I don't have a clear picture pulled together just yet; I'm just getting this topic pinned for now.

aaronsteers commented 2 years ago

@dmosorast Are there any signals earlier in the stream that the target can use to predict if the stream is later going to receive an ACTIVATE_VERSION message? This could influence how the target chooses to store records during the sync operation.

dmosorast commented 2 years ago

@aaronsteers That's a good point. I would say that part of the way we have implemented this in existing taps is that a tap that uses it should always send an ACTIVATE_VERSION message if it doesn't have a version saved in the stream's state. The effect of this as a signal to the target would be what you're suggesting, as well as a notification when a stream's state has been reset, and thus a new historical extraction is coming down the pipe.
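The tap-side rule described above can be sketched roughly as follows. This is a minimal illustration, not a specification: the helper name `ensure_stream_version`, the `bookmarks` state layout, and the use of a millisecond timestamp as the version value are all assumptions made for the example.

```python
import time


def ensure_stream_version(state, stream, now_ms=None):
    """Return (version, messages_to_emit) for a stream.

    Hypothetical helper: if no version is saved in the stream's state,
    pick a new one and queue an ACTIVATE_VERSION message so the target
    knows a fresh historical extraction is starting.
    """
    # Assumed state layout: {"bookmarks": {<stream>: {"version": <int>}}}
    bookmarks = state.setdefault("bookmarks", {}).setdefault(stream, {})
    messages = []
    if "version" not in bookmarks:
        # Assumption: a millisecond timestamp serves as the version value.
        if now_ms is None:
            now_ms = int(time.time() * 1000)
        bookmarks["version"] = now_ms
        messages.append(
            {"type": "ACTIVATE_VERSION", "stream": stream, "version": now_ms}
        )
    return bookmarks["version"], messages
```

With an empty state the helper returns one ACTIVATE_VERSION message; on later runs, when the version is already saved, it returns none. A state reset therefore shows up to the target as a new ACTIVATE_VERSION.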

I haven't had time to sit down, but the gist of my thoughts for outlining purposes here (while I'm here 😄 ) are:

Because it's a low-level signal, it doesn't support any specific feature so much as a specific behavior that the target can take in the context of its own destination. That said, the most common use is to bring hard deletes to an EL flow when a full dataset refresh is required.
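To make the hard-delete case concrete, here is one way a target could react. It is a sketch only: the `InMemoryTarget` class, the `id` primary key, and the choice to tag each stored row with the `version` carried on its RECORD message are assumptions for illustration, and a real target is free to behave differently.

```python
class InMemoryTarget:
    """Toy target that stores rows in a dict, keyed by primary key."""

    def __init__(self):
        self.rows = {}            # stream -> {pk: (version, record)}
        self.active_version = {}  # stream -> last activated version

    def handle(self, message):
        stream = message["stream"]
        table = self.rows.setdefault(stream, {})
        if message["type"] == "RECORD":
            # Assumption: records carry the version they belong to,
            # and "id" is the stream's primary key.
            pk = message["record"]["id"]
            table[pk] = (message.get("version"), message["record"])
        elif message["type"] == "ACTIVATE_VERSION":
            version = message["version"]
            # Rows not rewritten under the new version were deleted
            # upstream, so drop them (the "hard delete").
            self.rows[stream] = {
                pk: row for pk, row in table.items() if row[0] == version
            }
            self.active_version[stream] = version
```

For example, if version 1 loaded rows 1 and 2 and a later full refresh under version 2 only re-sends row 1, activating version 2 removes row 2 from the destination.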

aaronsteers commented 2 years ago

Super helpful to better understand the details here, @dmosorast. Thanks for this.

To answer my question from above in this context:

> Are there any signals earlier in the stream that the target can use to predict if the stream is later going to receive an ACTIVATE_VERSION message?

_Side note:_ If the tap changes upstream to no longer send ACTIVATE_VERSION messages, then the target table (or other tracking mechanism) may need to be reconfigured or recreated. But this is a niche case for target developers, which I don't think we need to solve for here.

dmosorast commented 2 years ago

I agree. The details of what the target developer is required to do to react to a version message, or how to track this state, are something that I'd like to avoid specifying. This is just a sort of state-like stream-level metadata.

The question of what to do if a table becomes non-versioned. That's probably what I would call a breaking change (like a PK change, or a column datatype change), which requires some case-specific means of handling. In this scenario, if a stream is known to no longer be "versioned", I would expect something like deleting the target table and backfilling with the tap to be the approach to get it "non-versioned", but that scenario is rather awkward.

Honestly, there's likely no harm in just tracking a version for every stream. It's just not something that has been common: many SaaS sources don't allow hard deletes, so we've chosen not to add the complexity in those cases.

aaronsteers commented 2 years ago

> The question of what to do if a table becomes non-versioned. That's probably what I would call a breaking change (like a PK change, or a column datatype change), which requires some case-specific means of handling.

Agreed. 👍 Very well put.

aaronsteers commented 2 years ago

Linking to related item #8, which would allow taps and targets to advertise that they support ACTIVATE_VERSION messages.