MeltanoLabs / Singer-Working-Group

Working group for ongoing development and iteration of the Singer Spec, the de-facto protocol for open source data connectors. Please use "Issues" to create discussion items - or use "Discussions" for general questions.
Apache License 2.0
13 stars 4 forks source link

Reference implementation for `tail=true` setting, enabling indefinitely-running, interruptible processes #7

Open aaronsteers opened 2 years ago

aaronsteers commented 2 years ago

For streaming use cases, an always-on tap should be able to "tail" the data source and continually run the extract. This is possible today without changes to the spec, but there is not yet any guidance or "best practice" in terms of how to accomplish this.

A reference implementation would likely entail one or more of the following:

  1. A setting like tail (boolean) which, when true would put the tap into an infinite loop of extract, sleep, extract, and so on.
  2. To honor resumability on interrupt, something like a state_message_min_cadence setting could be established, which forces a state message to flush at least every 1-5 minutes, for instance, assuming 1 or more records have been sent since the last STATE message.
  3. For incremental streams, something like a new_record_polling_interval setting could indicated how much time to sleep between calls to check for new records.
  4. To keep non-incremental streams whole while running a tap in tail mode, we'll want to have a max_full_table_age or similar, which would drive new extracts of streams pulled with FULL_TABLE after the set amount of time has elapsed.
  5. Receiving a SIGTERM message (before SIGKILL) should immediately trigger a flush the latest STATE message. In an ideal scenario, sending the termination command (like ctrl+c) should give the tap a few seconds to send on the last viable state bookmark and give an optimal chance of not having to repeat records on a subsequent execution.

Note: While the priority here is probably the tail setting, it should be fair to assume that a method that needs to run indefinitely should also be ready to be interrupted at any point; since the only way to "stop" the stream is to kill the process, we will need to be ready for interruption and plan for how STATE tracking will be affected by the worst-case scenario interruption timing.

aaronsteers commented 2 years ago

@pandemicsyn has graciously offered to draft a proposal doc for this item. 🙌

alexandersack-iex commented 1 year ago

Hello, I really need this and plan to implement my own version of it in my tap. Where is the current status of this issue?

We have desire to run a tap in "batch" mode to dump records as well as "streaming" (or "tail") mode which listens for new events and pushes them to the target accordingly.

Question, is there any reason the singer wrappers don't expose a simple optional REST interface (think management backplane) for start, stop, shutdown etc. instead of using signals which can get dicey depending on language/context?

So when a tap starts it also starts a simple REST interface that allows you to control the tap by sending simple JSON commands (messages, just like the spec does for RECORDs).