feat(ChangeFeedProcessor): Programmatically delete checkpoint/trigger retraversal

bartelink commented 5 years ago

Is your feature request related to a problem? Please describe.

As detailed in https://github.com/Azure/azure-documentdb-changefeedprocessor-dotnet/issues/123 there's no programmatic way provided in the API to force a restart of a CFP projection's lease data

Describe the solution you'd like

V1 used to expose a delete leases API - that would work for the use cases I'm aware of.

Describe alternatives you've considered

Without a programmatic interface like this, one is reduced to interactively messing with lease documents, and/or writing code that couples to implementation details such as the id and/or Partition Key associated with a lease.

The only other workaround is to mint a new lease id and supply the desired arguments when that's created. This is hugely problematic when running multiple leases and/or having multiple containers being projected (i.e. I can't/don't want to be maintaining some mapping in Consul, git or anything else that says that for container 3 we're using the default2 projection because we wanted to reset it)

Additional context

https://github.com/Azure/azure-documentdb-changefeedprocessor-dotnet/issues/123

kirankumarkolli commented 5 years ago

@bartelink can you please share more context on why processing needs to be reset?

bartelink commented 5 years ago

If I want to blow away a projection and have it re-filled - imagine I'm indexing into a SQL DB by visiting each doc in order. Alternate example - I emit everything to Kafka and feed from there, but the lifetime is 7 days on the topic - if I want to have the Kafka consumers reprocess from the start, I need to re-emit all the data. While it's always possible to have an all, all2 and all3 leaseId, I'm then left keeping track of which collection has which primary leaseId.

The issue is that, while there is the ability to specify the start point in minute detail when the lease is first created, this is lost the second the lease is established. While it's theoretically easy for me to write code to manipulate the lease document, doing so would couple me to something that's encapsulated.

As noted in the linked issue, this need was originally catered for in the V1 CFP APIs, but the way in which it was exposed was deemed problematic (presumably due to race conditions and/or it not addressing an actual need).

Longer version: When reading from EventStore, I maintain checkpoints like this. I use this to implement controls like this, driven by logic like this. I'm not asking for this level of control (though I can imagine others having uses for it), merely the facility to kill a lease. Once I have such a facility, the API affords enough control to be able to provision the new one as intended (even if it's only a choice of start from 'here' vs 'start from the beginning').

kirankumarkolli commented 5 years ago

Is the idea that the new API would be used in an out-of-band tool to reset the leases? Possibly ensuring that the processor is not running during this process.

bartelink commented 5 years ago

Yes, stopping all consumers would be a given for such an out-of-band explicit reset; there is no expectation that the CFP consumers coordinate to shift to some specific new state (which seemed to be implied in the previous APIs though not called out as a specifically defined set of behaviors)

ealsur commented 3 years ago

@bartelink Is this still needed?

bartelink commented 3 years ago

Yes, from my perspective there continues to be the same need to be able to ask the Change Feed Processor system to purge its state.

My reasoning for this is that:

CFP logic in both CFP2 and Microsoft.Azure.Cosmos owns the writing, naming and the contract of the leases+checkpoints - it's a black box
therefore that system should provide a mechanism to clean up its state

The only real workarounds I am aware of are:

have one set of leases per aux container; whack it and start again (but that wastes capacity and is not compatible with usages where there are an interesting number of CFPs running against a given monitored container)
understand the naming and format and go hacking in there (and programmatic equivalents of that then become Cosmos SDK version dependent)
keep coining new version-suffixed editions of the LeaseId (aka consumer group name) and/or generate ephemeral ones each time (but that leaves dead state and can be problematic)
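
To make the cost of the third workaround concrete, here is a minimal Python sketch. It is not SDK code: the lease container is a plain dict, and the `<leaseId>..<partition>` document id layout and all function names are invented for illustration. It shows that "resetting" by renaming leaves the old documents behind and forces the caller to remember the current suffix somewhere.

```python
# Hypothetical in-memory stand-in for a lease container: {document id: continuation}.
# The "<leaseId>..<partition>" id layout is an assumption, not the SDK's real format.
lease_container = {}

def start_processor(lease_id, partitions):
    """Create a lease document per partition if absent, as a first start would."""
    for partition in partitions:
        lease_container.setdefault(f"{lease_id}..{partition}", None)

def checkpoint(lease_id, partition, continuation):
    """Record progress for one partition under the given lease id."""
    lease_container[f"{lease_id}..{partition}"] = continuation

def next_lease_id(base_name, version):
    """'Reset' by coining a new suffixed lease id; nothing is deleted."""
    return f"{base_name}-v{version + 1}", version + 1

def orphaned_documents(current_lease_id):
    """Lease documents that no longer belong to the active lease id."""
    prefix = f"{current_lease_id}.."
    return sorted(key for key in lease_container if not key.startswith(prefix))

partitions = ["0", "1"]
lease_id, version = "all-v1", 1
start_processor(lease_id, partitions)
checkpoint(lease_id, "0", "token-42")

# The "reset": the caller must now persist `version` somewhere external.
lease_id, version = next_lease_id("all", version)
start_processor(lease_id, partitions)
print(orphaned_documents(lease_id))
```

Every reset adds another generation of dead documents, and the active suffix has to be tracked outside the lease container, which is the mapping burden described above.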

So yes, a basic API to delete all leases and checkpoints would be very welcome indeed. The specific place I'd make use of it is in this dotnet tool

it presently has a propulsion init cosmos -c container feature to generate a fresh Lease Container
I would add a propulsion destroy-leases --leaseId=MyLease cosmos -c container that would call this feature

This would allow one to replace existing workflows where test rigs generate ephemeral lease ids and do lots of juggling to make that work.

jbockle commented 1 year ago

is this on the roadmap? also using event sourcing with multiple change feed processors, no idea how to replay events without creating a new processor name and orphaning the existing lease/checkpoint.

ealsur commented 1 year ago

@jbockle If such an API existed, how would you use it? How would you coordinate that an instance calls this API while the other instances are running? Would you turn everything off, call this API and then turn it on?

jbockle commented 1 year ago

@ealsur Sorry, missed your reply - similar to @bartelink's propulsion CLI approach, initially I would probably stop all instances, delete the lease and associated projections, then deploy instances and/or restart the processor instances. Eventually I would like to bake this into the runtime so I can do this without stopping my instances - I would have to track which lease each instance is currently using and their state, stop the processors on all the instances, delete the lease/associated projections, then restart the processor instances.
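
The stop, delete, restart sequence described above can be sketched as follows. This is a Python simulation under stated assumptions, not SDK code: `Instance`, `delete_leases`, `reset_out_of_band` and the `<leaseId>..<partition>` id layout are all hypothetical, and the lease container is a plain dict.

```python
class Instance:
    """Hypothetical stand-in for one host running a processor."""
    def __init__(self, name):
        self.name = name
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

def delete_leases(store, lease_id):
    """Remove every document for lease_id from the dict; return how many were removed."""
    prefix = f"{lease_id}.."
    doomed = [key for key in store if key.startswith(prefix)]
    for key in doomed:
        del store[key]
    return len(doomed)

def reset_out_of_band(instances, store, lease_id):
    """Stop everything, verify it stopped, delete the leases, then restart."""
    for instance in instances:
        instance.stop()
    # Refuse to delete while anything is still running: this check is the
    # coordination step that a bare delete API could not enforce on its own.
    if any(instance.running for instance in instances):
        raise RuntimeError("an instance is still running; aborting reset")
    removed = delete_leases(store, lease_id)
    for instance in instances:
        instance.start()
    return removed

store = {"MyLease..0": "token-7", "MyLease..1": "token-9", "Other..0": "token-1"}
instances = [Instance("a"), Instance("b")]
for instance in instances:
    instance.start()

removed = reset_out_of_band(instances, store, "MyLease")
print(removed, sorted(store))
```

Leases belonging to other lease ids in the same container are left untouched, which matters when several processors share one lease container.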

ealsur commented 1 year ago

The problem is that, without all that coordination, misuse of such an API can generate a splash event. If the user just calls this API with a bunch of instances already running, it will cause an inconsistency: those running instances could be consuming the Change Feed with some continuation, and when they next attempt to update the lease, the update will fail and there is no recovery. Another case is where all the instances call the API at the same time: one instance could be trying to start after resetting while another (slower) one wipes the starting state mid-flight. It requires a very well coordinated process that is external to the API itself, opening the door for issue reports of "I used this API and it had this or that problem", which then require debugging until the root cause turns out to be bad coordination (which depends on how well the user logged their current coordination, if any).

bartelink commented 1 year ago

Does/how does the CFP lib logic handle deletes at present? i.e. if the leases and checkpoints get removed, is the logic able to recover from that and restart per the config (be that from the start or a specific start time etc.)? For usages I'm aware of, relatively straightforward semantics would suffice: "if the lease/position data is removed, the behavior is equivalent to when a read is triggered without a document present on startup". The uncovered case would be that a checkpointing operation should not then succeed (i.e. it should not write an updated checkpoint position when a restart/reinitialization should have been triggered).

It's easy for me to say as a commenter without true skin in the game, but these semantics should probably be pinned in the test suite, and likely already are?

ealsur commented 1 year ago

If the lease is deleted while being processed, then the checkpoint will fail. The checkpoint is a Replace operation, so it will fail with a 404. This causes the running Task to stop. The SDK then attempts to release the lease, which fails again (Replace => 404), logs the error through the Notification APIs, and stops the Task. After some time (Acquire time), the lease container gets scanned to see if any leases are up for taking, and it will see none. Eventually, the processor will hit 404 on all leases and release them, but it is not a deterministic process: you cannot tell when the whole process will complete.
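
The failure path described above can be illustrated with a small Python simulation. It is not the SDK's implementation: `LeaseStore`, `NotFound` and `checkpoint_or_stop` are hypothetical names, and the only behaviour modelled is that Replace fails when the document no longer exists, so a checkpoint can never resurrect a deleted lease.

```python
class NotFound(Exception):
    """Simulated 404 from the lease store."""
    status_code = 404

class LeaseStore:
    """Hypothetical in-memory lease store with Replace (not Upsert) semantics."""
    def __init__(self):
        self.docs = {}

    def create(self, doc_id, body):
        self.docs[doc_id] = dict(body)

    def replace(self, doc_id, body):
        # Replace only succeeds if the document is still there.
        if doc_id not in self.docs:
            raise NotFound(doc_id)
        self.docs[doc_id] = dict(body)

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)

def checkpoint_or_stop(store, doc_id, owner, continuation, log):
    """Try to checkpoint; on 404, try to release, log both failures, and stop.

    Returns True if the worker may keep processing, False if it must stop.
    """
    try:
        store.replace(doc_id, {"owner": owner, "continuation": continuation})
        return True
    except NotFound:
        log.append("checkpoint failed: 404")
    try:
        store.replace(doc_id, {"owner": None})  # releasing is also a Replace
    except NotFound:
        log.append("release failed: 404")
    return False

store, log = LeaseStore(), []
store.create("lease-0", {"owner": "host-a", "continuation": None})

first = checkpoint_or_stop(store, "lease-0", "host-a", "token-1", log)
store.delete("lease-0")  # out-of-band delete while host-a is still processing
second = checkpoint_or_stop(store, "lease-0", "host-a", "token-2", log)
print(first, second, log)
```

Note that the worker only learns about the delete at its next checkpoint, which is why the time at which all running instances have noticed cannot be bounded from the caller's side.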

Having an API that deletes all leases does not guarantee that after the method completes, the running processors, if any, are reset. It requires coordination of instances.

Azure / azure-cosmos-dotnet-v3

feat(ChangeFeedProcessor): Programmatically delete checkpoint/trigger retraversal #510