aws-samples / aws-stepfunctions-examples

AWS Step Functions is an orchestration service for reliably executing multi-step processes using visual workflows. This repository includes detailed examples that will help you unlock the power of serverless workflow.

The semaphore looks racy #3

Closed · kjpgit closed this issue 2 years ago

kjpgit commented 3 years ago

Correct me if I'm wrong, but:

The AWS network does not guarantee that the DynamoDB lock message won't be delayed or reordered. It could be processed after "cleanfromincomplete".

Combining that non-guarantee with a one-shot cleanup Lambda that doesn't lock out future lock requests seems very problematic. The lock could leak, and all progress would be halted.
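To make the race concrete, here's a minimal sketch (not this repo's actual implementation; the table, key, attribute names, and limit below are placeholders) of the two DynamoDB writes that can interleave badly:

```python
import boto3

ddb = boto3.client("dynamodb")

TABLE = "SemaphoreTable"   # hypothetical table name
LOCK_NAME = "MySemaphore"  # hypothetical lock item key
LIMIT = 5                  # hypothetical concurrency limit


def acquire_lock(execution_id: str) -> None:
    # An AcquireLock-style conditional update: record this execution as a
    # holder and bump the count, but only if the semaphore isn't full and we
    # don't already hold it. Assumes the lock item was initialized with
    # currentlockcount = 0.
    ddb.update_item(
        TableName=TABLE,
        Key={"LockName": {"S": LOCK_NAME}},
        UpdateExpression="SET #owner = :now ADD currentlockcount :one",
        ExpressionAttributeNames={"#owner": execution_id},
        ExpressionAttributeValues={
            ":now": {"S": "2023-01-01T00:00:00Z"},
            ":one": {"N": "1"},
            ":limit": {"N": str(LIMIT)},
        },
        ConditionExpression="currentlockcount < :limit AND attribute_not_exists(#owner)",
    )


def clean_from_incomplete(execution_id: str) -> None:
    # A cleanup-style update: remove the dead execution's entry and decrement
    # the count, if (and only if) the entry exists.
    ddb.update_item(
        TableName=TABLE,
        Key={"LockName": {"S": LOCK_NAME}},
        UpdateExpression="REMOVE #owner ADD currentlockcount :minus_one",
        ExpressionAttributeNames={"#owner": execution_id},
        ExpressionAttributeValues={":minus_one": {"N": "-1"}},
        ConditionExpression="attribute_exists(#owner)",
    )

# The race: if acquire_lock()'s request is delayed in flight and only reaches
# DynamoDB *after* clean_from_incomplete() has already run (and found nothing
# to remove), the dead execution's entry is re-added and that slot leaks.
```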

I see different options to fix this:

1) Read the DynamoDB stream and launch a cleanup monitor every time the semaphore is locked. That way, if a step function somehow manages to lock it again after a cleanup has finished, it will still be unlocked by a second cleanup. The monitor could be an asynchronous Express workflow (at-least-once execution of each monitor is fine) that waits until the main step function that locked the semaphore finishes. This wouldn't be efficient, since each step function could lock and unlock the semaphore multiple times during normal operation and multiple monitors would be queued up, but the monitors only talk to DynamoDB after they think the main step function is dead, so they shouldn't interfere with "normal" step function execution.

2) Keep a log in DynamoDB of the step function IDs that have finished; those IDs are not allowed to lock the semaphore again. All executions (100%) need cleanup scheduled by EventBridge. Each step function can still lock and unlock the semaphore multiple times during normal operation, since only cleanup locks an ID out. One concern is storage space in DynamoDB: eventually you need to purge the log, although I believe combining this with an 'oldest_allowed_transaction_date' field that is updated daily will be 100% safe and effective as long as the Step Functions clock is roughly synced. Note: you could keep only 7 days of logs and disallow transactions older than 1 day (meaning it took more than 1 day between when the request was created and when it was processed by DynamoDB), and still let workflows run for an arbitrary amount of time, e.g. a year.

3) Have a scheduled task that checks the semaphore every minute and unlocks all locks that don't have an active step function ID. This is very low tech and has a delay/cost tradeoff in how often it runs, but at least it's reliable and easy to understand. It should be used only as a fallback; the step function should still try to unlock itself if possible.

4) Add a transaction counter on the semaphore and require all updates to be sequential (optimistic concurrency). Every mutating operation becomes two steps: read the current counter, then update it to N+1 with a condition that it was still at N as part of the update. The cleanup function launched by EventBridge always updates the semaphore (even if it wasn't locked), which increments the counter and prevents a delayed lock message from a dead function from being accepted; see the sketch after this list. I like this except for zombies (see below).
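A rough sketch of 4), reusing the placeholder names from the sketch above and assuming the lock item also carries a numeric version attribute:

```python
import boto3
from botocore.exceptions import ClientError

ddb = boto3.client("dynamodb")
TABLE = "SemaphoreTable"   # same hypothetical names as above
LOCK_NAME = "MySemaphore"


def read_version() -> str:
    item = ddb.get_item(
        TableName=TABLE,
        Key={"LockName": {"S": LOCK_NAME}},
        ConsistentRead=True,
    )["Item"]
    return item["version"]["N"]


def acquire_lock(execution_id: str, expected_version: str) -> None:
    # Take the lock only if nobody has touched the item since we read it,
    # bumping the version as part of the same conditional update.
    # (The usual "count < limit" / "not already held" conditions would be
    # ANDed in as well; omitted here to keep the sketch short.)
    ddb.update_item(
        TableName=TABLE,
        Key={"LockName": {"S": LOCK_NAME}},
        UpdateExpression="SET #owner = :now, version = :next ADD currentlockcount :one",
        ExpressionAttributeNames={"#owner": execution_id},
        ExpressionAttributeValues={
            ":now": {"S": "2023-01-01T00:00:00Z"},
            ":one": {"N": "1"},
            ":expected": {"N": expected_version},
            ":next": {"N": str(int(expected_version) + 1)},
        },
        ConditionExpression="version = :expected",
    )


def clean_from_incomplete(execution_id: str, expected_version: str) -> None:
    # Cleanup ALWAYS writes, even when there is nothing to remove, so the
    # version moves on and a delayed AcquireLock from the dead execution
    # fails its version check instead of silently re-taking the slot.
    # (Decrementing currentlockcount when an entry really was removed is
    # omitted for brevity.)
    try:
        ddb.update_item(
            TableName=TABLE,
            Key={"LockName": {"S": LOCK_NAME}},
            UpdateExpression="SET version = :next REMOVE #owner",
            ExpressionAttributeNames={"#owner": execution_id},
            ExpressionAttributeValues={
                ":expected": {"N": expected_version},
                ":next": {"N": str(int(expected_version) + 1)},
            },
            ConditionExpression="version = :expected",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise
        # Lost the race to another writer: re-read the version and retry.
```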

There's a pathological case where EventBridge sees an execution-finished event, but the step function or its child Lambdas are still alive in some way (e.g. zombies). If the zombies are still trying to take the lock, 4) would let it stay locked forever, while 1) and 3) would keep unlocking it, and 2) does not allow locks until you are past the log retention age. I'm not sure which is preferable, but zombie behavior needs to be specified, because zombies exist; see "AWS re:Invent 2020: Inversion of scale: Outnumbered but in control".

We would really like better concurrency control in AWS; a semaphore built into Step Functions would seem more reliable, faster, simpler, and more cost effective.

JustinCallison commented 3 years ago

I don't understand the statement "the DynamoDB lock message won't be delayed or reordered". Can you provide a bit more detail so that I can understand?

kjpgit commented 3 years ago

This sequence of events:

1) The updateItem message from AcquireLock is sent
2) The Step Function is canceled or otherwise fails
3) EventBridge calls cleanfromincomplete
4) cleanfromincomplete runs and calls getItem/updateItem

There is no guarantee preventing 1) from being received or processed after 4). In fact, if request buffers (like API Gateway) sit in front of DynamoDB's backend, it's very likely to be a problem.

JustinCallison commented 3 years ago

I see. It is technically possible that the DynamoDB Update call issued by Step Functions could fail to get all the way to the DDB table before SFN fails / aborts / times out, sends the Execution Status Change event to Event Bridge, EB triggers the cleanup state machine, and the cleanup state machine reads the DDB record. It's possible, but highly unlikely.

I think the right solution here is a "sweeper" that runs on a schedule and looks for locks recorded in the DDB record without a corresponding running execution. That way, even if a lock does get leaked (for this or for some other unforeseen reason), it will get picked up. And that state machine can do things like notify via SNS that something unexpected happened and that you should look into it. I had considered this when building the example but decided to leave it out for simplicity; it would be a good idea to add. I'm not sure when I'll be able to get to it, but I will add that.
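For illustration only, a sweeper Lambda along these lines might look something like the sketch below; the table layout (one top-level attribute per lock holder, keyed by execution ARN), the names, and the SNS topic are placeholders rather than the schema this example actually uses.

```python
import boto3

ddb = boto3.client("dynamodb")
sfn = boto3.client("stepfunctions")
sns = boto3.client("sns")

TABLE = "SemaphoreTable"   # hypothetical, as above
LOCK_NAME = "MySemaphore"
TOPIC_ARN = "arn:aws:sns:us-east-1:111111111111:lock-sweeper-alerts"  # hypothetical topic

# Attributes on the lock item that are not per-execution lock entries.
METADATA_ATTRS = {"LockName", "currentlockcount", "version"}


def sweep(event, context):
    """Scheduled sweeper (e.g. an EventBridge rate(1 minute) rule): remove
    lock entries whose owning execution is no longer running."""
    item = ddb.get_item(
        TableName=TABLE,
        Key={"LockName": {"S": LOCK_NAME}},
        ConsistentRead=True,
    ).get("Item", {})

    for attr in item:
        if attr in METADATA_ATTRS:
            continue
        execution_arn = attr  # assumes each holder is keyed by its execution ARN
        status = sfn.describe_execution(executionArn=execution_arn)["status"]
        if status == "RUNNING":
            continue

        # Leaked lock: the holder finished/failed but its entry is still here.
        ddb.update_item(
            TableName=TABLE,
            Key={"LockName": {"S": LOCK_NAME}},
            UpdateExpression="REMOVE #owner ADD currentlockcount :minus_one",
            ExpressionAttributeNames={"#owner": execution_arn},
            ExpressionAttributeValues={":minus_one": {"N": "-1"}},
            ConditionExpression="attribute_exists(#owner)",
        )
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="Leaked semaphore lock cleaned up",
            Message=f"Removed leaked lock entry for {execution_arn}",
        )
```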

Thanks!

Justin