Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions
MIT License

Add support for atomic StartNewAsync that would execute only if instance is not already running (for singleton orchestrator support) #367

Closed SimonLuckenuik closed 3 years ago

SimonLuckenuik commented 6 years ago

Could we add an atomic "StartNewAsyncIfNotRunningOrCompletedOrInError" (name to rework ;-) )? If the instance already exists, keep it running, ignore the start request, and report that it was already running; otherwise start a new instance.

Currently, we have to check the status and, if it is not running, then start it, which is not concurrency friendly at all (two requests could easily end up starting the workflow).
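(For context, a minimal sketch of the check-then-start pattern being described; the orchestrator name, instance ID, and client variable are illustrative, not code from this thread.)

    // Check-then-start: two concurrent requests can both pass the status check
    // and both call StartNewAsync, the second silently overwriting the first.
    var status = await client.GetStatusAsync(instanceId);
    if (status == null || status.RuntimeStatus != OrchestrationRuntimeStatus.Running)
    {
        await client.StartNewAsync("MyOrchestrator", instanceId, input);
    }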

I really want something like that for singleton orchestrator scenarios (example: a workflow singleton is started for each user identifier).

Another way would be to have support for Singleton (https://github.com/Azure/azure-functions-host/issues/912) with the Consumption plan so that I can safely check the status and call StartNewAsync in the same concurrency-safe scope.

jeffhollan commented 6 years ago

Like this feature idea. Out of curiosity, what scenario is motivating it? I know for Logic Apps, before we implemented singleton workflows, the motivation was often around timer triggers (set a recurrence of every 1 minute, but one instance may take 1 minute and 5 seconds to complete). So instead of having a cascade of overlapping instances, singleton provides a way to ensure only one job is running at a time. Not sure if you had other triggers in mind too.

That said, I do think this makes sense, and I imagine it would also require some type of locking so that two instances couldn't be created at the same time.

cgillum commented 6 years ago

I've received requests for this from a variety of sources, internal and external, and it's an adoption blocker for some. We should definitely implement this. (Singleton support in Azure Functions would be awesome too because it would make this less of an issue, but I won't block on it.)

@SimonLuckenuik the one caveat is that there are several people who expect to be able to "recreate" their named instances after they have already started. The most common case is when a singleton instance has failed or was terminated. For your use case, would it be acceptable to fail instance creation only if it is in a running state?

Here is what I'm currently thinking:

  1. If an instance with the specified ID is already running, StartNewAsync fails rather than silently overwriting it.
  2. If the existing instance is completed, failed, or terminated, StartNewAsync recreates it (preserving today's recreate behavior).

Would that behavior work?

SimonLuckenuik commented 6 years ago

@jeffhollan, my scenario is the following:

  1. Assume a workflow similar to the Human Interaction scenario.
  2. The workflow is not created upfront, only upon the first interaction with a user or a notification received through Event Grid.
  3. Both the user interaction and the notification can happen at the same time, so I need to make sure the workflow is started only once (currently, if the two triggers call StartNew one after the other, the second call cancels the first one and forces a restart of the workflow).

SimonLuckenuik commented 6 years ago

@cgillum, what you are suggesting should be sufficient for my need. Some comments below.

There are currently 2 overloads for StartNewAsync, one where the instanceId is generated and another where it is provided by the caller:

  A) Task<string> StartNewAsync(string orchestratorFunctionName, object input) --> returns the generated instanceId
  B) Task<string> StartNewAsync(string orchestratorFunctionName, string instanceId, object input) --> returns the provided instanceId

I am assuming the behavior you are describing only applies to B):

  1. StartNewAsync should throw an exception if a running instance with the specified ID already exists.
     1.1 Consider, instead of an exception, changing the return value to a boolean, an enum, or the orchestrator status to reflect whether a new instance was started. Maybe with a small naming change to TryStartNewAsync (a la Enum.TryParse), making it if-statement friendly (see the sketch after this list).

  2. StartNewAsync should recreate a terminated, failed, or completed instance with the specified ID if one already exists.
     2.1 Consider allowing the user to specify which statuses are acceptable; maybe Completed is not applicable for some use cases.

  3. StartNewAsync should be thread safe and Function App safe (i.e., safe under concurrency when the Function App is scaled out to multiple instances).
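(A minimal sketch of how the TryStartNewAsync shape suggested in 1.1 might read at a call site; the method, its boolean return, and the names used here are hypothetical and not part of the extension.)

    // Hypothetical API: TryStartNewAsync does not exist in the extension today.
    // It would return true if a new instance was started, false if an instance
    // with this ID was already running (in which case nothing is overwritten).
    bool started = await starter.TryStartNewAsync("MyOrchestrator", instanceId, input);
    if (!started)
    {
        log.LogInformation("Instance {InstanceId} already running; start ignored.", instanceId);
    }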

Concerning the caveat, personally, I don't like the current behavior of B), where starting something will in fact, in some cases, stop something and restart it. I would rather have the user explicitly terminate the workflow before being able to restart it.

cgillum commented 6 years ago

I just want to update this issue to clarify what happens today and what I think we need to do for the final implementation:

  1. StartNewAsync will always create a new instance, whether the ID is in use or not. If the ID is in use, that instance will be silently overwritten. We should change this behavior (though it is a breaking change). In order to make it thread-safe, we will need to also include a change in DurableTask.AzureStorage.

  2. StartNewAsync will recreate terminated, failed, or completed instances with the specified ID if one already exists. I think we should preserve this behavior moving forward.

  3. Orchestrator-specific messages like StartNewAsync are already affinitized to a particular worker instance, so even if things are scaled out, only a single worker would process duplicate messages. The exception to this is if there is a lease failover, it's possible that these messages may arrive on different VMs. Any logic we do for dup-detection/overwriting will need to use the appropriate synchronization mechanisms (e.g. eTags) to ensure that we get the correct behavior in all cases.
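(For illustration only, roughly what the eTag-based synchronization mentioned in point 3 looks like against an Azure Storage table, sketched with the Azure.Data.Tables client; this is not the actual DurableTask.AzureStorage code, and the table and property names are made up.)

    // Optimistic-concurrency sketch: exactly one concurrent creator wins the
    // insert; later writers must supply a matching ETag or their update fails.
    using Azure;
    using Azure.Data.Tables;

    string connectionString = "<storage connection string>";
    string instanceId = "user-123";

    var table = new TableClient(connectionString, "Instances");
    var entity = new TableEntity(instanceId, string.Empty) { ["RuntimeStatus"] = "Pending" };

    try
    {
        // Succeeds only if no row with this PartitionKey/RowKey exists yet.
        await table.AddEntityAsync(entity);
    }
    catch (RequestFailedException ex) when (ex.Status == 409)
    {
        // Another worker created the instance first. Read it, decide whether an
        // overwrite is allowed (e.g. only for Completed/Failed/Terminated), and
        // update with If-Match so a stale overwrite fails with 412 instead of
        // silently clobbering a newer write.
        TableEntity existing = await table.GetEntityAsync<TableEntity>(instanceId, string.Empty);
        existing["RuntimeStatus"] = "Pending";
        await table.UpdateEntityAsync(existing, existing.ETag, TableUpdateMode.Replace);
    }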

SimonLuckenuik commented 5 years ago

FYI, the Singleton attribute is working in Azure Functions, with specific considerations around billing: https://github.com/Azure/azure-functions-host/issues/912#issuecomment-419608830

We use that as a workaround to start the orchestrator for now; we accept being double billed for it.
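(For reference, a minimal sketch of that workaround, assuming the WebJobs [Singleton] attribute on an HTTP-triggered starter with the scope resolved from the route parameter, and the Durable Functions 2.x programming model; the function names, route, and orchestrator name are illustrative.)

    // [Singleton] serializes invocations of this starter per userId scope, so the
    // check-then-start below cannot race with itself. Caveat from the linked
    // issue: invocations waiting on the singleton lock still bill as executing
    // on the Consumption plan.
    using System.Threading.Tasks;
    using Microsoft.AspNetCore.Http;
    using Microsoft.AspNetCore.Mvc;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Extensions.DurableTask;
    using Microsoft.Azure.WebJobs.Extensions.Http;

    public static class StartUserWorkflow
    {
        [FunctionName("StartUserWorkflow")]
        [Singleton("{userId}")] // one invocation at a time per user identifier
        public static async Task<IActionResult> Run(
            [HttpTrigger(AuthorizationLevel.Function, "post", Route = "start/{userId}")] HttpRequest req,
            string userId,
            [DurableClient] IDurableOrchestrationClient starter)
        {
            var status = await starter.GetStatusAsync(userId);
            if (status == null || status.RuntimeStatus != OrchestrationRuntimeStatus.Running)
            {
                await starter.StartNewAsync("UserWorkflowOrchestrator", userId);
            }

            return new AcceptedResult();
        }
    }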

calloncampbell commented 4 years ago

Good day, I'm having issues with a Durable Function being stuck in Pending status on Azure (I don't get this issue when running locally). My function also follows the singleton pattern. Any status update on this issue?

Anything I can do to clean up my durable function state without blowing everything up?

cgillum commented 3 years ago

We finally got around to fixing the race conditions associated with the check/start pattern in https://github.com/Azure/durabletask/pull/528. This includes handling the case where multiple instances of the same app race with each other to create an instance. There won't be any new APIs, but issues such as getting stuck in the Pending state or having duplicate executions should no longer be observed starting in the v2.5.0 extension release. Closing this issue as fixed.

elliott-with-the-longest-name-on-github commented 2 years ago

@cgillum

Sorry to resurrect this, but I'm trying to figure out how these interactions work.

Our use case is a "windowed aggregator". Given an identifier (say, an employee ID), when a request is received, the orchestration starter should:

  1. Start an orchestrator with an ID matching the incoming ID if one does not already exist
  2. Send an event with the request data to the running orchestrator

Internally, the orchestrator does the following on startup:

  1. Starts a timer for a preset interval (in this specific case, one hour)
  2. Waits on incoming events
  3. Races 1 and 2 (so if the hour passes, it fails, and if 2 completes, it succeeds)

When it receives an event, it processes the event and checks whether it has received all events it expects to have received within the hour. (This is defined externally in an Azure Table.) If it has, 2 completes and the orchestration exits successfully. If it has not, it waits for more events, or until 1 finishes.
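(For concreteness, a minimal sketch of an orchestrator with that shape, assuming the 2.x in-process C# model; the DependencyEvent payload type and the CheckAllDependenciesReceived activity are placeholders, while the function and event names reuse the ones from the starter snippet below.)

    using System;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Extensions.DurableTask;

    // Hypothetical payload type for the "new-dependency" event.
    public class DependencyEvent
    {
        public string Name { get; set; }
    }

    public static class SubscriptionOrchestration
    {
        [FunctionName("subscription-orchestration")]
        public static async Task Run(
            [OrchestrationTrigger] IDurableOrchestrationContext context)
        {
            string subId = context.GetInput<string>();

            using var windowCancellation = new CancellationTokenSource();

            // 1. Durable timer for the one-hour aggregation window.
            DateTime deadline = context.CurrentUtcDateTime.AddHours(1);
            Task window = context.CreateTimer(deadline, windowCancellation.Token);

            while (true)
            {
                // 2. Wait for the next incoming event from the starter.
                Task<DependencyEvent> next =
                    context.WaitForExternalEvent<DependencyEvent>("new-dependency");

                // 3. Race the next event against the window timer.
                Task winner = await Task.WhenAny(next, window);
                if (winner == window)
                {
                    throw new TimeoutException($"Aggregation window expired for {subId}.");
                }

                // Hypothetical activity that records the event and consults the
                // external Azure Table to decide whether all expected events arrived.
                bool done = await context.CallActivityAsync<bool>(
                    "CheckAllDependenciesReceived", next.Result);
                if (done)
                {
                    windowCancellation.Cancel(); // release the timer so the instance can complete
                    return;
                }
            }
        }
    }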

Since the orchestrator is internally queueing events, we don't have to worry about race conditions internally. However, I'm seeing some results that suggest to me that the HTTP starter endpoint is racing.

It appears that, if two requests are received simultaneously, they will both start the orchestrator (one overwriting the other). If one of them has already made it to the "send an event" line when the other reaches "start the orchestrator", the orchestrator is replaced, swallowing the event and resulting in the eventual failure of the orchestrator.

It seems like a function similar to the one proposed above would still be extremely useful for a case like this. Instead of:

                var existingInstance = await starter.GetStatusAsync(subId);

                if (existingInstance == null
                    || existingInstance.RuntimeStatus == OrchestrationRuntimeStatus.Completed
                    || existingInstance.RuntimeStatus == OrchestrationRuntimeStatus.Failed
                    || existingInstance.RuntimeStatus == OrchestrationRuntimeStatus.Terminated)
                {
                    await starter.StartNewAsync("subscription-orchestration", subId, subId);
                }

                await starter.RaiseEventAsync(subId, "new-dependency", newDependency);

We could have:

               await starter.StartNewButNotReplaceAsync(id); // idempotent during an orchestrator's lifespan
               await starter.RaiseEventAsync(id, "name", input);

The problem to me, it seems, is that if the stars align just so, it's possible to overwrite an existing instance after it has already received events, with no possible way to prevent it. Consider a super-contrived example:

                // Part 1: one starter invocation
                var existingInstance = await starter.GetStatusAsync(subId);
                await Task.Delay(10000); // any slow network call here widens the race window
                // check status, start and send event

                // Part 2: a second, concurrent starter invocation
                var existingInstance = await starter.GetStatusAsync(subId);
                // check status, start and send event

It seems obvious to me that with the delay there (which could be any sort of blocking network operation), Part 1 would simply replace the orchestrator that had already processed the event from Part 2.

Am I wrong about this? Is there some other behavior I'm experiencing?