Azure / azure-functions-durable-python

Python library for using the Durable Functions bindings.

DurableOrchestrationClient.start_new raises an exception when an orchestrator instance with the same instance id already exists, instead of silently replacing it. #410

Closed (gegetoth closed this issue 7 months ago)

gegetoth commented 1 year ago

🐛 Describe the bug

In the Python azure-functions-durable 1.1.6 SDK, the documentation of the DurableOrchestrationClient.start_new() method says:

Start a new instance of the specified orchestrator function. If an orchestration instance with the specified ID already exists, the existing instance will be silently replaced by this new instance.

Instead of silently replacing a Pending instance, it raises an exception:

Exception: Exception: {'Message': 'Something went wrong while processing your request', 'ExceptionMessage': 'An Orchestration instance with the status Pending already exists.', 'ExceptionType': 'System.InvalidOperationException', 'StackTrace': ' at DurableTask.AzureStorage.AzureStorageOrchestrationService.CreateTaskOrchestrationAsync(TaskMessage creationMessage, OrchestrationStatus[] dedupeStatuses) in //src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs:line 1570\n at DurableTask.Core.TaskHubClient.InternalCreateOrchestrationInstanceWithRaisedEventAsync(String orchestrationName, String orchestrationVersion, String orchestrationInstanceId, Object orchestrationInput, IDictionary2 orchestrationTags, OrchestrationStatus[] dedupeStatuses, String eventName, Object eventData, Nullable1 startAt) in //src/DurableTask.Core/TaskHubClient.cs:line 614\n at Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableClient.Microsoft.Azure.WebJobs.Extensions.DurableTask.IDurableOrchestrationClient.StartNewAsync[T](String orchestratorFunctionName, String instanceId, T input) in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\ContextImplementations\DurableClient.cs:line 210\n at Microsoft.Azure.WebJobs.Extensions.DurableTask.HttpApiHandler.HandleStartOrchestratorRequestAsync(HttpRequestMessage request, String functionName, String instanceId) in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\HttpApiHandler.cs:line 875\n at Microsoft.Azure.WebJobs.Extensions.DurableTask.HttpApiHandler.HandleRequestAsync(HttpRequestMessage request) in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\HttpApiHandler.cs:line 306'}

🤔 Expected behavior

What should have happened?

Based on the documentation, I would expect it to replace the existing instance with a new one without raising an exception.

Steps to reproduce

An orchestration client triggers an orchestrator function. In the orchestration client it checks the status:

AZURE_INACTIVE_ORCHESTRATOR_STATUS_SET = {
    OrchestrationRuntimeStatus.Completed,
    OrchestrationRuntimeStatus.Failed,
    OrchestrationRuntimeStatus.Terminated,
    OrchestrationRuntimeStatus.Pending,
    None,
}

if instance.runtime_status in AZURE_INACTIVE_ORCHESTRATOR_STATUS_SET:
    instance_id = await client.start_new(
        INSTANCE_ID, instance_id=INSTANCE_ID, client_input={'time': time_list[i]}
    )
    logging.info(f"Started orchestration with ID = '{instance_id}'.")

So if the orchestration instance is in Pending, it should create a new one. The status of the orchestration instance can be flipped in the corresponding storage account table.

It can be reproduced locally in VS Code as well.

davidmrdavid commented 1 year ago

Hi @gegetoth, thanks for reaching out.

It seems you've discovered a bug in our documentation, which I've just filed here.

We cannot reliably silently replace instanceIDs that are in the Pending state. In reality, replacement is done on a best-effort basis: there is a race condition where an orchestrator has been scheduled for creation but that information has not yet propagated throughout our storage. I'll be modifying our documentation to avoid confusion about this edge case.

Please let me know if this answers your question, and apologies for the confusion.

gegetoth commented 1 year ago

Hi @davidmrdavid,

Thank you for clarifying it!

I wonder how the orchestrator function can be moved out of the Pending runtime state? We have a problem with instances getting stuck in the Pending status and not being able to run afterwards, not even when the given App Service plan utilization is below 50%.

amdeel commented 1 year ago

Hey @gegetoth. It's definitely cause for investigation if your instance is not making progress out of Pending. Could you provide us more details so we can look into this (DF Version, AppName, TimeStamps, InstanceIds, etc.)?

Regarding the issue you originally opened this for, we do have an issue with outdated documentation here. We have a setting called OverridableExistingInstanceStates that defaults to OverridableStates.NonRunningStates. This means that by default instances will not be overridden when in running states. You can change this by specifying OverridableStates.AnyState instead.

davidmrdavid commented 1 year ago

@amdeel: thanks for the note. Just for clarity, OverridableExistingInstanceStates is a host.json setting, correct?

gegetoth commented 1 year ago

Hi @amdeel,

Could you please give me a bit more detail or a link where I can read a bit more about the OverridableExistingInstanceStates setting? Does it mean that if this is set to OverridableStates.AnyState, then client.start_new() will replace the Pending instance?

amdeel commented 1 year ago

@gegetoth @davidmrdavid Hmm, apparently this isn't included in our host.json reference documentation either. Here is the .NET documentation for this: OverridableExistingInstanceStates.

This is a DurableTaskOption, which means you include this setting in your host.json under durableTask (see the host.json reference).

And yes, you will be able to override any state if you turn this setting on. That being said, if you're having issues where instances are stuck in Pending, it's worth investigating that issue, because overriding the instance won't solve the underlying problem.
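
For reference, a minimal host.json fragment with this setting in place might look like the sketch below (the default value is NonRunningStates; the rest of your durableTask block stays as-is):

{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "overridableExistingInstanceStates": "AnyState"
    }
  }
}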

gegetoth commented 1 year ago

@amdeel thank you for all the info!

Unfortunately, this issue with the Pending state comes up quite frequently. We tried scaling up/down and stopping/restarting. The only solution that worked was deleting the storage account table; after that the orchestrator function can be triggered again.

Please find the details of one of the affected function below: Function app name: dev-agg-curve-balancer-fn Timestamp: 2022-12-06T11:11:00.0049626Z (UTC) Orchestration instance id: orchestrate2022-06-08 Message from the orchestration client: Result: Failure Exception: Exception: {'Message': 'Something went wrong while processing your request', 'ExceptionMessage': 'An Orchestration instance with the status Pending already exists.', 'ExceptionType': 'System.InvalidOperationException', 'StackTrace': ' at DurableTask.AzureStorage.AzureStorageOrchestrationService.CreateTaskOrchestrationAsync(TaskMessage creationMessage, OrchestrationStatus[] dedupeStatuses) in //src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs:line 1570\n at DurableTask.Core.TaskHubClient.InternalCreateOrchestrationInstanceWithRaisedEventAsync(String orchestrationName, String orchestrationVersion, String orchestrationInstanceId, Object orchestrationInput, IDictionary2 orchestrationTags, OrchestrationStatus[] dedupeStatuses, String eventName, Object eventData, Nullable1 startAt) in /_/src/DurableTask.Core/TaskHubClient.cs:line 614\n at Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableClient.Microsoft.Azure.WebJobs.Extensions.DurableTask.IDurableOrchestrationClient.StartNewAsync[T](String orchestratorFunctionName, String instanceId, T input) in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\ContextImplementations\DurableClient.cs:line 210\n at Microsoft.Azure.WebJobs.Extensions.DurableTask.HttpApiHandler.HandleStartOrchestratorRequestAsync(HttpRequestMessage request, String functionName, String instanceId) in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\HttpApiHandler.cs:line 875\n at Microsoft.Azure.WebJobs.Extensions.DurableTask.HttpApiHandler.HandleRequestAsync(HttpRequestMessage request) in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\HttpApiHandler.cs:line 306'} Stack: File "/azure-functions-host/workers/python/3.9/LINUX/X64/azure_functions_worker/dispatcher.py", line 398, in _handle__invocation_request call_result = await self._run_async_func( File "/azure-functions-host/workers/python/3.9/LINUX/X64/azure_functions_worker/dispatcher.py", line 617, in _run_async_func return await ExtensionManager.get_async_invocation_wrapper( File "/azure-functions-host/workers/python/3.9/LINUX/X64/azure_functions_worker/extension.py", line 147, in get_async_invocation_wrapper result = await function(args) File "/home/site/wwwroot/analytics/azure/utils.py", line 304, in wrapper return await wrapped(*args, *kwargs) File "/home/site/wwwroot/start_history/init.py", line 14, in main await start_history_function(starter=starter, func_name='orchestrate', instance_id=INSTANCE_ID) File "/home/site/wwwroot/analytics/azure/utils.py", line 207, in start_history_function await start_daily_function( File "/home/site/wwwroot/analytics/azure/utils.py", line 173, in start_daily_function await starter_func( File "/home/site/wwwroot/analytics/azure/utils.py", line 120, in wrapper return await wrapped(starter=starter, args, kwargs) File "/home/site/wwwroot/analytics/azure/utils.py", line 325, in wrapper return await wrapped(*args, **kwargs) File "/home/site/wwwroot/analytics/azure/utils.py", line 157, in start_function instance_id = await client.start_new( File "/home/conda/lib/python3.9/site-packages/azure/durable_functions/models/DurableOrchestrationClient.py", line 89, in start_new raise Exception(ex_message)

host.json:

{
  "version": "2.0",
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "excludedTypes": "Request"
      }
    }
  },
  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[3.3.0, 4.0.0)"
  },
  "extensions": {
    "durableTask": {
      "maxConcurrentOrchestratorFunctions": 2,
      "useGracefulShutdown": true
    }
  }
}

OperationId: aa2cf6684015b1f6f281b6e64e9588cb
cloud_RoleInstance: 81ca0f1ebe20ebb2e01839b112db3ed01a0902fe0b27ebcf36e8dcc30fd7f501
Python SDK versions:

davidmrdavid commented 1 year ago

@gegetoth: we've seen Python orchestrators being under-utilized when users don't tweak their Python Azure Functions concurrency settings. Have you reviewed this guidance? https://learn.microsoft.com/en-us/azure/azure-functions/python-scale-performance-reference#use-multiple-language-worker-processes

In particular, you'll want to read the sections relating to adding more worker processes to your app, as well as adding more worker threads. With those two settings, I suspect you'll find that your orchestrators exit the Pending state faster. Please let me know if this is something you've already explored!
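
For reference, those two knobs are application settings on the Function App; a sketch of how they would appear in local.settings.json for local testing is below (the values are illustrative, not recommendations):

{
  "IsEncrypted": false,
  "Values": {
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "FUNCTIONS_WORKER_PROCESS_COUNT": "4",
    "PYTHON_THREADPOOL_THREAD_COUNT": "8"
  }
}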

lilyjma commented 1 year ago

@gegetoth - were you able to try out David's suggestions?

gegetoth commented 1 year ago

Hi @lilyjma ,

I tried @davidmrdavid's suggestion. Unfortunately it didn't solve the problem; it was just a temporary fix. I increased the worker process count and the thread count, but the same issue came up: the orchestrator ends up in a Pending state after a while.

I would expect the function to recover if the CPU utilization were too high and the process were throttled, but the CPU is not over-utilized, so that should not be the problem. The orchestrator instance gets stuck in Pending for days, and it recovers only if I kill the supervisor process (PID 18 on the instance). This seems like a bug in Durable Functions, and the unreliability causes problems for us. Even with just 4 orchestrator instances and one timer trigger for each of the functions, it can end up stuck in Pending and never recovers by itself.

My expectation would be that if the CPU is over-utilized, the process is throttled and it processes the tasks from the queue when it can. I would understand the behaviour if utilization were above 90% and resources were exhausted, but this is not the case.

gegetoth commented 1 year ago

I noticed that during a normal run of the orchestrator function, the runtime state of the function gets flipped to Pending and afterwards to Completed. What is responsible for changing the status? Is it the process itself or the WebHost process?

(screenshot)

If I kill the WebHost process when the orchestrator instances are stuck in Pending, they recover within a few minutes, and those instances then show, let's say, 3 days of execution time if they were in Pending for 3 days. So what is responsible for defining the execution time of these instances and flipping the runtime status to Completed?

gegetoth commented 1 year ago

One more thing I noticed actually today.

We have 2 orchestrator instances: orchestrate_history and orchestrate_da. Based on the existence of the data in the db, each either starts an entity instance with the date as an id, or terminates the execution because the data is already available in the db and there is no need to run the entity. Both orchestrators do the same thing, just with different parameters. The orchestrate_history instance got stuck in the Pending state 3 days ago. But what we noticed is that although orchestrate_da wasn't in Pending, got triggered, and also started the new entity instance with the actual date (we could see from the logs that the signalling happened), the entity instance was never created. There was no instance for the last 3 days in the storage account table, no logs, nothing. This would suggest that maybe something went wrong with the dispatcher process. Is it possible to gather logs from the WebHost process?

davidmrdavid commented 1 year ago

Hi @gegetoth:

For us to be able to diagnose this, and to answer your questions, we'll need to look at the telemetry for an orchestrator instanceID that is experiencing this issue. I know you listed one earlier in the thread, but a refresher example would be best at this point (our telemetry only lasts so long).

Can you please provide me with a newer instanceID, and a timerange in UTC showing an orchestrator being stuck in the Pending status for too long? Thanks! In fact, the more examples you can provide me with, the better! Thanks :)

gegetoth commented 1 year ago

Hi @davidmrdavid ,

Will provide the instance ID and the time stamp when it happens next.

gegetoth commented 1 year ago

Hi @davidmrdavid ,

Managed to capture the issue this week. The first name is the instance id, the second is the name of the actual function.

prod-epex-scraper-func:
orchestrate_parse_order_da orchestrate_parse 2023-02-18T23:10:00Z 20ms Pending
orchestrate_parse_order_history orchestrate_parse 2023-02-18T23:07:00Z 13ms Pending
orchestrate_parse_result_da orchestrate_parse 2023-02-18T23:07:00Z 14ms Pending
orchestrate_parse_result_history orchestrate_parse 2023-02-18T23:05:00Z 16ms Pending
orchestrate_scrape_archive_order orchestrate_scrape 2023-02-18T23:10:00Z 13ms Pending
orchestrate_scrape_archive_result orchestrate_scrape 2023-02-18T23:05:00Z 20ms Pending
orchestrate_scrape_current_order orchestrate_scrape 2023-02-18T23:05:00Z 31ms Pending
orchestrate_scrape_current_result orchestrate_scrape 2023-02-18T23:17:00Z 14ms Pending

test-epex-scraper-func:
orchestrate_parse_order_da orchestrate_parse 2023-02-16T12:40:00Z 26ms Pending
orchestrate_parse_order_history orchestrate_parse 2023-02-16T12:37:00Z 42ms Pending
orchestrate_parse_result_da orchestrate_parse 2023-02-16T12:37:00Z 26ms Pending
orchestrate_parse_result_history orchestrate_parse 2023-02-16T12:35:00Z 130ms Pending
orchestrate_scrape_archive_order orchestrate_scrape 2023-02-16T12:40:00Z 48ms Pending
orchestrate_scrape_archive_result orchestrate_scrape 2023-02-16T12:35:00Z 53ms Pending
orchestrate_scrape_current_order orchestrate_scrape 2023-02-16T12:35:00Z 22ms Pending
orchestrate_scrape_current_result orchestrate_scrape 2023-02-16T12:32:00Z 138ms Pending

davidmrdavid commented 1 year ago

@gegetoth: Is the 4th value, the one in ms, the amount of time it was stuck in the Pending state? I'm trying to figure out whether these orchestrators are in the Pending state for milliseconds, or longer. If not, do you know how long they were stuck in Pending?

davidmrdavid commented 1 year ago

Also, for instanceID orchestrate_parse_order_da, I'm seeing it go directly from the "Pending" status to "Completed" repeatedly. Normally, I would expect it to go from Pending to Running and then Completed. If you're going directly from Pending to Completed, it would imply that you're not scheduling any work inside the orchestrator. Is that the case? I assume it's not, which may be indicative of the problem here.

gegetoth commented 1 year ago

Hi @davidmrdavid, I think that is the duration since it was last updated, but there are no more state updates until I manage to recover it, for example by restarting the function. In that case the first state update I see is that it turns into Completed, but the runtime is basically the time it spent in Pending. So if I wait for days, it will suggest that the runtime was days.

The orchestrator starts multiple entity function instances, but it doesn't wait for those to finish. If data is available for the given dates, it doesn't have to trigger the entity instances and just finishes the execution.

davidmrdavid commented 1 year ago

Hi @gegetoth, thanks for the response.

I need your help to get a bit more clarity here still. One challenge is deducing when the logs are showing me updates in response to your recovery steps, and when I'm seeing the issue without any external intervention, so I need your help reconstructing a timeline of events.

For a given instanceID that gets stuck in the Pending state for a long time, can you please provide me with the following?

In short, I need a timeline of events for a given instanceID. Feel free to add your observations regarding how this orchestrator interacted with its assigned Entity instance. For example, you mentioned in a previous message that these Entities keep track of some date, but I'm not certain how that relates to this issue. If you could add these details to the timeline, as well as comments on your expectations, that will make it a lot easier for me to understand the logs I'm seeing.

Thanks!

gegetoth commented 1 year ago

Hi @davidmrdavid ,

Managed to capture one.

App name: dev-epex-scraper-func
Functions: orchestrate_parse, orchestrate_scrape
Instance IDs:

The 1st timestamp is when the orchestrator got into the Pending status according to the storage account table. The 2nd is when the client function tried to start the orchestrator instance and the error message ended up in the logs.

(screenshot)

If we focus on the first instance, orchestrate_parse_result_history:

(screenshot)

It was started at 14:50 and scheduled for 15:05, which is correct because it should run every 15 minutes. But it never completed and has been in the Pending state since 15:05. When it runs fine, it ends up in a Completed state like this:

(screenshot)

If I check the invocation history, I can see only one invocation, which was triggered at 14:50:

(screenshot)

And it also has a duration and Success state.

When I check the invocation history of the function, there is no record for 15:05.

(screenshot)

As far as I can see, these are only recorded when the invocation is finished, either with Success or Failure. So the absence of a record indicates to me that the system thinks it is still running.

The operation ID is: 83f532af986ea6cead8c7e9d0a715f26

The only log I can see running the following query is:

union traces
| union exceptions
| where timestamp > ago(30d)
| where operation_Id == '83f532af986ea6cead8c7e9d0a715f26'
| order by timestamp asc
| project timestamp, message = iff(message != '', message, iff(innermostMessage != '', innermostMessage, customDimensions.['prop__{OriginalFormat}'])), logLevel = customDimensions.['LogLevel'], customDimensions['InvocationId'], customDimensions

(screenshot)

The only information is that it was scheduled.

I hope the things above make sense.

gegetoth commented 1 year ago

Killed the WebHost process around 19:10. Now I can see that the instance is terminated:

(screenshot)

davidmrdavid commented 1 year ago

Thanks @gegetoth, this gives me a lot of context. Just to be sure - do you also have a ticket with Azure support for this very same app and same issue? I happen to be investigating a very similar issue, for an application name that I believe is the same as the one you've provided, and I wanted to confirm with you if that was the case.

gegetoth commented 1 year ago

Hi @davidmrdavid,

That is correct, we have opened an issue with Azure support. That conversation has been going on for a while now, and I opened this GitHub issue before the Azure ticket was escalated to the technical product team. Which communication channel do you prefer, Azure support or the GitHub issue?

davidmrdavid commented 1 year ago

I think it's ok to have both open, and I'll try to keep both up to date, though I may prioritize our official support channel when pressed for time. And yes, we've been working with the support team for at least a few weeks now. This is a tricky case, but it's now on the product group's (our) radar, so we're actively working on it.

Let me try to summarize what I'm seeing in the logs. Please note this is still a working understanding of your case, so it may change as we get more telemetry. At this time, I'm frequently seeing that your application is being requested to restart and, as is normal, it blocks that restart operation until it completes any outstanding work already allocated on the VM. However, the application is getting stuck trying to complete that outstanding work, therefore blocking indefinitely on some orchestrators. The tricky part is understanding why it's stuck at this step. I have several theories, but I'll need to add a few more logs in the right places to be able to say definitively why.

I'm preparing another private package that we would like you to try. I'll be sharing that with our support point of contact shortly, and I'll look to update this thread as well to let you know once we've done that. Thanks so much for your patience with us in this case.

davidmrdavid commented 1 year ago

Hi @gegetoth:

I wanted to reach out directly here in case you needed assistance in installing the latest private packages we've prepared for you. From my latest discussion with the support engineer working on your case, it appears that the latest private packages were not successfully installed, and we need them to get access to custom new telemetry we've prepared for your case.

Just for clarity, these are the two new packages we would like you to install:

To help diagnose this, it would help us if you could copy-paste your extensions.csproj file and your host.json in a reply to this thread. It would also be good if you could repeat the installation steps after deleting the ./obj/ and ./bin/ directories in your application's root. That will help ensure you're not utilizing a previous build of these dependencies.

Finally, there's a local test you can do to verify that the right packages are being installed. If you start your application locally via func host start --verbose, search for the log starting with the string "Starting task hub worker."

The full log should read something like this:

"Starting task hub worker. InstanceId: . Function: . HubName: TestHubName. AppName: . SlotName: . ExtensionVersion: 2.9.3. SequenceNumber: 1."

Notice above the substring that reads "ExtensionVersion: 2.9.3". If your log says ExtensionVersion 2.9.1 or 2.9.2, then the packages may not have been correctly installed. Could you please try this out and let me know what you get?
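
For instance, assuming a Unix-like shell, you can filter the verbose output to make that line easier to spot:

func host start --verbose 2>&1 | grep "Starting task hub worker"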

Also, after performing the validation above, please let us know if you believe you've successfully deployed these private packages to an app (and let us know the app and the starting timestamp of the change in UTC) so that I can validate on our telemetry that the right packages were picked up. Thanks!

gegetoth commented 1 year ago

Hi @davidmrdavid ,

I provided the update via the support channel. Managed to install the extensions.

Now for some reason a big number of functions are having issues.

One error I noticed in the logs:

2023-03-18T18:22:51.490546477Z fail: DurableTask.AzureStorage[122]

2023-03-18T18:22:51.490552077Z devaggcurvebalancerfn-applease: Error in AppLeaseStarter task. Exception: System.InvalidOperationException: AppLeaseManager has already started

2023-03-18T18:22:51.490555777Z at DurableTask.AzureStorage.Partitioning.AppLeaseManager.StartAppLeaseAsync() in /_/src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 245

2023-03-18T18:22:51.490559577Z at DurableTask.AzureStorage.Partitioning.AppLeaseManager.AppLeaseManagerStarter(CancellationToken cancellationToken) in /_/src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 140

2023-03-18T18:22:51.621821613Z fail: DurableTask.AzureStorage[122]

2023-03-18T18:22:51.621848313Z devaggcurvebalancerfn-applease: Error in AppLeaseStarter task. Exception: System.InvalidOperationException: AppLeaseManager has already started

2023-03-18T18:22:51.621853613Z at DurableTask.AzureStorage.Partitioning.AppLeaseManager.StartAppLeaseAsync() in /_/src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 245

2023-03-18T18:22:51.621857913Z at DurableTask.AzureStorage.Partitioning.AppLeaseManager.AppLeaseManagerStarter(CancellationToken cancellationToken) in /_/src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 140

2023-03-18T18:22:51.621862313Z MS_FUNCTION_AZURE_MONITOREVENT 2,dev-agg-curve-balancer-fn.azurewebsites.net,Microsoft.Web/sites/functions/log,FunctionAppLogs,,"{'appName':'dev-agg-curve-balancer-fn','roleInstance':'e5cdd3a2b3ea6b1bab30cd486062b2229f312de4effacae6c0dc28fc4821902f','message':'devaggcurvebalancerfn-applease: Error in AppLeaseStarter task. Exception: System.InvalidOperationException: AppLeaseManager has already started\n at DurableTask.AzureStorage.Partitioning.AppLeaseManager.StartAppLeaseAsync() in //src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 245\n at DurableTask.AzureStorage.Partitioning.AppLeaseManager.AppLeaseManagerStarter(CancellationToken cancellationToken) in /_/src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 140','category':'DurableTask.AzureStorage','hostVersion':'4.1.1.1','hostInstanceId':'bd603a9b-aa2b-4981-b89f-f09052bbcb1c','level':'Error','levelId':4,'processId':18,'eventId':122,'eventName':'PartitionManagerError'}",03/18/2023 18:22:51

2023-03-18T18:22:51.702185419Z fail: DurableTask.AzureStorage[122]

2023-03-18T18:22:51.702215319Z devaggcurvebalancerfn-applease: Error in AppLeaseStarter task. Exception: System.InvalidOperationException: AppLeaseManager has already started

2023-03-18T18:22:51.702220819Z at DurableTask.AzureStorage.Partitioning.AppLeaseManager.StartAppLeaseAsync() in /_/src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 245

2023-03-18T18:22:51.702224819Z at DurableTask.AzureStorage.Partitioning.AppLeaseManager.AppLeaseManagerStarter(CancellationToken cancellationToken) in /_/src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 140

2023-03-18T18:22:51.702229119Z MS_FUNCTION_AZURE_MONITOREVENT 2,dev-agg-curve-balancer-fn.azurewebsites.net,Microsoft.Web/sites/functions/log,FunctionAppLogs,,"{'appName':'dev-agg-curve-balancer-fn','roleInstance':'e5cdd3a2b3ea6b1bab30cd486062b2229f312de4effacae6c0dc28fc4821902f','message':'devaggcurvebalancerfn-applease: Error in AppLeaseStarter task. Exception: System.InvalidOperationException: AppLeaseManager has already started\n at DurableTask.AzureStorage.Partitioning.AppLeaseManager.StartAppLeaseAsync() in //src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 245\n at DurableTask.AzureStorage.Partitioning.AppLeaseManager.AppLeaseManagerStarter(CancellationToken cancellationToken) in /_/src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 140','category':'DurableTask.AzureStorage','hostVersion':'4.1.1.1','hostInstanceId':'bd603a9b-aa2b-4981-b89f-f09052bbcb1c','level':'Error','levelId':4,'processId':18,'eventId':122,'eventName':'PartitionManagerError'}",03/18/2023 18:22:51

2023-03-18T18:22:51.794090155Z fail: DurableTask.AzureStorage[122]

2023-03-18T18:22:51.794138755Z devaggcurvebalancerfn-applease: Error in AppLeaseStarter task. Exception: System.InvalidOperationException: AppLeaseManager has already started

2023-03-18T18:22:51.794155955Z at DurableTask.AzureStorage.Partitioning.AppLeaseManager.StartAppLeaseAsync() in /_/src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 245

2023-03-18T18:22:51.794172755Z at DurableTask.AzureStorage.Partitioning.AppLeaseManager.AppLeaseManagerStarter(CancellationToken cancellationToken) in /_/src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 140

2023-03-18T18:22:51.794177355Z MS_FUNCTION_AZURE_MONITOREVENT 2,dev-agg-curve-balancer-fn.azurewebsites.net,Microsoft.Web/sites/functions/log,FunctionAppLogs,,"{'appName':'dev-agg-curve-balancer-fn','roleInstance':'e5cdd3a2b3ea6b1bab30cd486062b2229f312de4effacae6c0dc28fc4821902f','message':'devaggcurvebalancerfn-applease: Error in AppLeaseStarter task. Exception: System.InvalidOperationException: AppLeaseManager has already started\n at DurableTask.AzureStorage.Partitioning.AppLeaseManager.StartAppLeaseAsync() in //src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 245\n at DurableTask.AzureStorage.Partitioning.AppLeaseManager.AppLeaseManagerStarter(CancellationToken cancellationToken) in /_/src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 140','category':'DurableTask.AzureStorage','hostVersion':'4.1.1.1','hostInstanceId':'bd603a9b-aa2b-4981-b89f-f09052bbcb1c','level':'Error','levelId':4,'processId':18,'eventId':122,'eventName':'PartitionManagerError'}",03/18/2023 18:22:51

2023-03-18T18:22:51.794194055Z fail: DurableTask.AzureStorage[122]

2023-03-18T18:22:51.794198955Z devaggcurvebalancerfn-applease: Error in AppLeaseStarter task. Exception: System.InvalidOperationException: AppLeaseManager has already started

2023-03-18T18:22:51.794203055Z at DurableTask.AzureStorage.Partitioning.AppLeaseManager.StartAppLeaseAsync() in /_/src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 245

2023-03-18T18:22:51.794207855Z at DurableTask.AzureStorage.Partitioning.AppLeaseManager.AppLeaseManagerStarter(CancellationToken cancellationToken) in /_/src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 140

2023-03-18T18:22:51.794212055Z MS_FUNCTION_AZURE_MONITOREVENT 2,dev-agg-curve-balancer-fn.azurewebsites.net,Microsoft.Web/sites/functions/log,FunctionAppLogs,,"{'appName':'dev-agg-curve-balancer-fn','roleInstance':'e5cdd3a2b3ea6b1bab30cd486062b2229f312de4effacae6c0dc28fc4821902f','message':'devaggcurvebalancerfn-applease: Error in AppLeaseStarter task. Exception: System.InvalidOperationException: AppLeaseManager has already started\n at DurableTask.AzureStorage.Partitioning.AppLeaseManager.StartAppLeaseAsync() in //src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 245\n at DurableTask.AzureStorage.Partitioning.AppLeaseManager.AppLeaseManagerStarter(CancellationToken cancellationToken) in /_/src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 140','category':'DurableTask.AzureStorage','hostVersion':'4.1.1.1','hostInstanceId':'bd603a9b-aa2b-4981-b89f-f09052bbcb1c','level':'Error','levelId':4,'processId':18,'eventId':122,'eventName':'PartitionManagerError'}",03/18/2023 18:22:51

I don't know what this means, but it seems like it pushes a huge amount of error logs to the stream.

gegetoth commented 1 year ago

This is the beginning of those error messages:

2023-03-18T19:39:38.669568843Z fail: DurableTask.AzureStorage[122]

2023-03-18T19:39:38.669612842Z devaggcurvebalancerfn-applease: Error in AppLeaseStarter task. Exception: System.NullReferenceException: Object reference not set to an instance of an object.

2023-03-18T19:39:38.669619042Z at DurableTask.AzureStorage.Partitioning.LeaseCollectionBalancer`1.RenewLeaseAsync(T lease) in /_/src/DurableTask.AzureStorage/Partitioning/LeaseCollectionBalancer.cs:line 591

2023-03-18T19:39:38.669623442Z at DurableTask.AzureStorage.Partitioning.LeaseCollectionBalancer`1.InitializeAsync() in /_/src/DurableTask.AzureStorage/Partitioning/LeaseCollectionBalancer.cs:line 92

2023-03-18T19:39:38.669627642Z at DurableTask.AzureStorage.Partitioning.SafePartitionManager.DurableTask.AzureStorage.Partitioning.IPartitionManager.StartAsync() in /_/src/DurableTask.AzureStorage/Partitioning/SafePartitionManager.cs:line 127

2023-03-18T19:39:38.669645841Z at DurableTask.AzureStorage.Partitioning.AppLeaseManager.StartAppLeaseAsync() in /_/src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 250

2023-03-18T19:39:38.669650341Z at DurableTask.AzureStorage.Partitioning.AppLeaseManager.AppLeaseManagerStarter(CancellationToken cancellationToken) in /_/src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 140

2023-03-18T19:39:38.669657041Z MS_FUNCTION_AZURE_MONITOREVENT 2,dev-agg-curve-balancer-fn.azurewebsites.net,Microsoft.Web/sites/functions/log,FunctionAppLogs,,"{'appName':'dev-agg-curve-balancer-fn','roleInstance':'e5cdd3a2b3ea6b1bab30cd486062b2229f312de4effacae6c0dc28fc4821902f','message':'devaggcurvebalancerfn-applease: Error in AppLeaseStarter task. Exception: System.NullReferenceException: Object reference not set to an instance of an object.\n at DurableTask.AzureStorage.Partitioning.LeaseCollectionBalancer`1.RenewLeaseAsync(T lease) in //src/DurableTask.AzureStorage/Partitioning/LeaseCollectionBalancer.cs:line 591\n at DurableTask.AzureStorage.Partitioning.LeaseCollectionBalancer`1.InitializeAsync() in //src/DurableTask.AzureStorage/Partitioning/LeaseCollectionBalancer.cs:line 92\n at DurableTask.AzureStorage.Partitioning.SafePartitionManager.DurableTask.AzureStorage.Partitioning.IPartitionManager.StartAsync() in //src/DurableTask.AzureStorage/Partitioning/SafePartitionManager.cs:line 127\n at DurableTask.AzureStorage.Partitioning.AppLeaseManager.StartAppLeaseAsync() in //src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 250\n at DurableTask.AzureStorage.Partitioning.AppLeaseManager.AppLeaseManagerStarter(CancellationToken cancellationToken) in //src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 140','category':'DurableTask.AzureStorage','hostVersion':'4.1.1.1','hostInstanceId':'6e4f9e98-a788-4e21-abba-377655c17d0c','level':'Error','levelId':4,'processId':18,'eventId':122,'eventName':'PartitionManagerError'}",03/18/2023 19:39:38

2023-03-18T19:39:38.837936498Z fail: DurableTask.AzureStorage[122]

2023-03-18T19:39:38.837973498Z devaggcurvebalancerfn-applease: Error in AppLeaseStarter task. Exception: System.InvalidOperationException: AppLeaseManager has already started

2023-03-18T19:39:38.837979998Z at DurableTask.AzureStorage.Partitioning.AppLeaseManager.StartAppLeaseAsync() in /_/src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 245

2023-03-18T19:39:38.837984297Z at DurableTask.AzureStorage.Partitioning.AppLeaseManager.AppLeaseManagerStarter(CancellationToken cancellationToken) in /_/src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 140

2023-03-18T19:39:38.838001197Z MS_FUNCTION_AZURE_MONITOREVENT 2,dev-agg-curve-balancer-fn.azurewebsites.net,Microsoft.Web/sites/functions/log,FunctionAppLogs,,"{'appName':'dev-agg-curve-balancer-fn','roleInstance':'e5cdd3a2b3ea6b1bab30cd486062b2229f312de4effacae6c0dc28fc4821902f','message':'devaggcurvebalancerfn-applease: Error in AppLeaseStarter task. Exception: System.InvalidOperationException: AppLeaseManager has already started\n at DurableTask.AzureStorage.Partitioning.AppLeaseManager.StartAppLeaseAsync() in //src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 245\n at DurableTask.AzureStorage.Partitioning.AppLeaseManager.AppLeaseManagerStarter(CancellationToken cancellationToken) in /_/src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 140','category':'DurableTask.AzureStorage','hostVersion':'4.1.1.1','hostInstanceId':'6e4f9e98-a788-4e21-abba-377655c17d0c','level':'Error','levelId':4,'processId':18,'eventId':122,'eventName':'PartitionManagerError'}",03/18/2023 19:39:38

2023-03-18T19:39:38.838020397Z MS_FUNCTION_AZURE_MONITOREVENT 2,dev-agg-curve-balancer-fn.azurewebsites.net,Microsoft.Web/sites/functions/log,FunctionAppLogs,,"{'appName':'dev-agg-curve-balancer-fn','roleInstance':'e5cdd3a2b3ea6b1bab30cd486062b2229f312de4effacae6c0dc28fc4821902f','message':'devaggcurvebalancerfn-applease: Error in AppLeaseStarter task. Exception: System.InvalidOperationException: AppLeaseManager has already started\n at DurableTask.AzureStorage.Partitioning.AppLeaseManager.StartAppLeaseAsync() in //src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 245\n at DurableTask.AzureStorage.Partitioning.AppLeaseManager.AppLeaseManagerStarter(CancellationToken cancellationToken) in /_/src/DurableTask.AzureStorage/Partitioning/AppLeaseManager.cs:line 140','category':'DurableTask.AzureStorage','hostVersion':'4.1.1.1','hostInstanceId':'6e4f9e98-a788-4e21-abba-377655c17d0c','level':'Error','levelId':4,'processId':18,'eventId':122,'eventName':'PartitionManagerError'}",03/18/2023 19:39:38

gegetoth commented 1 year ago

None of the previous approaches, like killing the process with PID 18, solved the issue. Falling back to using the extension bundle seems to be the solution.

davidmrdavid commented 1 year ago

Thanks for the update @gegetoth. It appears the private package might have triggered a null-reference error. I'm patching that shortly. In the meantime, two questions:

1. Did these errors start immediately after deploying the change? Was that the case for every app you deployed this private package to?
2. Just to confirm - not using the latest private package allows you to continue using your previous manual mitigation strategy, is that correct? I realize this approach is not perfect, but I want to understand whether you were able to get out of this error loop by removing those packages.

Thank you.

davidmrdavid commented 1 year ago

I will also be providing these same questions through the official support channel.

davidmrdavid commented 1 year ago

@gegetoth: just to confirm, did you manage to get "overridableInstances" working on your host.json?

I just confirmed I can replace an existing "Pending" instance if I set my host.json to look like this:

{
  "version": "2.0",
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "excludedTypes": "Request"
      }
    }
  },
  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[3.*, 4.0.0)"
  },
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "type": "AzureStorage"
      },
      "hubName": "myownhub2",
      "overridableExistingInstanceStates": "AnyState"
    }
  }
}

The main setting you need to add is "overridableExistingInstanceStates": "AnyState", and you can test it locally. Please let me know, thanks

gegetoth commented 1 year ago

Hi @davidmrdavid,

I didn't try it because my understanding is that it would overwrite any state, even the Running one. In the end it would disable the whole singleton functionality, which was the reason we decided to use Durable Functions.

Please correct me if I misunderstand the functionality of that attribute.

davidmrdavid commented 1 year ago

Hi @gegetoth:

Yes, the setting will allow you to replace the orchestrator irrespective of its runtime state (Pending, Running, Completed, etc). However, you could do some extra validation on your end (e.g. checking the instance status) to try to ensure orchestrators in the Running state are not replaced. This is not a perfect solution, as the orchestrator may enter the Running state right after your check, but this could be acceptable in some workloads. In any case, I just wanted to make sure you were aware that this is a possibility.
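
A minimal sketch of that pattern in a Python client function is below; it assumes overridableExistingInstanceStates is set to AnyState in host.json, and the instance id and orchestrator name are placeholders rather than names from this app. As noted above, the check is only best-effort.

import logging
import azure.durable_functions as df

INSTANCE_ID = "my-singleton-orchestrator"  # placeholder singleton instance id

async def start_or_replace(client: df.DurableOrchestrationClient) -> None:
    # Best-effort guard: only skip replacement when the instance is actively Running.
    status = await client.get_status(INSTANCE_ID)
    if status is not None and status.runtime_status == df.OrchestrationRuntimeStatus.Running:
        logging.info("Instance %s is Running; leaving it alone.", INSTANCE_ID)
        return

    # With "overridableExistingInstanceStates": "AnyState" in host.json, this call
    # replaces the existing instance even if it is stuck in Pending.
    instance_id = await client.start_new("my_orchestrator", instance_id=INSTANCE_ID)
    logging.info("Started orchestration with ID = '%s'.", instance_id)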

gegetoth commented 1 year ago

Hi @davidmrdavid ,

I am trying the useAppLease: false flag. Will let you know if it changes anything.
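
For reference, a sketch of that change in host.json, assuming useAppLease is a direct property of the durableTask block:

{
  "extensions": {
    "durableTask": {
      "useAppLease": false
    }
  }
}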

I noticed that when there are messages in the storage queue, I cannot query the messages on the portal. It lists 0 out of all the messages:

(screenshot)

Can you tell me what it means when a message is in the queue but I cannot list it?

cgillum commented 1 year ago

@gegetoth this means that your app has "locked" these messages in order to process them. In Azure Storage, an app "locks" a message by making it invisible. That's why in the screenshot you aren't able to see the 12 messages that are in the queue.

gegetoth commented 1 year ago

Hi @cgillum,

Thank you for your response.

When should the app release the locks? How does a message become visible again if an entity fails during its run? I can see that the number of invisible messages has increased since I took that screenshot.

cgillum commented 1 year ago

Normally the “locks” on these messages are never released, but instead the message is deleted when the processing is complete. Even an entity failing with an error counts as a success, so the message will be deleted. Only in the event of a major failure, like a process restart, will the message become visible again (it takes 5 minutes before the default visibility timeout expires).

If the number of queue messages increases, it means there are more events to be processed by your orchestrations and/or entities.

gegetoth commented 1 year ago

In general the orchestrator should not run for more than 2 minutes, but for some reason there are runs that take way longer, like 6 or 10+ minutes. So in this scenario, because the runtime is longer than the visibility timeout, the message becomes visible again and is picked up by an entity again, right? I guess that is why I can see the dequeue count in the storage account control queue go up to 10 or more.

I'm still trying to figure out why some runs end up taking way longer than expected. Could you help me understand what happens between the state being set to started and the execution of the function's code?

(screenshot)

I have noticed that the default value for lockAcquisitionTimeout in the singleton block of host.json is quite long:

(screenshot)

Is that the expected default value?

gegetoth commented 1 year ago

Here is an example of a long-running one, according to the records in the storage account:

(screenshot)

But I can't find the same execution time on the portal under invocations. The one with the same start time, 17:37:00, has a duration of 230ms.

(screenshot)

Until the point it was completed in the storage account table, it showed Pending. The next invocation went through and started the execution; I can see the logs in the stream, but I wonder what the total delay there means? Where does it come from?

(screenshot)

cgillum commented 1 year ago

The discrepancy you're seeing might be because there are multiple processes involved which track time differently. For Python, there is the functions host process, which is where the triggers run, and then there's the python process, which is where your code executes.

[functions_host_process] --> (gRPC) --> [python_process]

The process for invoking a function involves the following steps:

  1. The functions host process receives a message from one of the task hub queues.
  2. The functions host process invokes the python worker to execute the corresponding function code
  3. The functions host receives the result of the function execution and updates the orchestration history accordingly.

For your specific example, steps 1-3 are taking 17 minutes. The 230ms is probably the time that your function code actually runs. The remaining time is most likely time spent waiting for a Python thread to become available inside the python process to execute your code. Depending on your version of Python and your app configuration, you might be running on a single-threaded Python worker, in which case any concurrent operations will be blocked waiting for the single Python thread to finish executing other function executions.

This is actually a really common problem that's documented here. There are several workarounds that you could consider:

  1. Increase the number of python threads as described in the article I linked to.
  2. Increase the number of python worker processes, also described in that article.
  3. Decrease the function concurrency in your app to match the python concurrency by changing the Durable Functions-specific concurrency throttles described here and here. Decreasing concurrency allows orchestration and activity invocations to be more effectively load-balanced to other app instances, and can help create a signal to the scale infrastructure (if you're using the Consumption or Elastic Premium SKUs) that your app needs to scale out.

Try adjusting some of these configuration values to see if that improves the reported end-to-end latencies of your functions.
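
As a sketch, the Durable Functions-specific throttles mentioned in option 3 live in host.json under durableTask, alongside the maxConcurrentOrchestratorFunctions value already present in the host.json shared earlier in this thread (the numbers below are illustrative, not recommendations):

{
  "extensions": {
    "durableTask": {
      "maxConcurrentOrchestratorFunctions": 2,
      "maxConcurrentActivityFunctions": 2
    }
  }
}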

gegetoth commented 1 year ago

@cgillum thank you for the detailed explanation. I have already tried fine-tuning the functions but will revisit it, although I am not sure the issue is caused by busy threads. One of the functions currently stuck in Pending has no workload at all. There are 2 processes with 2 threads each and they don't seem busy.

(screenshot)

davidmrdavid commented 1 year ago

Hi @gegetoth:

Just for clarity, there are two different kinds of stuck orchestrations being discussed in this thread.

The first case, which Chris' suggestion tackles, refers to a performance tuning issue: your orchestrator is "waiting" to have a worker available to execute on. These orchestrators may take a few extra minutes to get into the "Running" state, but they should eventually execute.

The other issue, which I've been investigating, is the one where your application is getting stuck during a normal shutdown operation. In this scenario, your app is truly stuck and will require manual intervention (such as restarting the app / terminating the process) to recover. These are the cases where you'll see an orchestrator stuck in the Pending state for hours / days.

The useAppLease-based mitigation is to address the second issue. I'm curious to know if you've found that disabling the app lease feature appears to prevent the second scenario: orchestrators that appear to be stuck indefinitely. Please let me know and if the mitigation did not work, please give me an example instanceID of such an orchestrator, the app it ran on, and the time range in UTC for which it was stuck. Thanks!

gegetoth commented 1 year ago

Hi @davidmrdavid ,

I made the change and deployed. I will monitor the functions.

microsoft-github-policy-service[bot] commented 8 months ago

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.