Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions

Netherite backend: Client request was cancelled because host is shutting down. #2603

Closed vany0114 closed 6 months ago

vany0114 commented 1 year ago

Description

@davidmrdavid as discussed in https://github.com/Azure/azure-functions-durable-extension/issues/2534, I'm opening a separate issue to troubleshoot this problem. It is related to the original one: it's a consequence/side effect of the mitigation that prevents an orchestrator from getting stuck for hours while locking an entity.

Expected behavior

Signal the entity, raise the event, or start up a new orchestration normally.

Actual behavior

When the mitigation happens, what I've seen is that when the client tries to signal an entity (one of the ones we lock), raise an event, or even start a new orchestration, it fails with a System.OperationCanceledException that says: Client request was cancelled because host is shutting down. Any ideas?

We have a retry strategy in place with an exponential backoff that waits up to 2.5 secs, but the issue seems to last longer than that (all of those operations failed after ~2.6 secs). Since this is an Event Grid trigger, the EG retry strategy comes into play after that, but I would need to review whether the EG retries succeeded or not, because I'm not sure how long this issue can last; as you can see, when that error appeared there were intermittent OperationCanceledException errors over a span of ~9 hours.

Known workarounds

None


Screenshots

image

image


vany0114 commented 1 year ago

Just for clarity, when you say "I've noticed it too", what does "it" refer to? Does it refer to the mitigation working

Hi @davidmrdavid, yes, that's what I was referring to; the worst latency I've seen locking an entity is ~130 secs.

However, if this issue seems different from the original problem described in this thread, it would be best for clarity if you could create a new ticket for it!

It's a different error, but it's a side effect/consequence of the original one (I've brought it up in previous messages in that thread). Any time we were experiencing the orchestrator getting stuck, this other issue always showed up, and it was preventing us from creating new orders since we weren't able to start new orchestrations or signal the lock entities. So it seems that while the mitigation is happening we get this other intermittent error.

Some operation ids:

Error details:

System.OperationCanceledException: Client request was cancelled because host is shutting down.
   at DurableTask.Netherite.Client.PerformRequestWithTimeout(IClientRequestEvent request) in /_/src/DurableTask.Netherite/OrchestrationService/Client.cs:line 259
   at DurableTask.Netherite.Client.SendTaskOrchestrationMessageBatchAsync(UInt32 partitionId, IEnumerable`1 messages) in /_/src/DurableTask.Netherite/OrchestrationService/Client.cs:line 511
   at DurableTask.Netherite.NetheriteOrchestrationService.DurableTask.Core.IOrchestrationServiceClient.SendTaskOrchestrationMessageAsync(TaskMessage message)
   at DurableTask.Core.TaskHubClient.RaiseEventAsync(OrchestrationInstance orchestrationInstance, String eventName, Object eventData) in /_/src/DurableTask.Core/TaskHubClient.cs:line 695
   at Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableClient.SignalEntityAsyncInternal(DurableClient durableClient, String hubName, EntityId entityId, Nullable`1 scheduledTimeUtc, String operationName, Object operationInput) in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\ContextImplementations\DurableClient.cs:line 344
   at Polly.AsyncPolicy.<>c__DisplayClass40_0.<<ImplementationAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at Polly.Retry.AsyncRetryEngine.ImplementationAsync[TResult](Func`3 action, Context context, CancellationToken cancellationToken, ExceptionPredicates shouldRetryExceptionPredicates, ResultPredicates`1 shouldRetryResultPredicates, Func`5 onRetryAsync, Int32 permittedRetryCount, IEnumerable`1 sleepDurationsEnumerable, Func`4 sleepDurationProvider, Boolean continueOnCapturedContext)
   at Polly.AsyncPolicy.ExecuteAsync(Func`3 action, Context context, CancellationToken cancellationToken, Boolean continueOnCapturedContext)
   at Curbit.Orders.Saga.Triggers.ProviderOrderProgressedIntegrationEventHandler(EventGridEvent eventGridEvent, IDurableEntityClient starter, ILogger logger) in D:\a\1\sourceRepo\src\dotnet\saga\Curbit.Orders.Orchestrators\Triggers.cs:line 44
   at Microsoft.Azure.WebJobs.Host.Executors.VoidTaskMethodInvoker`2.InvokeAsync(TReflected instance, Object[] arguments) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\VoidTaskMethodInvoker.cs:line 20
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2.InvokeAsync(Object instance, Object[] arguments) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionInvoker.cs:line 52
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.InvokeWithTimeoutAsync(IFunctionInvoker invoker, ParameterHelper parameterHelper, CancellationTokenSource timeoutTokenSource, CancellationTokenSource functionCancellationTokenSource, Boolean throwOnTimeout, TimeSpan timerInterval, IFunctionInstance instance) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 581
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithWatchersAsync(IFunctionInstanceEx instance, ParameterHelper parameterHelper, ILogger logger, CancellationTokenSource functionCancellationTokenSource) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 527
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 306
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 352
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.TryExecuteAsync(IFunctionInstance functionInstance, CancellationToken cancellationToken) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 108

Please let me know if that information helps.

vany0114 commented 1 year ago

@davidmrdavid this is getting worse. While it's still intermittent, it's happening more often than before; yesterday there were a bunch of these errors throughout the day, as you can see below.

image

As you can see the incoming vs the outgoing messages don't match in the Event Hub. image

And orchestration latency increased from ~3 secs to ~100 secs. image

davidmrdavid commented 1 year ago

Thanks for creating this issue. I'm sharing this internally so it gets an owner, as I'm unfortunately not personally available for the rest of this week.

vany0114 commented 1 year ago

@davidmrdavid really appreciate it! I'll stay tuned.

sebastianburckhardt commented 1 year ago

it fails with a System.OperationCanceledException that says: Client request was cancelled because host is shutting down. Any ideas?

This error message is indicating that the entire host on which this client is running is shutting down and can therefore not process any more requests. The intended way to handle this situation is to retry the request on a different host. In particular, a local retry loop does not help - the request has to be retried on a different host.

Based on your description, I would assume that the EG trigger retry should take care of that.
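To illustrate that guidance, here is a minimal sketch (hypothetical function and event names, assuming the in-process model with the Microsoft.Azure.WebJobs.Extensions.EventGrid binding and the Azure.Messaging.EventGrid event type, not this app's actual code): don't retry the "host is shutting down" cancellation locally; let the invocation fail so the Event Grid delivery retry redelivers the event, potentially to a different host.

using System;
using System.Threading.Tasks;
using Azure.Messaging.EventGrid;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;

public static class OrderEventTriggers
{
    [FunctionName("OrderProgressedHandler")] // hypothetical trigger
    public static async Task OrderProgressedHandler(
        [EventGridTrigger] EventGridEvent eventGridEvent,
        [DurableClient] IDurableOrchestrationClient client,
        ILogger logger)
    {
        try
        {
            // Assumes the orchestration instance id travels in the event subject.
            await client.RaiseEventAsync(eventGridEvent.Subject, "OrderProgressed", eventGridEvent.Data.ToString());
        }
        catch (OperationCanceledException ex) when (ex.Message.Contains("host is shutting down"))
        {
            // A local retry loop can't help here: this host is going away.
            // Rethrow so the invocation fails and Event Grid redelivers the event,
            // which may land on a different (healthy) host.
            logger.LogWarning(ex, "Host is shutting down; failing the invocation so Event Grid retries delivery.");
            throw;
        }
    }
}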

vany0114 commented 1 year ago

@sebastianburckhardt Thanks for the explanation. Yes, I've reviewed some of the affected requests and indeed the EG trigger retry is doing its thing. On the other hand, do you know why that error is happening so often?

davidmrdavid commented 1 year ago

Hey @vany0114, I wanted to provide an update here.

It appears to me that, in most cases, when you receive the error "Client request was cancelled because host is shutting down.", the VM where this error was emitted previously logged the following message: "A function timeout has occurred. Host is shutting down.".

In other words, the timeline of events seems to be as follows:

  1. A given Azure Function, such as OrderDeliveredIntegrationEventHandler, starts executing
  2. A minute later, the function times out
  3. The Host starts to shut down in response to the timeout. This is because a timeout may suggest that the VM is unhealthy, so we restart it for proactive recovery's sake.
  4. While shutting down, any other in-progress functions will fail with the error "Client request was cancelled because host is shutting down"

So, to answer your question: you are receiving the "client request cancelled" errors because some azure function invocations are timing out, which in turn triggers a VM restart.

So now the question becomes: why are function invocations timing out? I don't know this for certain, but I have some theories that are partially backed by our telemetry.

In particular, I see that during periods of high failure, your app is exceeding the 'healthy' CPU usage threshold: it's using over 90% of the CPU available on your VMs. As you can imagine, this can slow down processing, which can in turn lead to timeouts. Furthermore, on our other thread, I theorized that perhaps high CPU utilization was to blame for the periodic hangs in the "compaction" step of checkpointing. I also see evidence of compounding effects: the timeouts cause VMs to restart, which creates a backlog of processing, which means the backlog needs to be consumed on the next VM, which creates delays for new requests, which can cause new timeouts, and thus a cycle.

At this point, I think the ideal next step is to provide your app with more CPU capacity, which I think has a good chance of stabilizing it. I don't know for certain that this is the root cause of the timeouts, but so far I don't see any logical errors in our telemetry that explain the problem otherwise, so I think this is our next best bet. I'm also hesitant to provide a new private package with a more aggressive mitigation (i.e., one that fires more quickly), as the mitigation is to terminate processing for a durable partition on a given VM, which adds some delay to processing and therefore risks exacerbating the timeouts.

@vany0114: Are you able to upgrade/scale up your application to utilize more powerful VMs? In particular, VMs with more CPU capacity (e.g., more cores) should help improve the stability of your app. It's not necessary to go to the highest plan; just a higher one should help. Again, I don't know for certain that this will make the problem disappear, but it should remove one of the sources of instability I'm seeing. I also recommend proactively monitoring your CPU utilization to ensure it remains under healthy limits.

vany0114 commented 1 year ago

@davidmrdavid Thanks for the detailed update! I'll move to the premium plan then because right now we're using a consumption plan.

vany0114 commented 1 year ago

@davidmrdavid I'm reviewing this issue, particularly in our dev environment, which is where we get it most often. As you can see, almost all requests are failing because of the timeout and shutting-down issues.

image

I am bringing this up because the traffic in our development environment is not as high as in production; actually, it's very low.

Development: image

Production: image

In terms of infrastructure, the only difference is the throughput units the Event Hub NS uses in dev vs. production (1 and 4 respectively), since both function apps (dev and prod) are deployed on a consumption plan.

So this behavior seems interesting to me because I wouldn't expect any issues in our development environment given the low number of requests we process there, but it's the opposite: the behavior in our dev environment is actually worse than in production. As a matter of fact, we reset it (func app, storage, and Event Hub NS) last week and look at it now, it's totally unhealthy. For instance, look at the behavior of the EH messages.

image

BTW, here are some insights regarding memory and CPU consumption over the last 24 hours. There is indeed high memory consumption during certain hours (which is something I'd like to review to try to reduce allocations on our end), but it's not that high the whole day. On the other hand, the CPU seems healthy.

Memory consumption: image

CPU consumption: image

I hope this information helps you with the diagnosis.

davidmrdavid commented 1 year ago

Just saw this, @vany0114. I'll take a comparative look at the dev environment tomorrow.

davidmrdavid commented 1 year ago

In the meantime: any context on what could be causing those large memory allocations? Are you aware of any large allocations on your end, or is that surprising?

vany0114 commented 1 year ago

In the meantime: any context on what could be causing those large memory allocations? Are you aware of any large allocations on your end, or is that surprising?

@davidmrdavid Not sure yet, I need to look into it, but I think it might be the main input that the orchestration is created with (which is also persisted as part of the durable entity state, and may have an impact when it gets serialized/deserialized). But again, I'm not sure; I'll need to do a memory profile or something to get more insights.

vany0114 commented 1 year ago

@davidmrdavid today I redeployed the func app to move it to a premium plan and also redeployed the Event Hub NS to use a standard tier again; however, I've been seeing this error a lot ever since:

An item with the same key has already been added. Key: durabletasktrigger-netherite-ordersagav3

image

image

Do you know what it means?

vany0114 commented 1 year ago

FYI, I had to go back to a consumption plan due to that issue.

davidmrdavid commented 1 year ago

Thanks @vany0114.

Regarding the error: "An item with the same key has already been added. Key: durabletasktrigger-netherite-ordersagav3".

Yes, I've seen that before. My understanding is that this is a transient error in the ScaleController that is safe to ignore, although noisy. Unfortunately, it cannot be fixed on the application side; the ScaleController team needs to make a release for it to disappear.

Did you experience that error affecting the correctness of your app, or was it just polluting the Application Insights logs? Just confirming.

As for the test app: I took a preliminary look but I'm not entirely ready to share a conclusion there just yet. I'm starting an internal thread about it. I'll keep you posted.

vany0114 commented 1 year ago

Did you experience that error affecting the correctness of your app, or was it just polluting the Application Insights logs? Just confirming.

@davidmrdavid I initially thought it was just noise, but I started seeing misbehavior, basically with an ack/event that the func app waits for: even though the ack was being published in a timely manner, the func app somehow was not processing it and it just timed out, so that's why I moved back to the consumption plan. I'm curious about why I don't see that error when the app is deployed on a consumption plan 🤔

Unfortunately, it cannot be fixed on the application-side, the ScaleController team needs to make a release for it to disappear.

I was told by the Azure support engineer that I could update the extensions package since it was already fixed, but based on your answer it seems I cannot, so I'll wait until you give me the green light to do so; then I'll try moving the func app to a premium plan again.

vany0114 commented 1 year ago


@davidmrdavid FWIW here is an example of an affected instance that I was referring to: {"siteUID":"24948821-5187-436c-a080-b0994b10d39e","checkNumber":"157","openedAt":"2023-10-10","courseNumber":"1"}

image

davidmrdavid commented 1 year ago

@vany0114: Regarding that ScaleController log, that will be fixed once their code takes on a dependency on a Netherite release with this fix (which I just approved): https://github.com/microsoft/durabletask-netherite/pull/316#pullrequestreview-1675312305 .

Again, my understanding is that this bug can be safely ignored, but I need to double-check the affected instance you shared. Looking into that now; I could have been wrong.

I did look into your staging app. I don't have a clear picture of what's wrong with it yet, but I do see a lot of OOMs. At first, I thought maybe you had some kind of persistent error loop due to trying to deserialize too large of a blob, but I see the app is slowly healing over the past day, which invalidates that theory. Did you do anything to your staging app in the past day to help it recover?

vany0114 commented 1 year ago

Did you do anything to your staging app in the past day to help it recover?

Yes, I had to reset it because our dev environment wasn't working because of that. Could the fact that the func app shares the same service plan with other apps have something to do with it? In our dev environment, most of the func apps are deployed in the same service plan, FWIW.

davidmrdavid commented 1 year ago

@vany0114: ok, it's good to know it was reset, as that helps validate that my initial findings were probably not entirely off. I'll continue my investigation while ignoring the latest, healthier, data.

As an aside: There's an incoming Netherite release that I think should include the mitigation in your preview package, as well as the fix for the ScaleController error. I'll keep you posted on that.

Regarding your question on using the same app service plan: I think it's fine to do this.

vany0114 commented 1 year ago

@davidmrdavid I'd like to know your thoughts on what the next course of action would be to fix this problem definitively. This is hurting us really badly, and resetting the app every week is not scalable; for instance, the day is just starting today and it's already timing out... I've tried everything you have advised (which I do appreciate a lot) and the issue persists, so I don't really know what else to do to stabilize our app 😞

image

image

davidmrdavid commented 1 year ago

@vany0114: Yes, I want us to get your application to a better state soon. I'm also concerned about its current state.

Here are my thoughts:

1) I'm confident that moving to more powerful VMs should partially help. Naturally, more CPU and more memory will reduce the chance of OOMs and high-CPU warnings. I realize that you were seeing errors in the Scale Controller when you did this; I believe that I can provide you with instructions to circumvent the errors you were seeing (even though I think the errors are benign), so I'll work to get you those ASAP.

If you decide to scale up to Elastic Premium, please "reset your app"/ clear your taskhub after doing this. This will ensure that the newer VMs can start in a healthy state and, as such, be able to "keep up with the incoming work" instead of trying to do that while dealing with a pre-existing backlog.

2) Your app is shutting down VMs a lot. Our logs indicate that this is due to function timeouts. I think these shutdowns are creating compounding effects where: if a partition is busy, it will cause a timeout, which will cause partitions in that VM to fail to respond to requests, which will cause more timeouts in the client's VMs, and so on in a potentially self-sustaining cycle.

I think it's critical that we stop these timeout-induced shutdowns. @vany0114: can you change your code so that you manage the timeout yourself in the code instead of relying on the Functions runtime timeout setting? This way, when a request times out, it will not cause the VM to restart, which should help with the instability.

Addressing this will also make investigating easier for us. Right now, the timeout-induced shutdowns are extremely noisy, which make it difficult to identify the root problem. This is my main recommendation.

3) Can you please set the Netherite setting "TakeStateCheckpointWhenStoppingPartition" to false in host.json? This is a setting inside the storageProvider segment when you select Netherite as your backend. This may reduce the incidence of OOMs when your VMs are shutting down (due to timeouts) as we won't try to checkpoint partitions as part of shutdown, which is very memory intensive.
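For reference, a minimal host.json sketch showing where that setting lives. The taskhub name comes from this thread; the connection-name values are assumptions and should match your app settings:

{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "ordersagav3",
      "storageProvider": {
        "type": "Netherite",
        "StorageConnectionName": "AzureWebJobsStorage",
        "EventHubsConnectionName": "EventHubsConnection",
        "TakeStateCheckpointWhenStoppingPartition": false
      }
    }
  }
}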

Those are my immediate thoughts with your current setup. Finally, I think it's key to make sure your application isn't trying to save moderately large state in entities (or pass it through any Durable APIs), as per this best practice: https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-best-practice-reference#keep-function-inputs-and-outputs-as-small-as-possible . If you have reason to suspect your code may be violating this constraint, I think this would also be an area to invest in. If you have moderately large data being communicated through DF APIs, please use indirection to materialize it, as described in the linked doc.
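To make the indirection idea concrete, here is a minimal sketch (hypothetical names and container; not this app's actual code) of storing a large payload in blob storage and passing only the blob name through the Durable APIs:

using System;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Newtonsoft.Json;

public static class LargePayloadIndirection
{
    // Hypothetical container name; assumes the container already exists and
    // that the connection string is available as AzureWebJobsStorage.
    private const string ContainerName = "orchestration-payloads";

    private static BlobContainerClient Container =>
        new BlobContainerClient(Environment.GetEnvironmentVariable("AzureWebJobsStorage"), ContainerName);

    // Upload the large payload and start the orchestration with just a small blob reference.
    public static async Task<string> StartWithBlobReferenceAsync(
        IDurableOrchestrationClient client, string orchestratorName, string instanceId, object largePayload)
    {
        var blob = Container.GetBlobClient($"{instanceId}.json");
        await blob.UploadAsync(BinaryData.FromString(JsonConvert.SerializeObject(largePayload)), overwrite: true);

        // The orchestration input is now only the blob name, well under the size guideline.
        return await client.StartNewAsync(orchestratorName, instanceId, blob.Name);
    }

    // An activity re-materializes the payload only when it's actually needed.
    [FunctionName("LoadPayload")]
    public static async Task<OrderPayload> LoadPayload([ActivityTrigger] string blobName)
    {
        var content = (await Container.GetBlobClient(blobName).DownloadContentAsync()).Value.Content;
        return JsonConvert.DeserializeObject<OrderPayload>(content.ToString());
    }

    // Hypothetical payload type standing in for the real order data.
    public class OrderPayload { public string OrderId { get; set; } }
}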

vany0114 commented 1 year ago

Hi @davidmrdavid

Thanks for your thoughts. I think we can start by scaling up to Elastic Premium, so I'll wait for your instructions to circumvent the Scale Controller errors.

Re handling the timeouts in the Event Grid triggers: absolutely, that's something we can do. I just have a question: can I re-throw the timeout exception so that Event Grid can retry, or should I re-throw a different exception? I'm asking because I don't know whether re-throwing the same timeout exception will still cause the shutting-down issue.

Edit: actually it will throw the Polly TimeoutRejectedException, so I think we should be good there.

Can you please set the Netherite setting "TakeStateCheckpointWhenStoppingPartition" to false in host.json? This is a setting inside the storageProvider segment when you select Netherite as your backend. This may reduce the incidence of OOMs when your VMs are shutting down (due to timeouts) as we won't try to checkpoint partitions as part of shutdown, which is very memory intensive.

Will do

As for the improvement to avoid saving large state in the entities, that's something we'll definitely do, but it will take us some time since it's not a trivial refactor on our end.

vany0114 commented 1 year ago

I realize that you were seeing errors in the Scale Controller when you did this; I believe that I can provide you with instructions to circumvent the errors you were seeing

@davidmrdavid since I'm planning on making a release tomorrow, can you please provide these instructions? FYI, the release will contain the timeout handling as well as the TakeStateCheckpointWhenStoppingPartition setting change.

davidmrdavid commented 1 year ago

@vany0114: Essentially, I want to provide you with a private package that fixes the error through this PR but first I need to make sure the latest revisions to the mitigation I provided you pass our validation tests, and that's taking me more time than anticipated. Once I provide you with that package, you'll need to enable "runtime scale monitoring" which you can toggle on and off under: "Configuration > Functions runtime settings > Runtime Scale Monitoring". This should effectively make it so that the Netherite package used to make scaling decisions is the one in your app, not the one referenced by the Scale Controller component. However, I'm still working on that private package, so this step will have to wait.

In the meantime, please do deploy the timeout handling and the setting change nonetheless, that should help independently. Once you do, please let me know so I can monitor that the effect took place.

davidmrdavid commented 1 year ago

All that said, as I mentioned before, our understanding is that this ScaleController error is benign / can be safely ignored. It was originally reported here. From inspecting the code, and also from the other user reports in the aforementioned thread, I'm confident that this error should not prevent your app from scaling correctly.

So while I can provide you with instructions to circumvent the error (it's just going to take me a bit more time), I think the error should have no further negative effect on your app. Therefore, I would still recommend to scale up and ignore this error for the time being.

vany0114 commented 1 year ago

In the meantime, please do deploy the timeout handling and the setting change nonetheless, that should help independently

@davidmrdavid re ☝️ that, I want to bring this to your attention since it seems weird to me: even though I'm handling the timeout errors, I'm still seeing FunctionTimeoutExceptions and OperationCanceledExceptions, whereas I would expect to see a TimeoutRejectedException, which is the one thrown by Polly when it times out. (This is in our development environment.)

image

This is how it's implemented. As you can see, the resilience strategy is only applied to the durable calls; it does not wrap, for example, the call that deserializes the object or the call to GetOrchestratorId, which only serializes its parameter.

// 30-second timeout. Note: Policy.TimeoutAsync(TimeSpan) defaults to Polly's optimistic mode,
// which relies on the wrapped delegate observing a CancellationToken, something the durable
// client APIs called below do not accept.
private static readonly AsyncPolicy TimeoutStrategy = Policy.TimeoutAsync(TimeSpan.FromSeconds(30));

// Retry up to 5 times, 500 ms apart, if the client's semaphore has been disposed.
private static readonly AsyncPolicy RetryStrategy = Policy
    .Handle<ObjectDisposedException>() // The semaphore has been disposed.
    .WaitAndRetryAsync(5, retryNumber => TimeSpan.FromMilliseconds(500));

// The timeout is the outer policy, wrapping the retries.
private static readonly AsyncPolicy ResilienceStrategy = Policy.WrapAsync(TimeoutStrategy, RetryStrategy);

[FunctionName("ProviderOrderPlacedIntegrationHandler")]
public static async Task ProviderOrderPlacedIntegrationHandler([EventGridTrigger] EventGridEvent eventGridEvent, [DurableClient] IDurableOrchestrationClient starter, ILogger logger)
{
    var @event = JsonConvert.DeserializeObject<ProviderOrderPlacedIntegrationEvent>(eventGridEvent.Data.ToString());
    await StartInstance(starter, @event, Constants.ORDER_ORCHESTRATOR, GetOrchestratorId(@event.SourceIdentities.Order), logger);
}

private static async Task RaiseEventAsync<T>(IDurableOrchestrationClient context, T integrationEvent, string instanceId, string eventName, ILogger logger)
{
    try
    {
        await ResilienceStrategy.ExecuteAsync(() => context.RaiseEventAsync(instanceId, eventName, integrationEvent));
    }
    catch (ArgumentException ex) when (ex.Message.Contains("No instance with ID"))
    {
        logger.LogWarning(ex.Message);
    }
    catch (InvalidOperationException ex) when (ex.Message.Contains("Cannot raise event"))
    {
        logger.LogWarning(ex.Message);
    }
}

private static async Task StartInstance<T>(IDurableOrchestrationClient context, T data, string orchestratorName, string instanceId, ILogger logger)
{
    try
    {
        await ResilienceStrategy.ExecuteAsync(() => context.StartNewAsync(orchestratorName, instanceId, data));
        logger.LogInformation($"Orchestrator {orchestratorName} started, instance id: {instanceId}.");
    }
    catch (Exception ex) when (ex.Message.Contains("already exists"))
    {
        // Catches these kind of errors: An Orchestration instance with the status {x} already exists.
        logger.LogWarning(ex.Message);
    }
}
davidmrdavid commented 1 year ago

Taking a look now btw ^. I assume this code worked locally?

vany0114 commented 1 year ago

Taking a look now btw ^. I assume this code worked locally?

I was taking a look, and it seems I had to use a pessimistic TimeoutStrategy since the durable APIs don't provide a way to pass a cancellation token.
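For reference, a minimal sketch of that change, assuming Polly's pessimistic timeout overload (illustrative names; the rest of the policy wiring stays as in the earlier snippet):

using System;
using Polly;
using Polly.Timeout;

public static class ResiliencePolicies
{
    // Pessimistic mode abandons the awaited call after 30 seconds and throws
    // TimeoutRejectedException, instead of relying on cooperative cancellation,
    // which the durable client APIs can't honor since they take no CancellationToken.
    public static readonly AsyncPolicy Timeout =
        Policy.TimeoutAsync(TimeSpan.FromSeconds(30), TimeoutStrategy.Pessimistic);

    public static readonly AsyncPolicy Retry = Policy
        .Handle<ObjectDisposedException>() // The semaphore has been disposed.
        .WaitAndRetryAsync(5, _ => TimeSpan.FromMilliseconds(500));

    public static readonly AsyncPolicy Resilience = Policy.WrapAsync(Timeout, Retry);
}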

davidmrdavid commented 1 year ago

@vany0114: thanks. So just to clarify: did you get this to work locally with the pessimistic timeout strategy? I tried your previous snippet and it was not working as expected locally for me. I'm not terribly familiar with Polly, so I'm also researching that API.

davidmrdavid commented 1 year ago

Ok I tested locally with a pessimistic timeout strategy and can confirm that this had the desired effect of throwing an exception and ending the function invocation.

@vany0114: were you able to deploy this and see if it had an effect on reducing your timeout-induced restarts? Also, have you scaled up to Elastic Premium for more VM CPU and memory? Please let me know so I can monitor your app's behavior and health.

I also wanted to call out that we finally merged the mitigation we provided you into the main branch of the Netherite package, here: https://github.com/microsoft/durabletask-netherite/pull/301/files .

With that, I should be able to provide you with a package that can circumvent those Scale Controller errors, which, again, I recommend ignoring for now and scaling up regardless since they are benign from our understanding. Apologies for the delay on that; we've had some unexpected blockers preventing us from moving faster there. I still need to confirm all the blockers are dealt with, as I don't want to provide you with a faulty package.

Finally, I wanted to call out that you can open another support ticket with us if you need to escalate further in a more official manner. If you do open one, it will probably still be me on the other end once the ticket is escalated to the PG team, but please do know that I'm treating this already with the same priority as an official support ticket. Still, I want to call out you have that option.

vany0114 commented 1 year ago

@davidmrdavid Thanks for the update!

were you able to deploy this and see if it had an effect on reducing your timeout-induced restarts?

I did, but in our development environment. However, I don't see any real improvement: the OperationCanceledException is almost gone and the FunctionTimeoutException is completely gone, but it's almost always timing out (after 30 secs, which is the timeout I'm using), meaning it's not able to start new instances, raise events, etc.

image

Also, have you scaled up to Elastic Premium for more VM CPU and memory?

Not yet as I'd like to wait for the next mitigation package.

again, I recommend ignoring for now and scaling up regardless since they are benign from our understanding.

I'd rather wait for the new package, since the last time I scaled up it caused issues and weird unexpected behavior. Maybe I'm being paranoid, but since this is critical I don't want to take any chances.

Finally, I wanted to call out that you can open another support ticket with us if you need to escalate further in a more official manner. If you do open one, it will probably still be me on the other end once the ticket is escalated to the PG team, but please do know that I'm treating this already with the same priority as an official support ticket. Still, I want to call out you have that option.

Re the ticket, there's still one open, and yeah I'd really like to escalate this further since I think we need to get to the bottom of it. Scaling up to a premium tier is acceptable for now on our end, but all of our infrastructure and part of our business model is heavily backed by serverless infrastructure, so we need to be able to keep running this func app (more than any other one, this is core for us) on a consumption plan. Of course, we can improve our implementation to make better use of compute resources, but still.

BTW, what is the maximum or recommended size to persist as part of the durable entity state? And also for the inputs passed through the durable function APIs?

davidmrdavid commented 1 year ago

Re the ticket, there's still one open, and yeah I'd really like to escalate this further

I see. Just for transparency - that ticket isn't currently on our queue (the product group / engineering team incident queue) so you may be able to escalate it further. We did have at some point, for the original thread, a corresponding ticket in our queue, but that one got resolved when we provided you with the mitigation package that solved that first issue. You could ask your support point of contact to escalate again to the product group.

were you able to deploy this and see if it had an effect on reducing your timeout-induced restarts?

I did, but in our development environment. However, I don't see any real improvement: the OperationCanceledException is almost gone and the FunctionTimeoutException is completely gone, but it's almost always timing out (after 30 secs, which is the timeout I'm using), meaning it's not able to start new instances, raise events, etc.

I'll take a look in a few minutes. I assume this is orders-sagas-dev.

BTW, what is the maximum or recommended size to persist as part of the durable entity state? And also for the inputs passed through the durable function APIs?

As a broad guideline, try to keep inputs and outputs to DF APIs under 45 KB. In the Azure Storage backend, this is when we start considering inputs to be "large". We can broadly apply the same guideline to DF Entities.

vany0114 commented 1 year ago

I see. Just for transparency - that ticket isn't currently on our queue (the product group / engineering team incident queue) so you may be able to escalate it further. We did have at some point, for the original thread, a corresponding ticket in our queue, but that one got resolved when we provided you with the mitigation package that solved that first issue. You could ask your support point of contact to escalate again to the product group.

Will do.

I'll take a look in a few minutes. I assume this is orders-sagas-dev.

Yep, but we had to reset it again because our development environment was totally down because of that. We took the opportunity to rename it; it's now called func-orders-saga-dev

As a broad guideline, try to keep inputs and outputs to DF APIs under 45 KB. In the Azure Storage backend, this is when we start considering inputs to be "large". We can broadly apply the same guideline to DF Entities.

Same applies for Netherite backend?

davidmrdavid commented 1 year ago

Same applies for Netherite backend?

Yes. But I want to be clear that this is just a broad guideline. There really isn't a precise number that makes data "too big". The reality is that it's a function of your application's throughput relative to its inputs. But for a broad guideline, you can use this scale:

small is roughly under 30 KB, medium is roughly up to 100 KB, and anything beyond that is large.
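As a rough way to keep an eye on this, here is a small sketch (a hypothetical helper, not part of any SDK) that logs when a serialized input crosses those approximate thresholds before it's handed to a Durable API:

using System.Text;
using Microsoft.Extensions.Logging;
using Newtonsoft.Json;

public static class PayloadSizeGuard
{
    // Approximate thresholds from the guideline above (not hard limits).
    private const int SmallBytes = 30 * 1024;
    private const int MediumBytes = 100 * 1024;

    public static void WarnIfLarge(object input, string apiName, ILogger logger)
    {
        var bytes = Encoding.UTF8.GetByteCount(JsonConvert.SerializeObject(input));
        if (bytes > MediumBytes)
            logger.LogWarning("{Api} input is {Bytes} bytes (large); consider blob indirection.", apiName, bytes);
        else if (bytes > SmallBytes)
            logger.LogInformation("{Api} input is {Bytes} bytes (medium).", apiName, bytes);
    }
}

Calling this right before StartNewAsync, RaiseEventAsync, or SignalEntityAsync makes oversized inputs visible in Application Insights before they become a throughput problem.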

davidmrdavid commented 1 year ago

@vany0114: When deploying func-orders-saga-dev, did you ensure to keep 1 taskhub per EventHubs namespace, as we discussed here?

I don't see any OOMs anymore, and the rate of restarts seems fairly under control.

However, I see a lot of errors of the form: "EventHubsProcessor partitions/<..some partition Id> received packets for closed processor, discarded". I'm just curious if perhaps we're seeing another instance of the error in the aforementioned thread ^.

vany0114 commented 1 year ago

@vany0114: When deploying func-orders-saga-dev, did you ensure to keep 1 taskhub per EventHubs namespace, as we discussed here?

@davidmrdavid Yes, we're using only one, it's called ordersagav3

I don't see any OOMs anymore, and the rate of restarts seems fairly under control.

In the new func app (func-orders-saga-dev), of course; it's a brand new one, we reset everything, including the Event Hub NS. The previous one, orders-saga-dev, was totally unhealthy, as I showed you here.

However, I see a lot of errors of the form: "EventHubsProcessor partitions/<..some partition Id> received packets for closed processor, discarded". I'm just curious if perhaps we're seeing another instance of the error in the aforementioned thread ^.

Hmmm, hard to tell, because the orders-saga-dev func app doesn't exist anymore; we removed it.

vany0114 commented 1 year ago

@davidmrdavid FYI I just spotted this (the Client request was cancelled because host is shutting down issue again) in the dev environment which is weird since everything is brand new 🤔

image

I still don't understand why in dev it gets unhealthy that often and so quick.

davidmrdavid commented 1 year ago

@vany0114: According to our logs, I see the EH NS "order-saga-workaround-v2-eh-dev" being used for both orders-saga-dev and func-orders-saga-dev. Just to be triple sure: are we certain the namespace is completely new? Unless I'm misunderstanding these logs, it would appear it was used in the old app as well.

davidmrdavid commented 1 year ago

@vany0114:

https://github.com/Azure/azure-functions-durable-extension/issues/2603#issuecomment-1773523590

Looking at this now.

vany0114 commented 1 year ago

@vany0114: According to our logs, I see the EH NS "order-saga-workaround-v2-eh-dev" being used for both orders-saga-dev and func-orders-saga-dev. Just to be triple sure: are we certain the namespace is completely new? Unless I'm misunderstanding these logs, it would appear it was used in the old app as well.

Yep, it is new, I mean, we removed it, and then we created another one using the same name.

davidmrdavid commented 1 year ago

FYI I just spotted this (the Client request was cancelled because host is shutting down issue again) in the dev environment which is weird since everything is brand new 🤔

Regarding this: the log "Client request was cancelled because host is shutting down" isn't necessarily an error. If your host is shutting down for whatever reason, this will be logged occasionally. The real problem is why the host would be shutting down.

Previously, you were seeing this a lot due to timeouts. I no longer see timeout-induced restarts. I see the occasional transient failure that causes a restart (and therefore client request cancellations) but I'm not certain this is the right log signal anymore.

Yep, it is new, I mean, we removed it, and then we created another one using the same name.

I see. It's definitely confusing telemetry-wise because I see logs saying the EH is re-used, and the name is the same across two apps, but apparently the underlying resource isn't. In any case, we have a mechanism to throw away messages from the wrong taskhub, so this doesn't fully explain the issues you're seeing.

vany0114 commented 1 year ago

FWIW it was just a spike, but again, that's the new func app; the old one was constantly shutting down even after the timeout-handling change.

image

davidmrdavid commented 1 year ago

@vany0114: the latest mitigation package is now on our ADO feed here.

Its Microsoft.Azure.DurableTask.Netherite version is 1.4.1-privatesc.2.

Irrespective of whether or not you decide to use Elastic Premium, please use this package over the previous mitigation package.

To circumvent the scale controller error you received in Elastic Premium, please go to your app in the Azure portal, then go to "Configuration" > "Function runtime settings" > "Runtime scale monitoring" and set that to On. This will make the scale controller utilize your app's Netherite package to make scaling decisions, which will allow you to circumvent that error using this private package. Note that this means it is especially important that your application have sufficient CPU capacity to respond to scaling decision checks, but hopefully just upgrading to Elastic Premium should give you enough capacity.

Finally, looking at the recent rate of errors in your app (and in your last screenshot here), the app seems to be more stable as of ~5 hours ago. I'm no longer seeing the steady stream of messages from another taskhub being discarded. Please let me know if that doesn't match your observations, and let me know if you experience a sustained decrease in app performance after moving to this new package and scaling up to Elastic Premium.

Please note that occasional "Client request was cancelled because host is shutting down" logs are a part of normal operation. However, if they are sustained, then they reflect a real problem.

For that reason, I think our key error signal moving forward should be whether or not your manual timeout fires or not. If you experience a sustained increase in polly-managed timeouts, then that reflects a problem in Netherite. Otherwise, you may just be experiencing transient failure.

vany0114 commented 1 year ago

@davidmrdavid Thanks for the detailed explanation. FYI, tomorrow I'll make a release that will include that package version and also the move to the Elastic Premium tier (only for production; the func app in the development environment will stay on a consumption tier) so that we can see how it behaves 🤞 I'll keep you posted.

Would it be possible to know what enhancements/fixes the 1.4.1-privatesc.2 version contains? (I guess it's the dictionary key error.)

FWIW the "Runtime scale monitoring" option was enabled the first time I tried to scale it up.

davidmrdavid commented 1 year ago

Thanks @vany0114.

There are a few fixes in 1.4.1-privatesc.2; here's a list of them:

vany0114 commented 1 year ago

@davidmrdavid just wanted to let you know that it's been two days since the release and everything is working as expected. It's still too soon to say the issue is gone, since it used to appear after one or two weeks, but I'll keep an eye on it and keep you posted.

BTW I did scale up to an Elastic Premium tier, and the func app is now called func-orders-saga

Thanks for all your help!

vany0114 commented 1 year ago

For that reason, I think our key error signal moving forward should be whether or not your manual timeout fires or not. If you experience a sustained increase in polly-managed timeouts, then that reflects a problem in Netherite. Otherwise, you may just be experiencing transient failure.

Hi @davidmrdavid, unfortunately the manual timeout is firing again. It started happening on Sunday around 6 a.m. UTC, and as a result we've been missing orders, etc., so as you know this is very critical for us.

image

Please let me know what you need from me and what the next steps are.

davidmrdavid commented 1 year ago

Looking now. @vany0114: is this func-orders-saga?

vany0114 commented 1 year ago

Looking now. @vany0114: is this func-orders-saga?

@davidmrdavid Correct