Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions
MIT License

Netherite backend: Client request was cancelled because host is shutting down. #2603

Closed: vany0114 closed this issue 4 months ago

vany0114 commented 1 year ago

Description

@davidmrdavid As discussed in https://github.com/Azure/azure-functions-durable-extension/issues/2534, I'm opening a separate issue to troubleshoot this problem, since it is a consequence/side effect of the mitigation that prevents an orchestrator from getting stuck for hours while locking an entity.

Expected behavior

Signaling the entity, raising the event, or starting a new orchestration completes normally.

Actual behavior

When the mitigation kicks in, what I've seen is that attempts to signal an entity (one of the ones we lock), raise an event, or even start new orchestrations fail with a System.OperationCanceledException that says: Client request was cancelled because host is shutting down. Any ideas?

We have a retry strategy in place with an exponential backoff that waits up to 2.5 secs, but the issue seems to last longer than that (all of those operations failed after ~2.6 secs). Since this is an Event Grid trigger, the EG retry strategy kicks in after that, but I still need to review whether the EG retries succeeded, because I'm not sure how long the issue lasts; as you can see, when that error appeared there were intermittent OperationCanceledException errors over a span of roughly 9 hours.

Known workarounds

None

App Details

Screenshots

(screenshots attached)


davidmrdavid commented 11 months ago

@vany0114: Just a short update: I've identified "partition 11" as the currently struggling component of your application, which seems responsible for the timeouts at this time. I'm trying to identify why exactly it is failing, but it appears to be failing to restore its last checkpointed state from storage.

I need assistance from the larger team to dive deeper, and possibly more logs than we have today, but in the meantime I have no reason to suspect this will recover on its own. If you're still seeing timeouts, "resetting your app" as you've done in the past would be a good short-term step to restore stability. I realize that's not a long-term solution, of course, but I just want to get your app back to a healthy state while we investigate this latest incident.

davidmrdavid commented 11 months ago

@vany0114:

I think to debug this further, we need to increase the verbosity of your logs, which should give us better statistics on your storage accesses, the size of the data being read, etc. Our main theory at this point is that, for partition 11 in this last reproduction, something being loaded from storage had gotten "too big", but we don't have enough visibility into what exactly that is. Can you please modify your host.json to enable low-level logs as follows:

{
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "StorageLogLevelLimit": "Trace"
      }
    }
  }
}

This should give us more clarity on the issue. Unfortunately, if you already cleared your taskhub (which was the right thing to do from a mitigation standpoint), these logs won't be immediately helpful, but we should be able to proactively monitor them over the coming days.

I'm going to be looking to see if there's more we can uncover in the meantime. Additionally, I want to re-emphasize that keeping your inputs and outputs to DF APIs "small" is likely to help here. I know that's a large refactor on your end, but I still want to stress its importance. Thank you for your help.

vany0114 commented 11 months ago

@davidmrdavid I just added this environment variable: AzureFunctionsJobHost__extensions__durableTask__storageProvider__StorageLogLevelLimit (the value is Trace, as you suggested). Please let me know when I can disable it, since I guess it will dramatically increase our Application Insights costs.
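For context, application settings prefixed with AzureFunctionsJobHost__ override host.json values, with double underscores acting as path separators. A minimal sketch of the equivalent entry in a local.settings.json Values section (for local testing only; in Azure this would be an application setting, and the file shape here is just the standard Functions convention, not something taken from this thread):

{
  "IsEncrypted": false,
  "Values": {
    "AzureFunctionsJobHost__extensions__durableTask__storageProvider__StorageLogLevelLimit": "Trace"
  }
}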

davidmrdavid commented 11 months ago

@vany0114:

There should be no need for those logs to flow into your Application Insights logs.

These are the controls for what gets sent to Application Insights: https://microsoft.github.io/durabletask-netherite/#/settings?id=parameters-in-the-global-logging-section
And these are the controls for what gets logged on our end: https://microsoft.github.io/durabletask-netherite/#/settings?id=logging-parameters-for-netherite
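For illustration, a minimal host.json sketch of the first control plane: the logging/logLevel section, which caps what flows to Application Insights independently of what Netherite emits to backend telemetry. The category names below are taken from the Netherite settings page linked above, and the levels are placeholders rather than recommended values:

{
  "logging": {
    "logLevel": {
      "DurableTask.Netherite": "Warning",
      "DurableTask.Netherite.FasterStorage": "Warning",
      "Host.Triggers.DurableTask": "Information"
    }
  }
}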

Are you seeing an excess of logs flowing into your Application Insights? If so, please undo this latest change. My recommendation was meant to not increase your Application Insights logs.

vany0114 commented 11 months ago

> Are you seeing an excess of logs flowing into your Application Insights? If so, please undo this latest change. My recommendation was meant to not increase your Application Insights logs.

@davidmrdavid Hard to say right now, since usage usually gets reflected the next business day. I'll confirm it in a couple of days and keep you posted.

vany0114 commented 11 months ago

Hi @davidmrdavid, any update on this? I had to reset the func app on Sunday again. It seems like it only stays stable for about a week.

davidmrdavid commented 11 months ago

We were looking into this today. No updates yet due to competing investigations, at least on my end, though our SME has been looking at this. I'm looking to provide an update by tomorrow.

davidmrdavid commented 11 months ago

@vany0114: can you provide me the approximate timestamp when you had to "reset your app"?

davidmrdavid commented 11 months ago

@vany0114: I see that on 11/3/2023 2:45:00 PM UTC, some operation took place that removed the environment variable AzureFunctionsJobHost__extensions__durableTask__storageProvider__StorageLogLevelLimit. At around that time, from what I can tell, we stopped getting verbose logs for your app. This makes sense, since that variable controls logging.

Do you recognize this operation? I noticed your app started experiencing timeouts again just 2 days later (around 11/05 ~14:20 UTC from my telemetry) which makes the timing very unfortunate.

vany0114 commented 11 months ago

> @vany0114: I see that on 11/3/2023 2:45:00 PM UTC, some operation took place that removed the environment variable AzureFunctionsJobHost__extensions__durableTask__storageProvider__StorageLogLevelLimit. At around that time, from what I can tell, we stopped getting verbose logs for your app. This makes sense, since that variable controls logging.
>
> Do you recognize this operation? I noticed your app started experiencing timeouts again just 2 days later (around 11/05 ~14:20 UTC from my telemetry) which makes the timing very unfortunate.

Yes, I removed that setting as per your comment.

Re the time window when the timeout issues appeared again, that's correct.

davidmrdavid commented 11 months ago

@vany0114: just to make this explicit - were you able to confirm that a large influx of new logs was flowing into your Application Insights instance and that's why you removed this setting?

If that's the case, then we can work together to set up the right filtering rules to prevent the logs from making it to your Application Insights while also allowing us to use the verbose logs in our internal telemetry.

vany0114 commented 11 months ago

@davidmrdavid It seems they increased a bit, but I'm not completely sure I can attribute that to this setting. Still, let's configure the right filters, as we don't want to incur any more extra costs due to this issue; it's already costing us time and money because the func app isn't working reliably. Not to mention that since we scaled it up to a premium tier it's costing us 100 USD per day, and I feel that isn't helping at all, since we've been experiencing the timeout issue consistently every single week. I'm actually thinking of scaling it down again.

Telemetry data: (screenshot)

Func app costs: (screenshot)

davidmrdavid commented 11 months ago

Ok, yes, let's work to filter these out. What we want is to be able to catch this error with these verbose logs on, as that is what's going to show us more precisely where your app is getting stuck.

First - let's validate that these logs were indeed making it to your Application Insights. Can you search for the substring "CommitLogEntry" in your Application Insights data during the past week? If you see it, then that confirms our verbose logs were making it to your Application Insights instance.

Here are a few such sample logs:

TIMESTAMP | Details
-- | --
2023-11-01 22:16:23.2803595 | CommitLogEntry Appended 3318166780 FL length=5323 crc=F94DD96C
2023-11-01 22:16:23.4283766 | CommitLogEntry Appended 3318168344 FL length=1557 crc=7F85EACE
2023-11-01 22:16:23.4325124 | CommitLogEntry Appended 3318172168 FL length=3818 crc=81AB5BBD

vany0114 commented 11 months ago

Confirmed, there are no results.

(screenshot)

davidmrdavid commented 11 months ago

Gotcha. I'll look for at least one more way to confirm that no verbose logs made it to your Application Insights, but this plus the difference in cost you've shown me makes me think that the logs probably did not make it to your instance. These logs are very verbose, sometimes exceedingly so, so I would have expected to see a higher jump in price.

But let me try to obtain at least one more check to send your way. If that check also yields no results, then I would recommend we re-enable the verbose logs.

After that, I will also try to see if the small snapshot of verbose logs we obtained can show us something growing "too large" in the storage for partition 6 (the partition that failed to start up on Saturday).

davidmrdavid commented 11 months ago

@vany0114:

A few more substrings for you to check for:

If none of these are there, then I'm confident the ultra-noisy log categories we enabled are not making their way to your Application Insights, and in that case I recommend re-enabling the verbose logs. What we want is to catch the error as it's happening with the verbose logs enabled; with those we'll be able to pinpoint the piece of data (or at least the step) that's failing to load/complete.

As before, please notify me when you enable the logs (so I can validate) and when you encounter this error again (and its broad time range in UTC) so I can search for these logs.

In terms of downgrading from Elastic Premium: If we're confident this is happening on a weekly cadence, I think it would be good to wait at least one more week to try to catch this error with the improved resource usage of Elastic Premium and the verbose logs enabled. I worry about adding noise again (from CPU and memory warnings), just as we're adding the verbose logs.

In the meantime - I want to emphasize again that my working theory is that working to reduce the memory pressure on Entities will help here. But obviously we'll be using the verbose logs to find any evidence of bugs or behaviors we can fix on our end.

vany0114 commented 11 months ago

@davidmrdavid I only found matches for the storage operation BlockBlobClient.Upload (WriteCommitLogMetadata) traces.

(screenshot)

davidmrdavid commented 11 months ago

@vany0114: I think those might be false positives: they are perf warnings (so "warning" or "error" level logs, not "trace" level), so you were probably getting them irrespective of the logs we enabled. Does that make sense? Do you find any non-error/warning instances of this log? If not, then I think we're in the clear.

To be clear, there's a "successful" version of this log that doesn't get logged as a warning or error.

If this makes sense, can we re-enable the verbose logs?

vany0114 commented 11 months ago

> @vany0114: I think those might be false positives: they are perf warnings (so "warning" or "error" level logs, not "trace" level), so you were probably getting them irrespective of the logs we enabled. Does that make sense? Do you find any non-error/warning instances of this log? If not, then I think we're in the clear.
>
> To be clear, there's a "successful" version of this log that doesn't get logged as a warning or error.
>
> If this makes sense, can we re-enable the verbose logs?

@davidmrdavid I just enabled it again, FYI

davidmrdavid commented 11 months ago

Thanks, I can confirm the verbose telemetry is back.

davidmrdavid commented 11 months ago

Unfortunately, so far I'm not able to get a clear signal from the logs without them being enabled on a live repro. If the issue is reproducing roughly once a week, I'm hoping we can catch the next repro with the verbose logs and with it pinpoint exactly what is getting stuck.

@vany0114: Any updates on trying to reduce the size of data sent over the DF APIs?

vany0114 commented 11 months ago

> Unfortunately, so far I'm not able to get a clear signal from the logs without them being enabled on a live repro. If the issue is reproducing roughly once a week, I'm hoping we can catch the next repro with the verbose logs and with it pinpoint exactly what is getting stuck.
>
> @vany0114: Any updates on trying to reduce the size of data sent over the DF APIs?

@davidmrdavid No updates on that, unfortunately. However, for what it's worth, I'm wondering why things used to work before we adopted this approach: https://github.com/Azure/azure-functions-durable-extension/discussions/2530. The old code was kind of spaghetti (but apparently worked more reliably); that's why we took the new approach, both to be able to reuse code and because it's cleaner.

To be clear, before we adopted the approach mentioned above we were using the Storage Account backend; the difference was that we didn't use LockAsync, and it used to work better. We did experience the stuckness issue, but only a few times during the year (which isn't ideal, of course), not every week.

My point is that the data size of the messages wasn't a problem before, and the schema of those messages hasn't changed compared to what we have today.

davidmrdavid commented 11 months ago

@vany0114: I can't say with precision yet why it used to work before you implemented this refactor, but I have some observations that may provide clarity.

Simplifying this case a bit - assuming the issue were simply that your states / inputs / outputs are growing too large, preventing Netherite from eventually loading its partition state: the Azure Storage backend has a mechanism to turn large data into blob storage references, creating a layer of indirection that helps keep the size of data in your tables and queues moderately small. As per my understanding, we don't have an equivalent optimization in Netherite yet. So this is one lost layer of safety that could partially explain the loss in stability. That's not to say the Azure Storage backend isn't susceptible to instability due to large data, but the errors manifest differently as a result of this blob-backed indirection.

As for your refactoring to partial classes: I haven't seen evidence that it's to blame. If that were the case, I think we'd see non-determinism errors or something like that, but we don't. I think we can safely disregard this as a culprit for now.

Regarding the "stuckness" in Azure Storage: it's hard for me to comment on this because, while we were investigating it, your app first went from using the new partition manager (which is in preview) to using the old one, and then to not using this backend at all. So I didn't get enough data points with this backend to really evaluate whether your issues with it all had the same root cause or not (it's possible you simply had different issues with very similar symptoms). If you have reason to think the Azure Storage backend is more stable for your workload, you could switch back to it. But if you do, please note it would reset this investigation, as the data points cannot be compared across backends, so there's a risk we'd go back to square one. If you do decide to switch, it would be great (on our end) to at least wait until we've seen the verbose logs during a reproduction of the error. I think we have a good chance of identifying the problem with these logs, or at least making a critical breakthrough.

You also mentioned on your last comment that your use of LockAsync is new. I had not caught that originally. Is your use of Entities also very recent? If you weren't using these entities before (and your app was stable), then that would be useful context for us to consider. The more context you can give me on this, the better. Thanks

vany0114 commented 11 months ago

> You also mentioned on your last comment that your use of LockAsync is new. I had not caught that originally. Is your use of Entities also very recent? If you weren't using these entities before (and your app was stable), then that would be useful context for us to consider. The more context you can give me on this, the better. Thanks

@davidmrdavid no, the use of Entities is not new, actually, the schema hasn't changed.

davidmrdavid commented 11 months ago

Ok, so my understanding now is that only the use of LockAsync is new. Thanks for clarifying

vany0114 commented 10 months ago

Hi @davidmrdavid, I think the error is starting to appear again, so I wanted to let you know so that you can take a look before it gets worse, because when that happens I will need to reset the app.

The timestamps that you see in the images below are in UTC, FYI: (screenshots)

Note: I left a question in the Azure support email thread asking about the easiest way to make a copy of the storage account linked to the func app, so that you guys can get the data you need for troubleshooting before I reset everything; still waiting for an answer.

davidmrdavid commented 10 months ago

@vany0114: thanks for the heads up. Looking now.

davidmrdavid commented 10 months ago

I took a look at your app and it doesn't seem to be producing the error signals I usually expect (i.e. a partition is fully unable to process requests), so I can't say that the error has reproduced yet.

I'll keep an eye on it, though. Are you finding this error stream to be sustained, or was it a small surge that seems to have recovered?

vany0114 commented 10 months ago

@davidmrdavid I think it is recovering, but it's hard to say; I see the error rate slowly increasing as the days go by. This is the last 12 hours (time is in UTC):

(screenshots)

I'm also seeing a new error that I haven't seen before:

Exception while executing function: ProviderOrderPlacedIntegrationHandler Netherite backend failed to start: Encountered error while fetching the list of EventHub PartitionIds Encountered error while fetching the list of EventHub PartitionIds A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

I hope this helps

vany0114 commented 10 months ago

Hi @davidmrdavid, any update on this? I'd like to highlight that while the issue has been happening again since the 21st (https://github.com/Azure/azure-functions-durable-extension/issues/2603#issuecomment-1821841271), it's not as bad as it used to be; it's happening pretty much all day long, but not that often.

Here's some data from the last 24 hours (time is in UTC): (screenshots)

davidmrdavid commented 10 months ago

Just noting I've seen this latest update - it will take me a bit to respond due to urgent competing threads, but I will look to provide you an update by tomorrow morning PDT at the latest.

davidmrdavid commented 10 months ago

@vany0114:

Just took a look.

In general, while your app is definitely seeing your Polly timeout firing in the past few days - I'm still not seeing the specific failure mode I've been tracking: a partition becomes permanently unresponsive, causing a disproportionate number of requests to fail (all requests that are routed to that failed partition). This has not happened since you last reset your app. Clearly, when a partition becomes permanently unresponsive, the number of failed requests rises to the point that you deem it necessary to reset your app.

I want to call this out because I want to make sure we're aligned on how we're defining "the issue". So far, if a partition has not failed, I've been treating these errors as false positives. In the past, this was fine because you were able to reproduce a permanent partition failure with some regularity (weekly), but that doesn't seem to be the case as of late.

I can definitely try to dive deeper into these failure spikes, but so far I see they match closely with periods of high invocation counts on your app in general, and make up less than ~0.15% of executions (on average) so I'm trying to determine if these are affecting your app's stability (in which case we can dive further) or if you'd rather we focus specifically on the partition unavailability error (which forces you to reset your app), in which case we're still waiting for a repro with the verbose logs enabled.

davidmrdavid commented 10 months ago

In general - since I see your error spikes correspond to periods of high work volume, I think it makes sense that your durable clients may be under more pressure and therefore need more time to get a response, so I think the Polly timeout may be too aggressive for those global execution spikes. I wonder if you'd see a decrease in Polly errors if you increased your Polly timeout duration (but make sure to correspondingly increase the functionTimeout parameter to avoid host restarts!) while still getting a response relatively quickly (we may be seconds away from getting a response sent to those clients; it's just that the partition could be busy due to the spike in invocations).

vany0114 commented 10 months ago

@davidmrdavid What would be the recommended timeout then? Right now it is timing out after 30 seconds.

davidmrdavid commented 10 months ago

@vany0114: I think we have to experiment. Can you make it 1 min and extend the functionTimeout config to 1 min 30 seconds? What we want to know is if it leads to a decrease in the failure percentage.

vany0114 commented 10 months ago

> @vany0114: I think we have to experiment. Can you make it 1 min and extend the functionTimeout config to 1 min 30 seconds? What we want to know is if it leads to a decrease in the failure percentage.

@davidmrdavid This is done and live. I didn't change the functionTimeout since it was already 10 minutes, FYI, but I think in the future I'm going to change it to 2 minutes.
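For reference, functionTimeout is a top-level host.json setting; a minimal sketch of the shape being discussed, using the 1 minute 30 seconds value suggested above (the Polly timeout itself lives in application code and is not shown here):

{
  "functionTimeout": "00:01:30"
}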

vany0114 commented 10 months ago

Hi @davidmrdavid, here's what I found after the last change you suggested:

Host is shutting down exception: (screenshot)

Polly timeout exception: (screenshot)

As you can see, the "Polly timeout" exceptions did decrease (by ~88.5%). On the other hand, the "Host is shutting down" exception rate increased; however, I don't think that's necessarily a bad thing, as it's happening only once or twice per day. It's just that when it happens it affects a bunch of in-flight requests while the host auto-heals and a new instance becomes ready to process new requests, I guess.

vany0114 commented 10 months ago

Hi @davidmrdavid, I've noticed that the "Host is shutting down" exception is happening more often now, during peak hours roughly every ~35-60 minutes, and the same goes for the "Polly timeout" exception.

Host is shutting down exception: (screenshot)

Polly timeout exception: (screenshot)

My biggest concern is that the latency has increased a lot; as you can see in the chart below, it went from ~5 secs to ~50 secs. This makes me consider going back to the Azure Storage Account backend, since we're not getting what Netherite is supposed to provide: more throughput and better response time. I remember the latency was ~30 secs when we were using the Azure Storage Account. My only concern about moving back is that we might experience the stuckness issue again; for better or worse, that issue was fixed in the private package that we're using.

(screenshot)

davidmrdavid commented 10 months ago

Hi @vany0114 - sorry for the delay, I didn't see this update until this morning.

For me to dive deep, I'll need you to provide me with a specific time range in UTC when you see a spike in host shutdowns and Polly timeouts. It would also help me if you can provide me with a few instanceIDs that seem affected (if applicable).

Taking a broad look though - I do see several host restarts occurring due to low memory in your VMs. I would be interested in decreasing your values of maxConcurrentActivityFunctions and maxConcurrentOrchestratorFunctions so that each VM tries to execute fewer of each at the same time, which would cause you to scale out further and reduce the memory pressure on each VM.

In terms of going back to the Azure Storage backend - I can't comment on the 'stuckness' issue because we were unable to make an in-depth investigation of what exactly happened there. However, the team is not aware of any chronic issue around stuck orchestrators in Azure Storage, so I would personally not be too concerned about that. It goes without saying that if you do decide to switch, and we encounter an issue, we'll investigate it and provide a patch like we did for Netherite.

vany0114 commented 10 months ago

Hi @davidmrdavid

> For me to dive deep, I'll need you to provide me with a specific time range in UTC when you see a spike in host shutdowns and Polly timeouts. It would also help me if you can provide me with a few instanceIDs that seem affected (if applicable).

You can see the time range in UTC in the images I provided, for both the host shutdown and Polly timeout errors.

> Taking a broad look though - I do see several host restarts occurring due to low memory in your VMs. I would be interested in decreasing your values of maxConcurrentActivityFunctions and maxConcurrentOrchestratorFunctions so that each VM tries to execute fewer of each at the same time, which would cause you to scale out further and reduce the memory pressure on each VM.

I can definitely give it a try. What values do you recommend? I see the default is 10 for both; 5 maybe?

vany0114 commented 10 months ago

> I can definitely give it a try. What values do you recommend? I see the default is 10 for both; 5 maybe?

@davidmrdavid This is live in production; I'm using 5 for both, FYI.
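For reference, a minimal host.json sketch of the two concurrency throttles being discussed, with the value of 5 mentioned above; these settings sit under extensions/durableTask alongside the storageProvider section shown earlier:

{
  "extensions": {
    "durableTask": {
      "maxConcurrentActivityFunctions": 5,
      "maxConcurrentOrchestratorFunctions": 5
    }
  }
}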

davidmrdavid commented 10 months ago

@vany0114: We took a look at the long latency issue and found a performance bug that may add up to ~30s to the recovery of a partition. The fix is simple and here: https://github.com/microsoft/durabletask-netherite/pull/335

I'll look to package this for you in a private package tomorrow. This may help a bit with some of the client latency issues (i.e. your Polly timeouts), but note it's not strictly tackling the "host is shutting down" alert.

davidmrdavid commented 10 months ago

I have released a new private package with the fixes in the PR above here: https://durabletaskframework.visualstudio.com/Durable%20Task%20Framework%20CI/_artifacts/feed/durabletask/NuGet/Microsoft.Azure.DurableTask.Netherite/overview/1.4.2-private.1

The version is 1.4.2-private.1 and may be obtained from the same feed as before:

    <add key="durabletask" value="https://durabletaskframework.pkgs.visualstudio.com/734e7913-2fab-4624-a174-bc57fe96f95d/_packaging/durabletask/nuget/v3/index.json" />

We believe this should help reduce some of the latency issues, though, as you'll notice, the PR is not merged yet because we're still going through the logs to see if there are other optimization opportunities.

davidmrdavid commented 10 months ago

As for the host shutting down - in most cases, I see that your VMs are shutting down due to low memory. It's not immediately clear where exactly all that memory consumption is coming from, as your app and the framework execute in a single process, but I think this is the metric we need to optimize. Let me see what we can do to determine where the memory usage is coming from.

vany0114 commented 10 months ago

@davidmrdavid I just deployed the func app using the provided version; I'll keep you posted. Also, we've prioritized the refactor you suggested to avoid passing big inputs to the Durable Client APIs and use indirection instead, so we will start working on that soon. That's something we already had on our roadmap as an enhancement, but it seems that by implementing it we will get rid of this issue and finally stabilize the app?

davidmrdavid commented 10 months ago

@vany0114: Fantastic, thank you. I think reducing the memory pressure will certainly help, so it's a worthwhile investment. We'll continue looking on our end as well for optimization opportunities.

vany0114 commented 9 months ago

@davidmrdavid Just as an FYI, the response time improved after the deployment using the 1.4.2-private.1 version; however, the latency is still high. As you can see, the highest used to be ~5 seconds and now it's ~15 seconds.

(screenshot)

We're going to start working this sprint on applying your suggestions to our implementation. I'll let you know once it's live in production so we can see how it behaves after that.

davidmrdavid commented 9 months ago

Thanks @vany0114. Can you please help me understand what we're comparing against when you say "the highest used to be ~5 seconds and now is ~15 seconds". Is the 5 seconds with a previous Netherite release, or with Azure Storage?

Edit: Nvm, I think I see this more clearly now on second thought. You're saying that in early December the latency was ~5 seconds, then it jumped around Dec 6th, and finally after upgrading to 1.4.2-private.1 it dropped back down, correct?

vany0114 commented 9 months ago

> Thanks @vany0114. Can you please help me understand what we're comparing against when you say "the highest used to be ~5 seconds and now is ~15 seconds". Is the 5 seconds with a previous Netherite release, or with Azure Storage?

It was with the previous package version; I'm not sure what started happening on December 6th, but that's when the latency started increasing.

vany0114 commented 9 months ago

> Edit: Nvm, I think I see this more clearly now on second thought. You're saying that in early December the latency was ~5 seconds, then it jumped around Dec 6th, and finally after upgrading to 1.4.2-private.1 it dropped back down, correct?

Exactly.