Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions

Orchestration getting stuck while getting the lock #2534

Open vany0114 opened 11 months ago

vany0114 commented 11 months ago

Description

The full description, with code snippets, screenshots, issue samples, etc., is here: https://github.com/Azure/azure-functions-durable-extension/discussions/2530

Expected behavior

Acquire the lock in seconds at most, not minutes or hours.

Actual behavior

It seems the orchestration is getting stuck while acquiring the locks of the entities intervening in the orchestration.
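For illustration, here's a minimal sketch of the kind of critical section involved (the function name is hypothetical; the entity names are taken from the instances discussed later in this thread, and the key derivation is not shown):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class OrderSagaOrchestrator // hypothetical name, illustration only
{
    [FunctionName("OrderSagaOrchestrator")]
    public static async Task RunAsync(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        // The key format mirrors the entity IDs quoted later in this thread
        // (e.g. "615c29df-..._2023-08-07_90007-1"); how it is built is omitted here.
        string key = context.GetInput<string>();

        var ticketNumberKey = new EntityId("ticketnumberkey", key);
        var originOrderKey  = new EntityId("originorderkey", key);
        var scheduledKey    = new EntityId("scheduledkey", key);

        // LockAsync completes only after every listed entity has granted its lock,
        // so a single unresponsive entity partition stalls the orchestration here.
        using (await context.LockAsync(ticketNumberKey, originOrderKey, scheduledKey))
        {
            // critical section: entity operations / activity calls go here
        }
    }
}
```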

Known workarounds

Reset the durable storage account and the func app storage account


davidmrdavid commented 11 months ago

@vany0114: while we start this investigation, can you share with us when this issue first started (rough timestamp in UTC)? Does it correlate with any code changes (e.g., partial classes, or trying out the new partition manager)?

vany0114 commented 11 months ago

@davidmrdavid It seems the issue started around 2023-08-07 13:34 UTC

davidmrdavid commented 11 months ago

Thanks! Any known code/config changes around that time? @vany0114?

vany0114 commented 11 months ago

@davidmrdavid Nope, we haven't touched anything since the last deployment on August 2nd, which BTW was the last time the issue showed up and we had to reset everything.

vany0114 commented 11 months ago

@davidmrdavid BTW, I'll re-deploy the func app today to try to mitigate this issue, but this time it will use Netherite as the storage provider. That means the old ordersagav2 storage account will still be there for troubleshooting; however, the func app's storage account will be reset. Would that be a problem for you guys while you look into it?
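For reference, a sketch of the host.json change involved in switching to Netherite (the connection setting names below are the documented defaults and may differ from our actual configuration):

```json
{
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "type": "Netherite",
        "StorageConnectionName": "AzureWebJobsStorage",
        "EventHubsConnectionName": "EventHubsConnection"
      }
    }
  }
}
```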

davidmrdavid commented 11 months ago

@vany0114: that shouldn't be an issue on our end, as our historical telemetry is stored independently of your app's live status, so you're good to go ahead :).

I'm not certain at this point in time that this is an issue with the storage provider itself, so I can't comment on whether changing to Netherite will help, but it probably won't hurt to try it out and may tell us more about the problem. Just an FYI.

vany0114 commented 11 months ago

Agreed, I'm not sure either; I just wanna try something different this time, and as you said, it may tell us more about the problem.

vany0114 commented 11 months ago

@davidmrdavid BTW, if you guys could also take a look at https://github.com/microsoft/durabletask-netherite/issues/294 that would be great, since it's related to the Netherite implementation.

davidmrdavid commented 11 months ago

A few small updates: (1) I don't see your app using the new table partition manager. I'm only calling this out because you mentioned this new feature in the discussion thread. Just to confirm: is this expected or did you take an action to try and use the new partition manager? Just triple checking the expectation.

(2) I'm not immediately spotting a partition manager-type issue. In other words, the queues that are hosting your orchestrator and entity messages seem to be receiving new messages just fine. So I'm starting to think the issue here might have to do with the entity message processing, not with receiving messages. Still need to validate this.

davidmrdavid commented 11 months ago

Actually, I'm seeing abnormally high message ages in "control queue 3". Let me drill into that. If that queue is hosting any of the entities you're trying to lock, maybe that's to blame.

vany0114 commented 11 months ago

(1) I don't see your app using the new table partition manager. I'm only calling this out because you mentioned this new feature in the discussion thread. Just to confirm: is this expected or did you take an action to try and use the new partition manager? Just triple checking the expectation.

Yeah, I rolled back that change because I thought it was the culprit, but it wasn't. Sorry, I forgot to mention that.
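For reference, the rolled-back change was roughly this host.json toggle (a sketch assuming the documented useTablePartitionManagement flag; our exact settings may differ):

```json
{
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "type": "AzureStorage",
        "useTablePartitionManagement": true
      }
    }
  }
}
```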

davidmrdavid commented 11 months ago

Actually, I'm seeing abnormally high message ages in "control queue 3". Let me drill into that. If that queue is hosting any of the entities you're trying to lock, maybe that's to blame.

I'm not able to draw a correlation from this observation to the symptoms you experienced, so let's ignore this for now.

Also, I've been focusing on this set of data, from the other thread:

{"storeId":"615c29df-5e7c-458c-9486-ef9480d2bf3b","courseUid":"c77bb860-04f7-46d3-86d9-d8c0e65fe856","checkNumber":"90007"} same, the instance was stuck 21 minutes:

* @ticketnumberkey@615c29df-5e7c-458c-9486-ef9480d2bf3b_2023-08-07_90007-1
* @originorderkey@615c29df-5e7c-458c-9486-ef9480d2bf3b_2023-08-07_90007-1
* @scheduledkey@615c29df-5e7c-458c-9486-ef9480d2bf3b_2023-08-07_90007-1
* @order@{"storeId":"615c29df-5e7c-458c-9486-ef9480d2bf3b","courseUid":"c77bb860-04f7-46d3-86d9-d8c0e65fe856","checkNumber":"90007"}

So that orchestrator is stuck from about ~11:33 to ~11:54, and I see in my telemetry that EntityId @scheduledkey@615c29df-5e7c-458c-9486-ef9480d2bf3b_2023-08-07_90007-1 is the last Entity to respond to that lock request, precisely at about ~11:54, whereas others responded around ~11:33. So the question became: why did it take us so long to acquire the lock on this final entity?

Well, I see that entity is managed by control queue 01. According to our telemetry, control queue 01 was being listened to by "VM 19" from ~13:15 to ~13:31. Then, no VM is actively listening to new messages from that queue until ~13:54, when "VM 20" takes over.

To summarize, it appears that no new messages are being processed from that queue from ~13:31 to ~13:54. So now what I need to answer is the following: (1) Why does VM19 stop listening to this queue in the first place? Normally, this is because that VM is in the process of shutting down, another worker may steal ownership of its assigned queues, and therefore the VM will stop listening to new messages. (2) Why does it take so long for VM20 to start listening to queue 01? My guess is that VM 19 took too long to restart, which prevented VM20 from being able to listen to that queue (only one VM can listen to a single queue at a time, otherwise we risk "split brain" issues).

That's what I've been able to find so far. Assuming I understand the problem correctly, this means, to me, that simply restarting your app should suffice to mitigate this situation when using the Azure Storage backend.
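(For example, a restart can be triggered from the Azure CLI; the names below are placeholders:)

```bash
az functionapp restart --name <function-app-name> --resource-group <resource-group>
```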

I'll continue looking into the logs to answer the two open questions above; I'll keep you posted.

vany0114 commented 11 months ago

@davidmrdavid I much appreciate the update 🙏

vany0114 commented 11 months ago

[...] this means, to me, that simply restarting your app should suffice to mitigate this situation when using the Azure Storage backend.

Unfortunately, it didn't work; I'm gonna reset it again.

davidmrdavid commented 11 months ago

Hi @vany0114 - if you have a new time range of when your app was stuck, as well as the timestamp of your restart, that would help us greatly.

vany0114 commented 11 months ago

@davidmrdavid It has been stuck since yesterday (2023-08-07 13:34 UTC); I restarted it when I left the previous comment. Now the func app is using Netherite and it's working fine, but I guess you'll be able to find a lot of instances stuck in the old storage account.

davidmrdavid commented 11 months ago

@vany0114: I'll be looking at the logs but I need a bit more details. In this post, you mentioned that some orchestrators were temporarily stuck / greatly delayed when acquiring locks. Those issues are from around ~13:34 UTC, and I think each self-healed within 30min if I understand correctly.

So when you say that it's been stuck since 2023-08-07 13:34 UTC, do you mean that there are some instanceIDs that never recovered? Or that you've been experiencing intermittent delayed orchestrations since that time? If you have an orchestrator that has been stuck since 2023-08-07 13:34 UTC and remains stuck until now, that would be a useful instanceID for us to have. Thanks!

vany0114 commented 11 months ago

It's not intermittent; it was happening with basically every request processed by the orchestrator. Here are two more examples where the instances were stuck for hours.

As a result, our processed orders rate was affected; you can see here how dramatically it dropped, because the orchestrator was holding almost all of them. [screenshot]

@davidmrdavid please let me know if the information provided helps

nytian commented 10 months ago

Hi @vany0114. My team member @davidmrdavid is OOF and I am taking over this case. Can you tell me what kind of id this is: "id":"3107849517381639"? And is this time, "businessDate":"2023-08-07T00:00:00", in UTC? Thanks!

I queried with the function app name and timestamps since 2023-08-07T00:00:00 UTC, and I didn't find any message with an age of several hours. So I think the orchestration instanceIds of the two examples above would be helpful for me to investigate.

vany0114 commented 10 months ago

Hi @nytian, the ids I've shared are actually the orchestration ids:

The orchestration ids are basically small serialized JSON objects.
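Roughly how they're produced; this is an illustrative sketch, not our exact code, and the orchestrator function name here is hypothetical:

```csharp
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class OrderSagaStarter // illustration only
{
    public static async Task<string> StartAsync(IDurableOrchestrationClient client)
    {
        // The orchestration instance ID is just the serialized order key.
        var orderKey = new
        {
            storeId = "615c29df-5e7c-458c-9486-ef9480d2bf3b",
            courseUid = "c77bb860-04f7-46d3-86d9-d8c0e65fe856",
            checkNumber = "90007"
        };
        string instanceId = JsonSerializer.Serialize(orderKey);

        // "OrderSagaOrchestrator" is a hypothetical orchestrator function name.
        return await client.StartNewAsync("OrderSagaOrchestrator", instanceId, orderKey);
    }
}
```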

nytian commented 10 months ago

I queried the two instanceIds above. The orchestrator awoke when it received the external event, which is expected.

[screenshot]

As you can see, the ExternalEventRaised message's age is only 900, which means the message got picked up by the orchestration in less than 1 sec. I didn't find any problem here. It seems like the orchestrator is just waiting for the external event to be raised, and the time gap was when the orchestrator was just waiting.

So I am curious what the expected behavior for the event is here. Is it supposed to be raised that much later? And how do you raise the event? Thanks!

vany0114 commented 10 months ago

@nytian Not sure if you've already read the whole thread and all the documentation provided, but the problem to figure out here is why that event took that long. That's not an event our orchestrator is explicitly waiting for; it seems that, internally, when an entity is locked it waits for some sort of ack, and that is what causes the orchestrations to hang.

[screenshot]

vany0114 commented 10 months ago

@nytian @davidmrdavid any updates?

vany0114 commented 10 months ago

More examples from today

FWIW it was working great after we started using Netherite; however, yesterday we made a deployment around 2023-08-18T12:45 UTC, and after that it seems the problem has arisen again.

@davidmrdavid Can we get some help ASAP, please? This is happening in our production environment. cc @cgillum

vany0114 commented 10 months ago

It happened again today. One weird thing I noticed is a spike in the Netherite Event Hub incoming messages around 8:32 am and 9:37 am UTC, which does not match the traffic volume during that time window. (FWIW, the problem from yesterday showed up in the same time window.)

[screenshot]

As you can see, the volume of requests in the Event Hub does not match the incoming traffic at that time. You can also see that the response time was degraded for one hour: it was consistently 1 minute (because of our timeout configuration). [screenshot]

After that, it seems the orchestrator auto-healed and started operating normally again.

This is one of the instances affected during that time window (same problem, locking an entity): {"storeId":"7794ca1e-7e8c-49ef-8e2a-763fc114a99c","courseUid":"cb03b4f3-a26a-4465-84d1-9e44a7ae65e2","checkNumber":"28584"}

[screenshot]

@davidmrdavid @cgillum

vany0114 commented 10 months ago

@davidmrdavid @nytian any updates?

davidmrdavid commented 10 months ago

I'm back from OOF. I'm re-engaged on this issue, sync'ing with @nytian to understand the latest.

nytian commented 10 months ago

Hi, sorry for the late response. We are still working to identify the root cause.

First, for the Azure Storage backend with the old partition manager: based on the orchestration instanceId you provided, it looks like the partition for control queue 01 couldn't be handed over to another worker for several hours. @davidmrdavid is working on a private package to mitigate this temporarily, and we hope to give that private package to you tomorrow.

Since the issue hits both versions of the partition manager, we don't know if the root cause above is the same in the new partition manager. So could you provide us with any orchestration instanceId / task hub / timestamp that hit this issue with partition manager V3? It's fine even if the storage account has been deleted, since we keep the Kusto logs in a separate storage account. That would help me identify the cause in the partition manager V3 scenario.

davidmrdavid commented 10 months ago

Just to add to @nytian's note above, here's my update:

We have yet to dig into the Netherite repro. To do so effectively, we will need assistance from our Netherite SME, who I'm engaging internally. The fact that you're experiencing this issue across DF Azure Storage and DF Netherite suggests that this issue may have nothing to do with your backend selection at all, so that is critical information. For example, one possibility is that this issue has to do with Entity message processing, which is backend agnostic.

However, from looking at your logs when you were using Azure Storage, I do see strong evidence of there being a problem with the partition management logic (a control queue is without a worker listening to it for several hours) so that's the first symptom I am interested in tackling with a mitigation package.

At this point, we're still combing through the noise. It's a bit difficult to identify a clear root cause theory because we need to consider logs across different partition management strategies and backend selections, but we're on it. I'll look to provide another update tomorrow.

davidmrdavid commented 10 months ago

@vany0114: just for clarity- is the issue in Netherite auto-healing whereas the same issue in Azure Storage is not? That's what I understood when you said: "After that, it seems to be that the orchestrator auto-healed and started operating normally again."

vany0114 commented 10 months ago

Appreciate the update @davidmrdavid. Re the auto-healing: that's correct, it never happened when using a storage account before, but now it does. Today, for instance, it took like 3 or 4 hours to heal; the weird thing is that the problem seems to happen around the same time.

vany0114 commented 10 months ago

@nytian Unfortunately I don't have that information, since we're not using a storage account anymore, and my understanding is that the V3 partition manager is only for the storage account backend.

davidmrdavid commented 10 months ago

@vany0114: can you please clarify this sentence: "the auto-healing that's correct, it never happened when using a storage account before, now it does, today for instance it took like 3 or 4 hours to heal, [...]"

From that sentence, I understand that you're saying the auto-healing never happened with the Azure Storage backend, but now it does. This confuses me because I thought you were not using the Azure Storage backend anymore, so how do you know that it now auto-heals? For clarity's sake, it would help me if moving forward instead of saying "storage account" you could say either "using the Azure Storage backend" or "using the Netherite backend". Thanks!

Going over the Netherite repros now.

vany0114 commented 10 months ago

[...] I understand that you're saying the auto-healing never happened with the Azure Storage backend, but now it does.

@davidmrdavid that's correct.

This confuses me because I thought you were not using the Azure Storage backend anymore, so how do you know that it now auto-heals?

We're not using "the Azure Storage backend" anymore, that's also correct, when I say it is auto-healing is because after the orchestrator gets hung for one hour or four (still a problem), it backs to normal again (but this time we're using the Netherite backend), meaning it starts processing requests with a normal latency w/o me having to restart anything, it never happened when we were using "the Azure Storage backend" tho I always had to reset the durable function's storage account (is it correct for you using this term here? I guess it's what you call "the Azure Storage backend") and the func app storage's account.

Let me know if you need any other information. Thanks!

BTW are you guys publishing that mitigation package today or do you have an ETA?

davidmrdavid commented 10 months ago

Thanks @vany0114, I think that clarifies my question.

BTW are you guys publishing that mitigation package today or do you have an ETA?

We're not publishing it today because I would like to avoid asking you to revert back to the Azure Storage backend to try my mitigation without first understanding what's happening on Netherite. I understand the situation with Netherite is still not entirely mitigated, but with my limited understanding of the root problem so far, I think it's probably better to remain in it while we investigate your most recent incidents with it.

Unfortunately I can't provide you with an ETA just yet, but I'll be providing you with daily updates, if not more often.

vany0114 commented 10 months ago

BTW are you guys publishing that mitigation package today or do you have an ETA?

FWIW, we did experience the same issue with the Azure Storage backend; with Netherite it is working way better. It's not 100% reliable, but it's better compared to what we got when using the Azure Storage backend.

davidmrdavid commented 10 months ago

Note that I'm not the Netherite expert on our team, but I still wanted to share my observations so far; note that they are subject to correction:

I'm looking at the logs for {"storeId":"7794ca1e-7e8c-49ef-8e2a-763fc114a99c","courseUid":"cb03b4f3-a26a-4465-84d1-9e44a7ae65e2","checkNumber":"28584"} and I'm seeing something that seems rather Netherite specific: the partition hosting EntityId "@ticketnumberkey@7794ca1e-7e8c-49ef-8e2a-763fc114a99c_2023-08-20_28584-1" (partition 7) appears to be assigned to workers in quick succession from ~8:30 UTC to ~9:22 UTC.

My theory is that since this partition is constantly getting re-assigned to new owner workers, we are not able to effectively process its messages and therefore your locking request is taking a long while to process. At ~9:29, partition 7 finally gets a stable worker assignment and it processes the lock request for "@originorderkey@7794ca1e-7e8c-49ef-8e2a-763fc114a99c_2023-08-20_28584-1", which in turn requests the lock for the next entity "@scheduledkey@7794ca1e-7e8c-49ef-8e2a-763fc114a99c_2023-08-20_28584-1" and so on until all entities are locked and the original orchestrator is unblocked. This process occurs quickly once we're able to obtain the lock for "@originorderkey@7794ca1e-7e8c-49ef-8e2a-763fc114a99c_2023-08-20_28584-1" at ~9:22, so I'm currently thinking that the frequent re-assignment of partition 7 across workers was to blame for the delay.

I do not know why partition 7 is being re-assigned to workers so often. I will need assistance from the Netherite SME to figure that out.

I'm going to analyze your other Netherite instanceIDs to see if the same behavior is to blame in those ones as well. From there, there next steps should be clearer.

davidmrdavid commented 10 months ago

I am seeing something similar for {"storeId":"5c7d0ceb-d19f-48e9-a62a-ed8702d3ff95","courseUid":"5e44fac5-239b-406f-b400-e29885b3833f","checkNumber":"13624"} on 8/19 at around ~7:50.

Here, the locking of @order@{"storeId":"5c7d0ceb-d19f-48e9-a62a-ed8702d3ff95","courseUid":"5e44fac5-239b-406f-b400-e29885b3833f","checkNumber":"13624"} goes to partition 0, which, as far as I can tell, is getting moved around workers rather frequently during the time of the incident (~7:30 to around ~10:50). I can't fully figure out yet if any messages are being processed for that partition at this time, but it does appear to me that its ownership is being transferred with high frequency.

I'm trying to determine if this is the result of a very sudden scale-in operation. If right before your application experienced this problem you had scaled out aggressively, and now you're scaling in quickly, then workers would be getting removed from your application in quick succession. If partition 0 is continuously getting transferred to workers that are getting deallocated from your app, then that can delay the processing of your lock request until partition 0 is assigned to an owner that isn't at risk of being removed. I'm definitely seeing evidence of scaling during this time, but I have yet to fully convince myself that this is to blame.

While I continue to investigate: does this scale-in theory match your observations / the known behavior of your app? Does your app scale out quickly right before you lock your entities and then scale in aggressively as you experience your orchestration being "stuck" while waiting for the lock?

vany0114 commented 10 months ago

does this scale-in theory match your observations / the known behavior of your app?

During that time window there's low traffic; I guess at that point the Azure Functions scale controller starts killing idle instances, not sure.

Does your app scale out quickly right before you lock your entities and then scale in aggressively as you experience your orchestration being "stuck" while waiting for the lock?

Hard to tell; how can we determine that? We use a Consumption plan; the scaling process is dynamic.

davidmrdavid commented 10 months ago

During that time window there's low traffic; I guess at that point the Azure Functions scale controller starts killing idle instances, not sure.

Yep, that's correct. When there's low traffic, or low activity of Azure Functions themselves, the Scale Controller will start removing instances.

Hard to tell; how can we determine that? We use a Consumption plan; the scaling process is dynamic.

I would expect that your portal or Application Insights would have a visualization of this, but don't worry if you don't have immediate access to this information; for the cases I've checked so far, I can already see internally that there are scale-in operations during the time of this incident. I just wanted to triple check if you had observed the same.

Anyways, I'm in the process of gathering up my internal queries to share them with the team for further validation. Still, my current theory is that you're hitting different problems per backend, but it's just that they have a similar user-facing consequence. I think this is consistent with the fact that the issue auto-heals in Netherite but it does not in Azure Storage. I'll provide another update tomorrow.

In the meantime, what's the frequency of this issue with Netherite? Are you seeing this at random, daily, rarely? That will help me determine next steps.

davidmrdavid commented 10 months ago

@vany0114: do you have a staging environment where we could get a repro of this issue with Azure Storage again? That would be a good way to test out a private package and mitigations alike. Similarly, we still have no insights as to what kind of problem occurred with the new Azure Storage partition manager (we only have an instanceId for the old one) so a repro with the new partition manager would be insightful.

vany0114 commented 10 months ago

In the meantime, what's the frequency of this issue with Netherite? Are you seeing this at random, daily, rarely? That will help me determine next steps.

@davidmrdavid It is happening consistently every single day, but yesterday it happened at a different time, around 23:28 UTC. For example:

vany0114 commented 10 months ago

@vany0114: do you have a staging environment where we could get a repro of this issue with Azure Storage again? That would be a good way to test out a private package and mitigations alike. Similarly, we still have no insights as to what kind of problem occurred with the new Azure Storage partition manager (we only have an instanceId for the old one) so a repro with the new partition manager would be insightful.

@davidmrdavid Unfortunately we don't

vany0114 commented 10 months ago

Posting these warnings just in case they're useful:

[screenshot]

davidmrdavid commented 10 months ago

Thanks @vany0114.

The warning for "We detected activity from at least one non AzureStorage backend" can be safely ignored. It is simply a heads up that we are still in the process of adding thorough detectors for Netherite and MSSQL.

I had definitely noticed those platform-initiated restarts. That's an area of investigation I'm still pursuing. In particular, I'm attempting to correlate those restarts with scale-in operations; that's why I brought that up a few comments ago.

That aside, I wanted to give you my update on this case for today:

On the Netherite side: I've shared my findings internally and the Netherite SME is engaged. I'll be waiting for their comments.

On the Azure Storage side: from the telemetry we have for Azure Storage, we observed that we had stopped listening to a particular control queue (control queue 01) for several hours. During this time, "VM 20" was still renewing its ownership over the queue, which prevented other VMs from attempting to listen to messages in that queue. I believe that's why your orchestrator was stuck.

As a quick but aggressive mitigation on Azure Storage, I'm prototyping and testing a simple private package here (https://github.com/Azure/durabletask/pull/946) that fails fast whenever we detect that a given worker has continued renewing its ownership of a control queue for over 10 minutes despite no longer listening to new messages from it. This would be a sudden hard failure of the Azure Functions process and may pollute your telemetry whenever it happens, but it would immediately free up that queue to be consumed by another healthy worker.
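To illustrate the idea only (this is a conceptual sketch, not the actual code in that PR):

```csharp
using System;
using System.Threading;

// Conceptual sketch of the fail-fast idea described above: if this worker keeps
// renewing ownership of a control queue but has not actually dequeued from it for
// too long, crash the process so a healthy worker can take over the queue.
public sealed class ControlQueueStallWatchdog : IDisposable
{
    private static readonly TimeSpan Threshold = TimeSpan.FromMinutes(10);
    private readonly Timer _timer;
    private long _lastDequeueTicks = DateTime.UtcNow.Ticks;

    public ControlQueueStallWatchdog()
    {
        // Check once per minute whether the owned queue has gone stale.
        _timer = new Timer(_ => Check(), null, TimeSpan.Zero, TimeSpan.FromMinutes(1));
    }

    // Call this whenever a message is successfully dequeued from the owned control queue.
    public void RecordDequeue() =>
        Interlocked.Exchange(ref _lastDequeueTicks, DateTime.UtcNow.Ticks);

    private void Check()
    {
        var last = new DateTime(Interlocked.Read(ref _lastDequeueTicks), DateTimeKind.Utc);
        if (DateTime.UtcNow - last > Threshold)
        {
            // Hard-stop the process; the queue is then free to be picked up by another worker.
            Environment.FailFast("Control queue owned but not consumed for over 10 minutes.");
        }
    }

    public void Dispose() => _timer.Dispose();
}
```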

I'll let you know once I think that package is sufficiently safe to be consumed, I need to do some testing. Even then, we'll need to discuss if it's even a good idea to switch back to the Azure Storage backend when at the very least Netherite is auto-healing. In any case, I'll be keeping you posted.

vany0114 commented 10 months ago

@davidmrdavid Thanks for the update! Regarding the fix you're working on in the private package, is it possible to do the same in the Netherite package? The reason I'm asking is not only because we had a worse experience with the Azure Storage backend, but also because it was on our roadmap to move to Netherite for performance/throughput reasons.

vany0114 commented 10 months ago

@davidmrdavid Can you guys come up with something today to mitigate this situation? This is hurting us really badly. Last night we stopped processing orders from ~10 p.m. to ~5 a.m. today CT.

Order processing stopped at ~10 p.m. CT: [screenshot]

Event Hub behavior: [screenshot]

vany0114 commented 10 months ago

FYI, I've changed the task hub name; according to the docs, that restarts the application from a clean, empty state. I've called it ordersagaworkaroundv1.

[screenshot]
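For reference, the rename is just the task hub setting in host.json, roughly:

```json
{
  "extensions": {
    "durableTask": {
      "hubName": "ordersagaworkaroundv1"
    }
  }
}
```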

vany0114 commented 10 months ago

@davidmrdavid any update from the Netherite SME?

davidmrdavid commented 10 months ago

@vany0114:

Regarding the fix you're working on in the private package, is it possible to do the same in the Netherite package? The reason I'm asking is not only because we had a worse experience with the Azure Storage backend, but also because it was on our roadmap to move to Netherite for performance/throughput reasons.

I'm not able to implement the same bug fix in Netherite because the behavior I'm seeing in Azure Storage is not the same as what I'm seeing in your Netherite logs. In Azure Storage, for the data point you provided, I observed a clear and known kind of bug, so I can design a mitigation. In Netherite, since it's a different backend, the behavior is very different and I have not fully determined if there's a bug in Netherite itself, or if a different component is to blame.

To be direct, unless we make a breakthrough on the Netherite side of the investigation, I think our fastest path to mitigating this issue lies in a temporary return to Azure Storage; that's why I'm working on a mitigation package for it.

With all that said, I've been broadening my investigation to look at the health of your Function app during the incident time ranges. At least for the last two incidents you reported ([8/22 23:20, 8/23 00:30] and now [8/25 03:00, 8/25 10:00]), I notice that your app is reporting both high CPU usage and a dramatic increase in timeout errors for ProviderOrderUpdatedIntegrationEventHandler, OrderDeliveredIntegrationEventHandler, and ProviderOrderPlacedIntegrationHandler.

See below for evidence of these timeouts:

[screenshot]

The 3 aforementioned Functions are not managed by Durable Functions (they are EventGrid triggers, it seems), so the fact that they are failing during this time suggests to me that maybe there's something else causing problems in your app. Do you have any intuition as to why these Functions would start timing out? I'm wondering if the issue is that your app is running into resource contention (which would explain the high CPU, high memory, and timeout warnings I'm seeing outside of Durable Functions).

One way to validate this theory would be to do two things: (1) scale up (i.e change your plan to utilize more powerful virtual hardware) so that your app is less likely to run into high CPU and high memory warnings, which could indicate a real problem. (2) Try increasing your function timeout. I've seen cases where, in response to an Azure Function timeout, the entire Azure Functions process restarts just in case the timeout is the result of the process being unhealthy. This can cause stability problems if it occurs too often.