Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions

Exit Code 137 thrown, orchestration fails #2788

Open SeanMJFowler opened 7 months ago

SeanMJFowler commented 7 months ago

Description

We're running an orchestration on 20 "Always Ready" EP3 instances where a large amount of data is cached in memory, so RAM usage is high. However, I observed the issue live and no instance/server had more than 7.5GB RAM usage at the time (according to App Insights Live Metrics).

Expected behaviour

The orchestration succeeds (even if some tasks have to re-run).

Actual behaviour

The orchestration fails with an exit code of 137.

[Screenshot: Durable 137 error]

Relevant source code snippets

None available

Known workarounds

We're going to try caching less data, at the expense of performance. However, this will at best be a mitigation, not a fix.

App Details

We're using C# and have just upgraded to .NET 8 from .NET 6. The previous version didn't have this issue.

We cache large amounts of data for performance reasons, using Microsoft.Extensions.Caching.Memory.IMemoryCache. The data we cache is static, with no expiration. We load the data on demand and then cache it rather than pre-loading the cache.
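(For context, the load-on-demand pattern described here boils down to something like the sketch below; the class and method names are illustrative, not from the actual app.)

```csharp
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;

public class ReferenceDataStore
{
    private readonly IMemoryCache _cache;

    public ReferenceDataStore(IMemoryCache cache) => _cache = cache;

    // No expiration is configured, so entries stay resident until the
    // worker process is recycled (which is why memory only ever grows).
    public Task<byte[]?> GetAsync(string key) =>
        _cache.GetOrCreateAsync(key, _ => LoadFromStorageAsync(key));

    private static async Task<byte[]> LoadFromStorageAsync(string key)
    {
        await Task.Delay(10); // placeholder for the real load from storage
        return new byte[1024];
    }
}
```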

On .NET 6, occasionally instances/servers would disappear with high RAM usage and reappear with low RAM usage, presumably having restarted, but the orchestrations didn't fail because of it; they just took a little longer to complete successfully.

If deployed to Azure

SeanMJFowler commented 7 months ago

A little extra detail. We've done a test where we fully populated the cache (rather than load-on-demand) and when the server RAM usage got to around 6.5-7GB the orchestration failed with the 137 error. This was on a single EP3 server (Linux) so it should have 14GB RAM.

We did a second test caching less data (in fact we're caching the same data, but as Parquet files rather than the unpacked data). That got up to 4.5GB RAM usage and completed successfully. However, unpacking the Parquet data every time we need it isn't free, and the low RAM ceiling could still be hit during normal usage and cause random orchestration failures. Basically, this seems a lot less durable on .NET 8 than it was on .NET 6.

SeanMJFowler commented 7 months ago

@nytian can I challenge the 'Performance' label please? Yes, we're reducing our caching to work around the bugs, which will impact performance. However, I think there are two genuine bugs here, and both affect stability/durability:

  1. The memory ceiling seems to be 7GB on a 14GB Linux server
  2. When the memory ceiling is hit the orchestration fails, when it should carry on and allow another server to pick up the failed task. That is, a server-level failure shouldn't cause the orchestration to fail, and it didn't on .NET 6
nytian commented 7 months ago

Thanks for providing the message, @SeanMJFowler. I will take a first look and start investigating.

nytian commented 7 months ago

@SeanMJFowler I am contacting the support engineer on your open case so that we can open a sev3 customer-reported incident with the Functions host team, considering the error message described is from the host and this is a regression between .NET 6 and .NET 8.

cgillum commented 7 months ago

Looks like this case was assigned to me. :)

A couple things to clarify up front to avoid confusion:

We may need to experiment a bit to understand the root cause. A few questions/suggestions:

Thanks for your patience on this.

Side note: I'm also curious if this could be specific to the Azure Storage SDK, in which case another thing I'd like us to try is using a newer Azure Storage SDK version. However, I don't think we currently have a release with the newer SDK that is compatible with the .NET Isolated worker.

nytian commented 7 months ago

@cgillum Thanks, Chris, for looking at this! I thought the case would go to the host team.

Side note: I'm also curious if this could be specific to the Azure Storage SDK, in which case another thing I'd like us to try is using a newer Azure Storage SDK version. However, I don't think we currently have a release with the newer SDK that is compatible with the .NET Isolated worker.

I wonder if you are talking about the Azure Storage Track 2 SDK. If that's the case, we do have a .NET Isolated version that uses the Azure Storage Track 2 SDK, which is v1.2.0-rc.1.

SeanMJFowler commented 7 months ago

Hi Chris, I've seen your many contributions over the years so I'm very happy to have you looking into this issue.

I can see why you're tying this issue together with the Data Lake timeouts, as we have large numbers of those. However, that's a separate issue and we have another ticket open with MS about it (we also got these errors with .NET 6). These timeouts don't cause orchestrations to fail as we have built-in retries. They do cause performance issues, and ironically caching is our (partial) mitigation for that issue, while reducing caching is our mitigation for this one. For now we're taking the performance hit as stability is more important.

From my research (e.g. this: https://specificlanguages.com/posts/2022-07/14-exit-code-137/), exit code 137 on Linux means the process was killed with SIGKILL (137 = 128 + signal 9), and this generally happens when it runs out of RAM. This fits with the behaviour we're seeing: everything's fine until it gets to 6.5-7GB RAM usage, and then the orchestration consistently fails. Now that we're caching far less we don't hit that level of RAM usage and we don't get the error; if it were the storage issue we'd still get the error.

I agree that error 137 is a process crash, but one issue is that this shouldn't cause the orchestration to fail, yet it does. This error isn't thrown in our code so we can't catch it and deal with it; although it's a process crash, I think it's in the host process.

In response to your questions:

1) Yes the storage is in the same region as the function app. We believe the Data Lake timeouts are due to contention at the Data Lake level, i.e. not specific to our subscription

2) As this is a Durable Functions app, we could only use the in-process model on .NET 6. The RAM ceiling then was the expected 14GB, and when a server crashed through lack of RAM the orchestration carried on and still completed successfully

3) This is our only app which uses Durable. We don't have any other apps which stress normal Azure Functions anywhere near this much. I agree that the low RAM ceiling and the fact that hitting it causes orchestrations to fail are likely two separate issues; unfortunately they combine to cause bigger problems than either would alone

4) It's not easy for us to switch to P2V3. In the Azure Portal only EP1-3 are available for our App Service Plan, so I'd probably have to get the DevOps team involved to remove and recreate the App Service Plan and Function App. However, if P2V3 would work better for us than EP3 it could be worth a go (more than 14GB RAM would be useful)

5) We'd like to switch to Netherite but it's a little way down the roadmap

I think if you were to generate large amounts of dummy data in-memory (i.e. take storage out of the equation) then you'd likely see the same behaviour we're experiencing, assuming you run on Linux EP3.
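(A throwaway repro along those lines might look something like this — purely illustrative, not the actual workload.)

```csharp
using Microsoft.Azure.Functions.Worker;
using Microsoft.Extensions.Caching.Memory;

public class DummyCacheLoader
{
    private readonly IMemoryCache _cache;

    public DummyCacheLoader(IMemoryCache cache) => _cache = cache;

    // Each call pins roughly 100MB of dummy data in the cache; calling it
    // repeatedly from an orchestration should push a Linux EP3 worker
    // toward whatever the real memory ceiling is.
    [Function(nameof(FillCache))]
    public void FillCache([ActivityTrigger] int iteration)
    {
        _cache.Set($"chunk-{iteration}", new byte[100 * 1024 * 1024]);
    }
}
```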

cgillum commented 6 months ago

@SeanMJFowler thanks for the detailed response! It looks like I misunderstood the issue - so it's not a crash that we're concerned about, but rather that the orchestration actually goes into a "Failed" state, correct? That's really odd, because system failures should never cause orchestrations to fail.

Let me take a closer look at the instance ID you shared to see if I can get more information about what happened.

cgillum commented 6 months ago

Quick update: I did a quick check on orchestration instance c38c0921c282429695a5efc8fa8913c6 and here are a couple things I noticed:

  1. Several activities fail with internal errors unrelated to the actual activity execution. These appear to be related to the .NET Isolated worker host crashing. Error messages include "WorkerProcessExitException: dotnet exited with code 137 (0x89)" (mentioned above) and "System.InvalidOperationException: Did not find any initialized language workers".
  2. The functions host itself is not crashing, which confirms that the orchestration failures are unrelated to storage account latencies.

I don't think we'll be able to answer the question about why the worker process is crashing (we'll need help from another team for that) but the fact that it's being treated as an activity function failure (something I didn't realize until now) appears to be a bug in the Durable Functions extension. Normally, these kinds of transient host issues should be handled, and activity requests should be aborted and automatically retried without the orchestration's knowledge. I'm wondering if there's an exception-handling case that we're missing.

@SeanMJFowler apologies if I missed this, but are you able to work around this by adding retries to your activity function calls? It seems to me that should work as a workaround in this case until the other two issues (the worker crash and the incorrect exception handling) are resolved.
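(For anyone following along, the kind of activity retry Chris is describing looks roughly like this with the .NET Isolated Durable Functions SDK; the activity name, input, and intervals are placeholders rather than anything from this app.)

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class RetryExample
{
    [Function(nameof(RunOrchestrator))]
    public static async Task<string> RunOrchestrator(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        // Retry the activity a few times with exponential backoff before the
        // failure is surfaced to (and fails) the orchestration.
        TaskOptions retryOptions = TaskOptions.FromRetryPolicy(new RetryPolicy(
            maxNumberOfAttempts: 3,
            firstRetryInterval: TimeSpan.FromSeconds(30),
            backoffCoefficient: 2.0));

        return await context.CallActivityAsync<string>(
            "ProcessChunk",               // placeholder activity name
            context.GetInput<string>(),
            retryOptions);
    }
}
```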

SeanMJFowler commented 6 months ago

Thanks Chris, that seems like the right track.

The fact that the worker crashes when it runs out of memory is actually expected behaviour for me as it’s what happened with .Net 6, and restarting is at least one way to clear down the RAM so it doesn’t keep failing. With isolated mode perhaps it could be done more gracefully and quickly (e.g. don’t reset the server and the host, just kill the worker) but the old way does still work.

What’s not expected is the low RAM ceiling and the orchestration failure, you’re dealing with the latter but presumably the other team should deal with the former? Will you be getting them involved or should we do that? If the underlying issues are a little more complicated that’s fine, you know this stuff a lot better than I do.

Adding retries to our activity functions wouldn't work too well for us, at least if that means they go to the end of the queue. The queue can contain a few hours' worth of work, so a retry could mean hours of delay in an orchestration completing. However, I haven't tried the retries to see how they work; if the tasks go to the top of the queue then that could work for us.

cgillum commented 6 months ago

What’s not expected is the low RAM ceiling and the orchestration failure, you’re dealing with the latter but presumably the other team should deal with the former? Will you be getting them involved or should we do that?

I can take care of getting the right folks involved for the memory-cap question. That can be done through the support ticket you already opened. I'll work on that.

Adding retries to our activity functions wouldn’t work too well for us, at least if that means they go to the end of the queue.

Are you referring to the internal queue in the Durable Functions task hub? If so, yes, activity retries will effectively go to the back of the internally managed work item queue.

SeanMJFowler commented 6 months ago

Thanks Chris.

Yes, for the retries that's what I meant. If the retry might have to wait a significant amount of time then it's not suitable for us; we just need to avoid those failures occurring (which we've done by caching far less).

danbasszeti commented 6 months ago

Hi,

I think I have exactly the same issue, on the Consumption plan and with the same history (upgrade to .NET 8). It seems to happen after I start creating PDFs and saving them in Azure Storage, and my memory use is pretty close to the limit of 1.5GB from what I can tell. As before, I always used to see some activity function errors, but the retries took care of them so it wasn't a problem. It's not good for us as it terminates our orchestrator prematurely.

Any ideas for fixes (wondering if chat continued offline)? I'm not in a position to start putting the apps onto Elastic Premium etc, but happy to do work like switching to Netherite if we know/strongly suspect that will help.

```
0bca773bf5c2452aad4d267c3158a860: Function ' (Orchestrator)' failed with an error.
Reason: Microsoft.Azure.WebJobs.Host.FunctionInvocationException: Exception while executing function: Functions.
 ---> Microsoft.Azure.WebJobs.Script.Workers.WorkerProcessExitException: /azure-functions-host/workers/dotnet-isolated/bin/FunctionsNetHost exited with code 137 (0x89)
 ---> System.Exception
   --- End of inner exception stack trace ---
   at Microsoft.Azure.WebJobs.Script.Description.WorkerFunctionInvoker.InvokeCore(Object[] parameters, FunctionInvocationContext context) in /src/azure-functions-host/src/WebJobs.Script/Description/Workers/WorkerFunctionInvoker.cs:line 101
   at Microsoft.Azure.WebJobs.Script.Description.FunctionInvokerBase.Invoke(Object[] parameters) in /src/azure-functions-host/src/WebJobs.Script/Description/FunctionInvokerBase.cs:line 82
   at Microsoft.Azure.WebJobs.Script.Description.FunctionGenerator.Coerce[T](Task`1 src) in /src/azure-functions-host/src/WebJobs.Script/Description/FunctionGenerator.cs:line 225
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2.InvokeAsync(Object instance, Object[] arguments) in D:\a_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionInvoker.cs:line 52
   at Microsoft.Azure.WebJobs.Extensions.DurableTask.OutOfProcMiddleware.<>c__DisplayClass10_0.<b__0>d.MoveNext() in D:\a_work\1\s\src\WebJobs.Extensions.DurableTask\OutOfProcMiddleware.cs:line 130
   --- End of stack trace from previous location ---
   at Microsoft.Azure.WebJobs.Host.Executors.TriggeredFunctionExecutor`1.<>c__DisplayClass7_0.<b__0>d.MoveNext() in D:\a_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\TriggeredFunctionExecutor.cs:line 50
   --- End of stack trace from previous location ---
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.InvokeWithTimeoutAsync(IFunctionInvoker invoker, ParameterHelper parameterHelper, CancellationTokenSource timeoutTokenSource, CancellationTokenSource functionCancellationTokenSource, Boolean throwOnTimeout, TimeSpan timerInterval, IFunctionInstance instance) in D:\a_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 581
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithWatchersAsync(IFunctionInstanceEx instance, ParameterHelper parameterHelper, ILogger logger, CancellationTokenSource functionCancellationTokenSource) in D:\a_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 527
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in D:\a_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 306
   --- End of inner exception stack trace ---
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in D:\a_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 352
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.TryExecuteAsync(IFunctionInstance functionInstance, CancellationToken cancellationToken) in D:\a_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 108.
IsReplay: False. State: Failed. RuntimeStatus: Failed. HubName: . AppName: . SlotName: Production.
ExtensionVersion: 2.13.1. SequenceNumber: 310. TaskEventId: -1
```

SeanMJFowler commented 6 months ago

Hiya, there's not been anything further offline. On our side the only workaround was to reduce the RAM usage. If you're able to create your PDFs as a stream and stream them to Azure Storage that could help, i.e. don't have the entire document in memory at once. Maybe you're already doing that; if so, try reducing the buffer size (the default is 4MB, I think).
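(In case it helps, a streamed upload with a reduced transfer size looks roughly like this with Azure.Storage.Blobs; the container/blob names and the 1MB value are placeholders.)

```csharp
using System.IO;
using System.Threading.Tasks;
using Azure.Storage;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

public static class PdfUploader
{
    // Streams the PDF to Blob Storage in 1MB blocks instead of buffering the
    // whole document (or the ~4MB default block size) in memory at once.
    public static async Task UploadAsync(string connectionString, Stream pdfStream)
    {
        var blobClient = new BlobClient(connectionString, "reports", "invoice.pdf"); // placeholder names

        var options = new BlobUploadOptions
        {
            TransferOptions = new StorageTransferOptions
            {
                InitialTransferSize = 1 * 1024 * 1024,
                MaximumTransferSize = 1 * 1024 * 1024
            }
        };

        await blobClient.UploadAsync(pdfStream, options);
    }
}
```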

Chris' suggestion of the retries wouldn't be suitable for us but might work for you? That's the retry functionality built into Durable Functions.

danbasszeti commented 6 months ago

Hi, thanks Sean!

OK, that's helpful - good to know that reducing RAM usage will actually help too. The worst thing would be doing a bunch of RAM management and then still having the problem. I think we are streaming at the moment, but good tip on the buffer size; I'll definitely have a go.

We've got retries set up already for our activity functions, so that should handle this sort of thing, but we don't have any on the orchestrator as we didn't realise we could.

Thanks for getting back

danbasszeti commented 5 months ago

Can confirm RAM management got my usage down from 1.4GB to 0.8GB and now I don't see this error anymore. Obviously I should be able to use all of the RAM I like without worrying about orchestrations failing, etc.

smackodale commented 5 months ago

We are experiencing a similar issue with Durable Functions; oddly enough it's also related to PDF processing, however I think this is just a coincidence. What I am witnessing is that if an Activity causes a 137 out-of-memory issue on the same instance as the Orchestrator, both the Activity and Orchestrator are killed; however, the Orchestrator never retries (although the Activity does). Setting the following values significantly reduces the likelihood as it forces the Function to scale out, however it does this at the expense of performance:

This does not stop the issue; it still occurs, just far less frequently. Is there any news of a fix? The reliability of Durable Functions is a key reason we chose this technology stack.
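(Note: the specific settings referenced above weren't captured in this thread. For anyone landing here later, the concurrency throttles usually tuned for this live under extensions.durableTask in host.json; the values below are purely illustrative, not the poster's actual configuration.)

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "maxConcurrentActivityFunctions": 2,
      "maxConcurrentOrchestratorFunctions": 2
    }
  }
}
```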

cgillum commented 4 months ago

@smackodale sorry for the late reply, and sorry to hear that you're running into this as well. Since this appears to be a platform issue, the Azure support route might be the best place to help find a resolution.

That said, can you mention what kind of plan you're running on? For example, are you also using an Elastic Premium plan? And what is the operating system you're using?

What I am witnessing is if an Activity causes a 137 out of memory issue on the same instance as the Orchestrator both the Activity and Orchestrator are killed, however the orchestrator never retries (although the activity does).

Can you clarify what you mean about the orchestrator never retrying? Normally when there's a process crash in an activity, only the activity would retry. If the retry is successful, then a message will be sent to the orchestrator function, which then resumes normally. Are you not seeing the orchestrator function resume after the activity retries and completes successfully?

smackodale commented 4 months ago

@cgillum We have tested on all plans for Function Apps. The only difference between the results is the frequency: with the smaller plans, the lower resources increase the likelihood of the error.

What we were doing was utilising the fan-out pattern, which could trigger 100-200 activities; these activities were resource-intensive, specifically in the area of memory. We had the premium plan configured to scale out to the maximum number of instances. What we witnessed was that as resources began to be consumed, the host would begin to run out of memory. The host would then selectively kill some executions (this is where the 137 error would occur). If the execution killed was an activity, it would be retried (as expected). However, if the execution that was terminated was an orchestrator or sub-orchestrator, it would never retry and never complete the remaining activities within the orchestrator. All activities that had been created and added to the queues would complete, but the end-to-end would fail.

We could significantly reduce the likelihood of this complete failure by ensuring the orchestrators used as little memory as possible; however, it still occurs frequently enough to make the process unreliable. If the orchestrator is terminated via 137, there is no retry.
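(For readers unfamiliar with the fan-out pattern described above, it boils down to something like this in the isolated model; the names are illustrative.)

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class FanOutExample
{
    [Function(nameof(FanOutOrchestrator))]
    public static async Task<int[]> FanOutOrchestrator(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        var workItems = context.GetInput<string[]>() ?? Array.Empty<string>();

        // Fan out: schedule one memory-intensive activity per work item
        // (100-200 of them in the scenario above), then wait for all of them.
        var tasks = workItems
            .Select(item => context.CallActivityAsync<int>("ProcessItem", item)) // placeholder activity
            .ToList();

        return await Task.WhenAll(tasks);
    }
}
```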

cgillum commented 4 months ago

@smackodale thanks for the details. Does the orchestration go into a Failed state when this happens?

I had a chat with @davidmrdavid about this and I think I have a theory about why orchestrations are getting stuck. The memory issue seems to be causing a crash in the isolated worker process. Most likely, the Durable Functions extension running in the host detects this crash but misinterprets the error it sees as an application failure rather than an infrastructure failure. As a result, it marks the orchestrator as failed instead of retrying it. If this is what's happening, the fix might be simple: we need to check for a WorkerProcessExitException (which I see mentioned in one of @danbasszeti's responses) and abort the current execution rather than allowing it to be interpreted as an app failure.

I still don't have clarity on the reports of this happening with lower-than-expected memory usage. We'll need help from the Functions host team to narrow that down. One possibility is that the memory usage being reported isn't completely accurate when running in the isolated worker model. Another possibility is that the isolated worker is somehow running as a 32-bit process instead of a 64-bit process (restricting the amount of memory it's allowed to use).
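(One way to sanity-check the bitness and memory-limit theories from inside an affected app — an illustrative diagnostic, not something prescribed in this thread — is to log what the runtime itself believes it has:)

```csharp
using System;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Extensions.Logging;

public class MemoryDiagnostics
{
    private readonly ILogger<MemoryDiagnostics> _logger;

    public MemoryDiagnostics(ILogger<MemoryDiagnostics> logger) => _logger = logger;

    [Function(nameof(LogMemoryInfo))]
    public void LogMemoryInfo([ActivityTrigger] string input)
    {
        var gcInfo = GC.GetGCMemoryInfo();

        // Is64BitProcess rules the 32-bit theory in or out; TotalAvailableMemoryBytes
        // is the limit the GC thinks it has (cgroup limits included), which may be
        // well below the instance's physical 14GB.
        _logger.LogInformation(
            "64-bit: {Is64Bit}, GC available: {AvailableMB} MB, heap size: {HeapMB} MB",
            Environment.Is64BitProcess,
            gcInfo.TotalAvailableMemoryBytes / (1024 * 1024),
            gcInfo.HeapSizeBytes / (1024 * 1024));
    }
}
```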

smackodale commented 4 months ago

@cgillum They do go into a failed state when viewing the executions within Table storage.

goransh commented 2 months ago

Any updates on this?

We are having the same issue. Exit code 137 and a max of 7.5GB RAM per instance in Live Metrics on EP3 with .NET 8.

We are triggering many long-running orchestrations that are memory-intensive, and most of them run for up to 12 hours (if they do not crash before completing). Right now we are considering implementing our own retry logic to mitigate this, but it would of course be preferable if the built-in retry mechanism could handle this instead.

If not, are there any known workarounds?

cgillum commented 2 months ago

No updates on the crash due to memory pressure, but we do have an engineer working on the problem where the orchestration transitions to a Failed status because of the crash. The intent is that the orchestration continues running to completion after the crash.

SeanMJFowler commented 2 months ago

Apart from reducing memory usage, the other option is switching to a non-EP plan. We're trying P3V3, which has 32GB RAM (and 8 vCPUs) and that allows us to enable caching and still not crash.

The cost per instance is similar to EP3 and you can set it to up to 30 instances. However it doesn't auto-scale with Azure Functions (or at least I haven't got it to) and it can take half a day or more to scale the servers up to the count you asked for (e.g. we change it from 1 server to 20 at 4am and often by midday we still don't have all of them).

It's not a solution but depending on your needs it might keep you going. Also, having more vCPUs can be useful.