Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions
MIT License

Exit Code 137 thrown, orchestration fails #2788

Open SeanMJFowler opened 2 months ago

SeanMJFowler commented 2 months ago

Description

We're running an orchestration on 20 "Always Ready" EP3 instances where a large amount of data is cached in memory, so RAM usage is high. However, I observed the issue live and no instance/server had more than 7.5 GB RAM usage at the time (according to App Insights Live Metrics).

Expected behaviour

The orchestration succeeds (even if some tasks have to re-run).

Actual behaviour

The orchestration fails with an exit code of 137.

[Screenshot: Durable 137 error]

Relevant source code snippets

None available

Known workarounds

We're going to try caching less data, at the expense of performance. However this will at best be a mitigation, not a fix.

App Details

We're using C# and have just upgraded to .NET 8 from .NET 6. The previous version didn't have this issue.

We cache large amounts of data for performance reasons, using Microsoft.Extensions.Caching.Memory.IMemoryCache. The cached data is static, with no expiration. We load the data on demand and then cache it, rather than pre-loading the cache.
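
For illustration, the load-on-demand pattern looks roughly like the sketch below (the type and loader names are hypothetical, not our actual code):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;

public class ReferenceDataProvider
{
    private readonly IMemoryCache _cache;

    public ReferenceDataProvider(IMemoryCache cache) => _cache = cache;

    public async Task<ReferenceData> GetAsync(string key)
    {
        // Load on demand: the first caller pays the load cost, later callers hit the cache.
        // No expiration is set, so the entry stays resident for the lifetime of the worker.
        var data = await _cache.GetOrCreateAsync(key, _ => LoadReferenceDataAsync(key));
        return data ?? throw new InvalidOperationException($"Failed to load '{key}'.");
    }

    // Hypothetical loader standing in for the real data access.
    private Task<ReferenceData> LoadReferenceDataAsync(string key) =>
        Task.FromResult(new ReferenceData());
}

public class ReferenceData { /* large static dataset */ }
```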

On .NET 6, instances/servers would occasionally disappear with high RAM usage and reappear with low RAM usage, presumably having restarted. The orchestrations didn't fail because of it; they just took a little longer to complete successfully.

If deployed to Azure

SeanMJFowler commented 2 months ago

A little extra detail. We've done a test where we fully populated the cache (rather than loading on demand), and when the server RAM usage got to around 6.5-7 GB the orchestration failed with the 137 error. This was on a single EP3 server (Linux), so it should have 14 GB of RAM.

We did a second test caching less data (in fact we're caching the same data, but as Parquet files rather than the unpacked data). That got up to 4.5 GB RAM usage and completed successfully. However, unpacking the Parquet data every time we need it isn't free, and the low RAM ceiling could still be hit during normal usage and cause random orchestration failures. Basically, this seems a lot less durable on .NET 8 than it was on .NET 6.
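
For anyone curious about the shape of that trade-off, the sketch below caches a compact serialized form and pays an unpack cost on every read. It uses gzip-compressed JSON purely as a stand-in for our Parquet handling (which isn't shown here): the cache holds small byte arrays, and each read deserializes.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text.Json;
using Microsoft.Extensions.Caching.Memory;

public static class CompactCacheExtensions
{
    // Cache a compressed, serialized copy instead of the fully unpacked object graph.
    public static void SetCompact<T>(this IMemoryCache cache, string key, T value)
    {
        using var buffer = new MemoryStream();
        using (var gzip = new GZipStream(buffer, CompressionLevel.Fastest, leaveOpen: true))
        {
            JsonSerializer.Serialize(gzip, value);
        }
        cache.Set(key, buffer.ToArray()); // far smaller resident footprint than the live objects
    }

    // Every read pays a decompression + deserialization cost, mirroring the
    // "unpack the Parquet data every time we need it" cost described above.
    public static T? GetCompact<T>(this IMemoryCache cache, string key)
    {
        if (!cache.TryGetValue(key, out byte[]? bytes) || bytes is null)
            return default;

        using var gzip = new GZipStream(new MemoryStream(bytes), CompressionMode.Decompress);
        return JsonSerializer.Deserialize<T>(gzip);
    }
}
```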

SeanMJFowler commented 2 months ago

@nytian can I challenge the 'Performance' label please? Yes, we're reducing our caching to work around the bugs, which will impact performance. However, I think there are two genuine bugs here, and both affect stability/durability:

  1. The memory ceiling seems to be 7 GB on a 14 GB Linux server.
  2. When the memory ceiling is hit, the orchestration fails, when it should carry on and allow another server to pick up the failed task. That is, a server-level failure shouldn't cause the orchestration to fail, and it didn't on .NET 6.

nytian commented 2 months ago

Thanks for providing the details, @SeanMJFowler. I will take a first look and investigate.

nytian commented 2 months ago

@SeanMJFowler I am contacting the support engineer on your open case so that we can open a sev3 customer-reported incident with the Functions host team, considering that the error message described comes from the host and this is a regression between .NET 6 and .NET 8.

cgillum commented 2 months ago

Looks like this case was assigned to me. :)

A couple things to clarify up front to avoid confusion:

We may need to experiment a bit to understand the root cause. A few questions/suggestions:

Thanks for your patience on this.

Side note: I'm also curious if this could be specific to the Azure Storage SDK, in which case another thing I'd like us to try is using a newer Azure Storage SDK version. However, I don't think we currently have a release with the newer SDK that is compatible with the .NET Isolated worker.

nytian commented 2 months ago

@cgillum Thanks, Chris, for looking at this! I thought the case would go to the host team.

Side note: I'm also curious if this could be specific to the Azure Storage SDK, in which case another thing I'd like us to try is using a newer Azure Storage SDK version. However, I don't think we currently have a release with the newer SDK that is compatible with the .NET Isolated worker.

I wonder if you are talking about the Azure Storage Track 2 SDK. If that's the case, we do have a .NET Isolated version that uses the Azure Storage Track 2 SDK, which is v1.2.0-rc.1.

SeanMJFowler commented 2 months ago

Hi Chris, I've seen your many contributions over the years so I'm very happy to have you looking into this issue.

I can see why you're tying this issue together with the Data Lake timeouts, as we have large numbers of those. However, that's a separate issue and we have another ticket open with MS about it (we also got these errors on .NET 6). These timeouts don't cause orchestrations to fail because we have built-in retries. They do cause performance issues; ironically, caching is our (partial) mitigation for that issue, while reducing caching is our mitigation for this one. For now we're taking the performance hit, as stability is more important.

From my research (e.g. this: https://specificlanguages.com/posts/2022-07/14-exit-code-137/), exit code 137 on Linux means a process was killed by SIGKILL (137 = 128 + 9, where 9 is the SIGKILL signal number), and this generally happens when it runs out of RAM. This fits the behaviour we're seeing: everything's fine until RAM usage gets to 6.5-7 GB, and then the orchestration consistently fails. Now that we're caching far less we don't hit that level of RAM usage and we don't get the error; if it were the storage issue we'd still get it.

I agree that error 137 is a process crash, but one issue is that this shouldn't cause the orchestration to fail, yet it does. The error isn't thrown in our code, so we can't catch it and deal with it; although it's a process crash, I think it's in the host process.

In response to your questions:

1) Yes the storage is in the same region as the function app. We believe the Data Lake timeouts are due to contention at the Data Lake level, i.e. not specific to our subscription

2) As this is a Durable Functions app, we could only use the in-process model on .NET 6. The RAM ceiling then was the expected 14 GB, and when a server crashed through lack of RAM the orchestration carried on and still completed successfully

3) This is our only app which uses Durable. We don't have any other apps which stress normal Azure Functions anywhere near this much. I agree that the low RAM ceiling and the fact that hitting it causes orchestrations to fail are likely two separate issues; unfortunately they combine to cause bigger problems than either would alone

4) It's not easy for us to switch to P2V3. In the Azure Portal only EP1-3 are available for our App Service Plan, so I'd probably have to get the DevOps team involved to remove and recreate the App Service Plan and Function App. However, if P2V3 would work better for us than EP3 it could be worth a go (more than 14 GB of RAM would be useful)

5) We'd like to switch to Netherite but it's a little way down the roadmap

I think if you were to generate large amounts of dummy data in-memory (i.e. take storage out of the equation) then you'd likely see the same behaviour we're experiencing, assuming you run on Linux EP3.
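
If it helps, here's a rough sketch of the kind of repro activity I mean, for the .NET Isolated model (the function name and sizes are made up):

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Azure.Functions.Worker;

public static class MemoryPressureRepro
{
    // Hold references so the GC can't reclaim the dummy data between invocations.
    private static readonly List<byte[]> Retained = new();

    [Function(nameof(FillMemoryActivity))]
    public static string FillMemoryActivity([ActivityTrigger] int megabytes)
    {
        // Allocate and touch the requested amount of memory to simulate a large in-memory cache.
        for (var i = 0; i < megabytes; i++)
        {
            var chunk = new byte[1024 * 1024];
            Array.Fill(chunk, (byte)1);
            Retained.Add(chunk);
        }
        return $"Retained roughly {Retained.Count} MB of dummy data";
    }
}
```

Calling something like this repeatedly from an orchestrator on a Linux EP3 instance should show whether failures start well before the nominal 14 GB.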

cgillum commented 2 months ago

@SeanMJFowler thanks for the detailed response! It looks like I misunderstood the issue - so it's not a crash that we're concerned about, but rather that the orchestration actually goes into a "Failed" state, correct? That's really odd, because system failures should never cause orchestrations to fail.

Let me take a closer look at the instance ID you shared to see if I can get more information about what happened.

cgillum commented 2 months ago

Quick update: I did a quick check on orchestration instance c38c0921c282429695a5efc8fa8913c6 and here are a couple things I noticed:

  1. Several activities fail with internal errors unrelated to the actual activity execution. These appear to be related to the .NET Isolated worker host crashing. Error messages include "WorkerProcessExitException: dotnet exited with code 137 (0x89)" (mentioned above) and "System.InvalidOperationException: Did not find any initialized language workers".
  2. The functions host itself is not crashing, which confirms that the orchestration failures are unrelated to storage account latencies.

I don't think we'll be able to answer the question about why the worker process is crashing (we'll need help from another team for that) but the fact that it's being treated as an activity function failure (something I didn't realize until now) appears to be a bug in the Durable Functions extension. Normally, these kinds of transient host issues should be handled, and activity requests should be aborted and automatically retried without the orchestration's knowledge. I'm wondering if there's an exception-handling case that we're missing.

@SeanMJFowler apologies if I missed this, but are you able to work around this by adding retries to your activity function calls? It seems to me that should work as a workaround in this case until the other two issues (the worker crash and the incorrect exception handling) are resolved.
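
For reference, activity retries in the .NET Isolated model look roughly like this (the activity name, input, and intervals below are only illustrative):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class OrchestratorWithRetries
{
    [Function(nameof(RunOrchestrator))]
    public static async Task<string> RunOrchestrator(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        // Retry transient failures (including worker crashes surfaced as activity failures)
        // instead of letting them fail the orchestration outright.
        var options = TaskOptions.FromRetryPolicy(new RetryPolicy(
            maxNumberOfAttempts: 3,
            firstRetryInterval: TimeSpan.FromSeconds(30),
            backoffCoefficient: 2.0));

        return await context.CallActivityAsync<string>("ProcessChunk", "some-input", options);
    }
}
```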

SeanMJFowler commented 2 months ago

Thanks Chris, that seems like the right track.

The fact that the worker crashes when it runs out of memory is actually expected behaviour for me as it’s what happened with .Net 6, and restarting is at least one way to clear down the RAM so it doesn’t keep failing. With isolated mode perhaps it could be done more gracefully and quickly (e.g. don’t reset the server and the host, just kill the worker) but the old way does still work.

What’s not expected is the low RAM ceiling and the orchestration failure, you’re dealing with the latter but presumably the other team should deal with the former? Will you be getting them involved or should we do that? If the underlying issues are a little more complicated that’s fine, you know this stuff a lot better than I do.

Adding retries to our activity functions wouldn't work too well for us, at least if that means they go to the end of the queue. The queue can contain a few hours'-worth of work, so a retry could mean hours of delay in an orchestration completing. However, I haven't tried the retries to see how they work; if the tasks go to the top of the queue then that could work for us.

cgillum commented 2 months ago

What’s not expected is the low RAM ceiling and the orchestration failure, you’re dealing with the latter but presumably the other team should deal with the former? Will you be getting them involved or should we do that?

I can take care of getting the right folks involved for the memory-cap question. That can be done through the support ticket you already opened. I'll work on that.

Adding retries to our activity functions wouldn’t work too well for us, at least if that means they go to the end of the queue.

Are you referring to the internal queue in the Durable Functions task hub? If so, yes, activity retries will effectively go to the back of the internally managed work item queue.

SeanMJFowler commented 2 months ago

Thanks Chris.

Yes, for the retries that's what I meant. If the retry might have to wait a significant amount of time then it's not suitable for us; we just need to avoid those failures occurring (which we've done by caching far less).

danbasszeti commented 1 month ago

Hi,

I think I have exactly the same issue, on the Consumption plan and with the same history (upgrade to .NET 8). It seems to happen after I start creating PDFs and saving them in Azure Storage, and my memory use is pretty close to the 1.5 GB limit from what I can tell. As before, I always used to see some activity function errors, but the retries took care of them so they weren't a problem. This is worse for us, as it terminates our orchestrator prematurely.

Any ideas for fixes (wondering if chat continued offline)? I'm not in a position to start putting the apps onto Elastic Premium etc, but happy to do work like switching to Netherite if we know/strongly suspect that will help.

```
0bca773bf5c2452aad4d267c3158a860: Function ' (Orchestrator)' failed with an error. Reason: Microsoft.Azure.WebJobs.Host.FunctionInvocationException: Exception while executing function: Functions.
 ---> Microsoft.Azure.WebJobs.Script.Workers.WorkerProcessExitException: /azure-functions-host/workers/dotnet-isolated/bin/FunctionsNetHost exited with code 137 (0x89)
 ---> System.Exception
   --- End of inner exception stack trace ---
   at Microsoft.Azure.WebJobs.Script.Description.WorkerFunctionInvoker.InvokeCore(Object[] parameters, FunctionInvocationContext context) in /src/azure-functions-host/src/WebJobs.Script/Description/Workers/WorkerFunctionInvoker.cs:line 101
   at Microsoft.Azure.WebJobs.Script.Description.FunctionInvokerBase.Invoke(Object[] parameters) in /src/azure-functions-host/src/WebJobs.Script/Description/FunctionInvokerBase.cs:line 82
   at Microsoft.Azure.WebJobs.Script.Description.FunctionGenerator.Coerce[T](Task`1 src) in /src/azure-functions-host/src/WebJobs.Script/Description/FunctionGenerator.cs:line 225
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2.InvokeAsync(Object instance, Object[] arguments) in D:\a_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionInvoker.cs:line 52
   at Microsoft.Azure.WebJobs.Extensions.DurableTask.OutOfProcMiddleware.<>c__DisplayClass10_0.<b__0>d.MoveNext() in D:\a_work\1\s\src\WebJobs.Extensions.DurableTask\OutOfProcMiddleware.cs:line 130
   --- End of stack trace from previous location ---
   at Microsoft.Azure.WebJobs.Host.Executors.TriggeredFunctionExecutor`1.<>c__DisplayClass7_0.<b__0>d.MoveNext() in D:\a_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\TriggeredFunctionExecutor.cs:line 50
   --- End of stack trace from previous location ---
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.InvokeWithTimeoutAsync(IFunctionInvoker invoker, ParameterHelper parameterHelper, CancellationTokenSource timeoutTokenSource, CancellationTokenSource functionCancellationTokenSource, Boolean throwOnTimeout, TimeSpan timerInterval, IFunctionInstance instance) in D:\a_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 581
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithWatchersAsync(IFunctionInstanceEx instance, ParameterHelper parameterHelper, ILogger logger, CancellationTokenSource functionCancellationTokenSource) in D:\a_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 527
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in D:\a_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 306
   --- End of inner exception stack trace ---
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in D:\a_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 352
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.TryExecuteAsync(IFunctionInstance functionInstance, CancellationToken cancellationToken) in D:\a_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 108.
IsReplay: False. State: Failed. RuntimeStatus: Failed. HubName: . AppName: . SlotName: Production.
ExtensionVersion: 2.13.1. SequenceNumber: 310. TaskEventId: -1
```

SeanMJFowler commented 1 month ago

Hiya, there's not been anything further offline. On our side the only workaround was to reduce the RAM usage. If you're able to create your PDFs as a stream and stream them to Azure Storage, that could help, i.e. don't have the entire document in memory at once. Maybe you're already doing that; if so, try reducing the buffer size (the default is 4MB I think).
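
As a rough sketch of what I mean, using Azure.Storage.Blobs (the 1 MB figure is just an example; tune it for your workload):

```csharp
using System.IO;
using System.Threading.Tasks;
using Azure.Storage;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

public static class PdfUploader
{
    // Stream the document to storage rather than materializing it as a byte[] first,
    // and shrink the transfer buffers so less of it sits in memory at once.
    public static Task UploadAsync(BlobClient blob, Stream pdfStream)
    {
        var options = new BlobUploadOptions
        {
            TransferOptions = new StorageTransferOptions
            {
                InitialTransferSize = 1 * 1024 * 1024, // 1 MB, smaller than the ~4 MB default mentioned above
                MaximumTransferSize = 1 * 1024 * 1024
            }
        };

        return blob.UploadAsync(pdfStream, options);
    }
}
```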

Chris' suggestion of the retries wouldn't be suitable for us but might work for you? That's the retry functionality built into Durable Functions.

danbasszeti commented 1 month ago

Hi, thanks Sean!

OK, that's helpful - good to know that reducing RAM usage will actually help. The worst thing would be doing a bunch of RAM management and then still having the problem. I think we are streaming atm, but good tip on the buffer size, I'll definitely have a go.

We've got retries set up already for our activity functions, so that should handle this sort of thing, but we don't have any on the orchestrator as we didn't realise we could.

Thanks for getting back

danbasszeti commented 1 month ago

Can confirm RAM management got my usage down from 1.4 GB to 0.8 GB and now I don't see this error anymore. Obviously I should be able to use all the RAM I like without worrying about orchestrations failing, etc.

smackodale commented 2 weeks ago

We are experiencing a similar issue with Durable Functions; oddly enough it's also related to PDF processing, though I think that's just a coincidence. What I am witnessing is that if an Activity causes a 137 out-of-memory issue on the same instance as the Orchestrator, both the Activity and the Orchestrator are killed; however, the Orchestrator never retries (although the Activity does). Setting the following values significantly reduces the likelihood, as it forces the Function to scale out, but it does this at the expense of performance:

This does not stop the issue; it still occurs, just far less frequently. Is there any news of a fix? The reliability of Durable Functions is a key reason we chose this technology stack.