cgillum opened this issue 6 years ago
This will likely impact https://github.com/Azure/azure-functions-durable-extension/issues/368.
I think I ran into the same issue. I have an orchestrator that calls several sub-orchestrations, which all fire a lot of activity functions (by a lot I mean thousands of them). It runs well for about 30 minutes until I get an OutOfMemoryException. I profiled the code and this is what happens:
Any ideas how to resolve it?
Btw, this is on Functions v1.0.13.
Any news on this? I'm not receiving OutOfMemoryExceptions; however, in local development the environment hits 1 GB of memory and just hangs. I'm following a pattern almost identical to https://docs.microsoft.com/en-us/azure/azure-functions/durable-functions-cloud-backup#the-cloud-backup-orchestration-visual-studio-code-and-azure-portal-sample-code
I've noticed this too: in local debug in VS 2017 it won't let you use more than 1 GB of RAM. Is there a setting somewhere I'm missing to change this value to at least 1.5 GB, to reflect the small instances on the Consumption Plan?
@TheFairey the question is, why does it even need to consume 1GB RAM?
@aaronhudon Fair point, although in my app I was doing some pretty heavy image manipulation, and with 5 or so concurrent activity functions doing the same thing it frequently hit out-of-memory errors at the 1 GB limit. Temporarily I had to drop to a max of 3 concurrent activity functions to keep memory usage under 1 GB.
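Capping activity concurrency like this is done with the maxConcurrentActivityFunctions setting in host.json. A minimal sketch, assuming the Functions v2 host.json layout (on v1 the durableTask section sits at the root of host.json) and that 3 is the desired cap:

    {
        "version": "2.0",
        "extensions": {
            "durableTask": {
                "maxConcurrentActivityFunctions": 3
            }
        }
    }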
Is this issue fixed in release 1.6.2? Assuming not since this is still open, if so, is there any update on this?
I also had a similar problem.
Please see this article: https://docs.microsoft.com/en-us/azure/azure-functions/durable-functions-perf-and-scale#orchestrator-function-replay
The latest Durable Functions orchestrator ("Microsoft.Azure.WebJobs.Extensions.DurableTask" Version="1.6.2") seems to keep the entire history in memory. With the following settings applied, the old history was collected by the GC:
{
    "version": "2.0",
    "extensions": {
        "durableTask": {
            "extendedSessionsEnabled": true,
            "extendedSessionIdleTimeoutInSeconds": 30
        }
    }
}
This is for Azure Functions v2. The appropriate value of extendedSessionIdleTimeoutInSeconds is environment dependent: if it is short, CPU time spent on GC seems to increase; if it is long, memory usage increases. In my environment, this setting kept the memory growth under control.
I hope this helps. ... Is the default setting reversed? Hmm..
I have the same problem on 1.8.1. In a sub-orchestration I'm trying to fan out to 10 activities fetching around 2.5 MB each. This is then done 5 times in sequence (I can't send more than 10 simultaneous requests to the API I'm fetching from). After about 3 iterations (10 + 10 + 10 activities) I hit 2 GB of memory in the debugger and the orchestration just stops without any exceptions. I tried @arichika's solution, setting the extendedSession properties in host.json, but it did not solve the problem in my case.
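For context, a minimal sketch of the batched fan-out pattern described above, using the Durable Functions 1.x API seen elsewhere in this thread; the FetchPage activity and all names are illustrative, not the poster's actual code:

    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;
    using Microsoft.Azure.WebJobs;

    public static class BatchedFanOut
    {
        [FunctionName("FetchInBatches")]
        public static async Task<List<string>> FetchInBatches(
            [OrchestrationTrigger] DurableOrchestrationContext context)
        {
            var results = new List<string>();
            for (int batch = 0; batch < 5; batch++)
            {
                // Fan out to at most 10 concurrent fetch activities per batch,
                // then wait for the whole batch before starting the next one.
                IEnumerable<Task<string>> fetches = Enumerable.Range(0, 10)
                    .Select(i => context.CallActivityAsync<string>(
                        "FetchPage", batch * 10 + i));
                results.AddRange(await Task.WhenAll(fetches));
            }
            return results;
        }
    }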
This issue seems to be made worse by the number of static fields you retain on your classes. For instance, if you're following the static HttpClient pattern, the amount of memory that isn't released grows on every subsequent run.
I was able to reduce the retained memory from 1 GB to 700 MB by changing the way I build my HttpClients (and other usually-static clients, like DocumentClient). In fact, subsequent runs of the same durable task pipeline didn't increase the leftover memory (also using the JSON settings above from @arichika).
It's important to be able to use static HttpClients in .NET, so removing them isn't ideal for high-traffic projects. But as long as you're careful with how you build those clients, it might help.
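As a sketch of what "changing the way I build my HttpClients" can look like, here is one possible approach, assuming Azure Functions v2+ with the Microsoft.Azure.Functions.Extensions and Microsoft.Extensions.Http packages; this is not the commenter's actual change, just one way to avoid rooting a client in a static field for the lifetime of the process (all names are illustrative):

    using System.Net.Http;
    using System.Threading.Tasks;
    using Microsoft.Azure.Functions.Extensions.DependencyInjection;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Extensions.DependencyInjection;

    [assembly: FunctionsStartup(typeof(Company.FunctionApp.Startup))]

    namespace Company.FunctionApp
    {
        public class Startup : FunctionsStartup
        {
            public override void Configure(IFunctionsHostBuilder builder)
            {
                // Registers IHttpClientFactory; handlers are pooled and reused,
                // so clients are cheap to create per invocation.
                builder.Services.AddHttpClient();
            }
        }

        public class DownloadActivity
        {
            private readonly IHttpClientFactory _httpClientFactory;

            public DownloadActivity(IHttpClientFactory httpClientFactory)
            {
                _httpClientFactory = httpClientFactory;
            }

            [FunctionName(nameof(DownloadAsync))]
            public Task<string> DownloadAsync([ActivityTrigger] string url)
            {
                // No static field rooted for the process lifetime; the factory
                // reuses underlying handlers to avoid socket exhaustion.
                HttpClient client = _httpClientFactory.CreateClient();
                return client.GetStringAsync(url);
            }
        }
    }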
I'm using 1.6.2 currently until they fix the outputs in 1.8.*+.
Hi @cgillum, I was able to reproduce the memory leak even when fanning out to activity functions. Please see the code sample below:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;

namespace Company.FunctionApp
{
    public class DurableFunctionsOrchestration
    {
        [FunctionName(nameof(StartHttp))]
        public async Task<HttpResponseMessage> StartHttp(
            [HttpTrigger(AuthorizationLevel.Anonymous, "get", "post")]
            HttpRequestMessage request,
            [OrchestrationClient] DurableOrchestrationClient client)
        {
            await client.StartNewAsync(nameof(MigrateToSdmOrchestrator), null);
            return new HttpResponseMessage(HttpStatusCode.OK);
        }

        [FunctionName(nameof(MigrateToSdmOrchestrator))]
        public async Task MigrateToSdmOrchestrator([OrchestrationTrigger] DurableOrchestrationContext context)
        {
            var batches = await context.CallActivityAsync<IReadOnlyCollection<VolumesBatch>>(
                nameof(GetUploadedBatchesActivity),
                null);

            var uniqueBatchIds = batches
                .Select(m => m.OriginalBatch)
                .GroupBy(b => (b.Country, b.VeType, b.VoType))
                .Select(g => g.OrderBy(b => b.BatchId).First().BatchId)
                .ToList();

            var subBatches = batches
                .Where(m => uniqueBatchIds.Contains(m.OriginalBatch.BatchId))
                .OrderBy(b => b.OriginalBatch.BatchId)
                .ThenBy(b => b.Year)
                .ThenBy(b => b.Month);

            // Fan out: WriteSubBatchToSdmActivity is an activity function,
            // so it is scheduled with CallActivityAsync.
            var routines = subBatches
                .Select(batch => context.CallActivityAsync(
                    nameof(WriteSubBatchToSdmActivity),
                    batch));

            await Task.WhenAll(routines);
        }

        [FunctionName(nameof(WriteSubBatchToSdmActivity))]
        public async Task WriteSubBatchToSdmActivity([ActivityTrigger] VolumesBatch batch)
        {
            await Console.Out.WriteLineAsync($"Processing batch: {batch.Id}");
            await Task.Delay(3000);
        }

        [FunctionName(nameof(GetUploadedBatchesActivity))]
        public Task<IReadOnlyCollection<VolumesBatch>> GetUploadedBatchesActivity(
            [ActivityTrigger] DurableActivityContext context)
        {
            return Task.FromResult<IReadOnlyCollection<VolumesBatch>>(Enumerable.Range(0, 1000)
                .Select(CreateBatch)
                .ToList()
                .AsReadOnly());
        }

        private VolumesBatch CreateBatch(int index)
        {
            return new VolumesBatch(
                Guid.NewGuid(),
                index,
                index,
                $"Region {index}",
                String.Empty,
                DateTime.UtcNow,
                DateTime.UtcNow,
                DateTime.UtcNow,
                new HeritageBatch
                {
                    Country = $"Country {index}",
                    Created = DateTime.UtcNow,
                    BatchId = index / 2,
                    Ready = DateTime.UtcNow,
                    EndDate = DateTime.UtcNow,
                    StartDate = DateTime.UtcNow,
                    VeType = $"Ve Type {index}",
                    VoType = $"Vo Type {index}"
                });
        }
    }
}
host.json file:

"extensions": {
    "durableTask": {
        "logReplayEvents": false,
        "controlQueueBatchSize": 4,
        "traceInputsAndOutputs": true,
        "maxConcurrentActivityFunctions": 2,
        "maxConcurrentOrchestratorFunctions": 2,
        "maxQueuePollingInterval": "00:00:10"
    }
}
It seems that the list of HistoryEvent objects is retained even after the orchestration has completed.
Just in case, I've attached two memory snapshots to be able to do diff https://www.dropbox.com/s/p1idbb7lqw54062/Snapshots.zip?dl=0
@spzSource What did you use to produce that screenshot (a process dump)? I have a memory issue with a durable function at the moment, and it would be nice if I could use that tool too.
Hi @cgillum,
I just hit a similar issue while running an orchestration locally; the stack trace is below. It also results in an OutOfMemoryException. I'm not sure whether this is specific to local testing; I'll try it on Azure tomorrow. Host version is 2.0.12763.0.
"Microsoft.Azure.WebJobs.Extensions.DurableTask" Version="1.8.2" "Microsoft.NET.Sdk.Functions" Version="1.0.29"
Stack trace is:
Function 'OrchestratorRebuildMaterializedProductsForStore (Orchestrator)' failed with an error. Reason: Microsoft.Azure.WebJobs.FunctionFailedException: The activity function 'ActivityRebuildMaterializedProductsForStore' failed: "Failed to deserialize exception from TaskActivity: Unhandled exception while executing task: System.AggregateException: An error occurred while writing to logger(s). (Exception of type 'System.OutOfMemoryException' was thrown.) ---> System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at System.Text.StringBuilder.ExpandByABlock(Int32 minBlockCharCount)
at System.Text.StringBuilder.Append(Char value, Int32 repeatCount)
at System.Text.StringBuilder.Append(Char value)
at System.Text.StringBuilder.AppendFormatHelper(IFormatProvider provider, String format, ParamsArray args)
at System.String.FormatHelper(IFormatProvider provider, String format, ParamsArray args)
at System.String.Format(IFormatProvider provider, String format, Object[] args)
at Microsoft.Extensions.Logging.Internal.LogValuesFormatter.Format(Object[] values)
at Microsoft.Extensions.Logging.Internal.FormattedLogValues.ToString()
at Microsoft.Extensions.Logging.LoggerExtensions.MessageFormatter(Object state, Exception error)
at Microsoft.Azure.WebJobs.Script.WebHost.Diagnostics.SystemLogger.Log[TState](LogLevel logLevel, EventId eventId, TState state, Exception exception, Func`3 formatter) in C:\azure-webjobs-sdk-script\src\WebJobs.Script.WebHost\Diagnostics\SystemLogger.cs:line 78
at Microsoft.Extensions.Logging.Logger.Log[TState](LogLevel logLevel, EventId eventId, TState state, Exception exception, Func`3 formatter)
--- End of inner exception stack trace ---
at Microsoft.Extensions.Logging.Logger.Log[TState](LogLevel logLevel, EventId eventId, TState state, Exception exception, Func`3 formatter)
at Microsoft.Extensions.Logging.LoggerExtensions.Log(ILogger logger, LogLevel logLevel, EventId eventId, Exception exception, String message, Object[] args)
at Microsoft.Azure.WebJobs.Extensions.DurableTask.EndToEndTraceHelper.FunctionFailed(String hubName, String functionName, String instanceId, String reason, FunctionType functionType, Boolean isReplay) in E:\functions\azure-functions-durable-extension\src\WebJobs.Extensions.DurableTask\EndToEndTraceHelper.cs:line 307
at Microsoft.Azure.WebJobs.Extensions.DurableTask.TaskActivityShim.RunAsync(TaskContext context, String rawInput) in E:\functions\azure-functions-durable-extension\src\WebJobs.Extensions.DurableTask\Listener\TaskActivityShim.cs:line 64
at DurableTask.Core.TaskActivityDispatcher.<>c__DisplayClass12_1.<<OnProcessWorkItemAsync>b__1>d.MoveNext() in C:\source\durabletask\src\DurableTask.Core\TaskActivityDispatcher.cs:line 135
---> (Inner Exception #0) System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at System.Text.StringBuilder.ExpandByABlock(Int32 minBlockCharCount)
at System.Text.StringBuilder.Append(Char value, Int32 repeatCount)
at System.Text.StringBuilder.Append(Char value)
at System.Text.StringBuilder.AppendFormatHelper(IFormatProvider provider, String format, ParamsArray args)
at System.String.FormatHelper(IFormatProvider provider, String format, ParamsArray args)
at System.String.Format(IFormatProvider provider, String format, Object[] args)
at Microsoft.Extensions.Logging.Internal.LogValuesFormatter.Format(Object[] values)
at Microsoft.Extensions.Logging.Internal.FormattedLogValues.ToString()
at Microsoft.Extensions.Logging.LoggerExtensions.MessageFormatter(Object state, Exception error)
at Microsoft.Azure.WebJobs.Script.WebHost.Diagnostics.SystemLogger.Log[TState](LogLevel logLevel, EventId eventId, TState state, Exception exception, Func`3 formatter) in C:\azure-webjobs-sdk-script\src\WebJobs.Script.WebHost\Diagnostics\SystemLogger.cs:line 78
at Microsoft.Extensions.Logging.Logger.Log[TState](LogLevel logLevel, EventId eventId, TState state, Exception exception, Func`3 formatter)<---
Can I help you somehow investigating this?
Before I found this issue I came to the same conclusion: DurableOrchestrationContext.History (inherited from DurableCommonContext) keeps its list of HistoryEvents around long after orchestrations execute.
Turning off extended sessions fixed it for now, but at the cost of throughput.
We will prioritize investigating this in the near future. I cannot guarantee that we will be able to fix it, because it seems to be an inherent limitation in how tasks are garbage collected when they aren't cancelled, and cancelling the tasks can have some unintended consequences.
Any news on this? For a framework of this type to have such a potential restriction (without a clear and easy workaround) seems problematic.
Better to give up on the steaming pile of trash that is Azure Functions. Seriously, how can such a major bug in such a super common scenario still be open after 3 years (with as far as I know no acknowledgement of it in the documentation)...
OK, so I have taken a closer look at Task.s_currentActiveTasks to see how we could resolve this issue. Looking at the Stack Overflow comments for this question, it appears that s_currentActiveTasks is only used by the debugger, which means this wouldn't impact production applications.
I am going to try reproing this both locally and in production to see in what scenarios a memory leak is present.
OK, so I ran this with the VS 2019 Performance Profiler (memory only) on a debug build of almost exactly the code in Chris's sample, on Functions v3 and v2.5.1 of the extension. The only caveats are that I had to translate the Durable 1.x code to Durable 2.x and remove the parallelSubOrchestrations parameter from ContinueAsNew(), because I couldn't identify where it was being created or consumed.
After 2 hours, the app has essentially oscillated between ~160 MB and ~210 MB, mostly hovering around 175 MB during the two-minute sleep window.
This seems to confirm that without a debugger firmly attached, we don't seem to experience a memory leak (or if we do, it is much more minor than the one originally reported).
I will try attaching a debugger to see if memory usage starts to grow.
Here was my app without a debugger attached running for two hours:
Here was my app with a debugger attached:
Note that attaching the debugger substantially slowed down the processing of sub-orchestrations, so the steady memory increase we saw occurred over far fewer ContinueAsNew iterations.
I think this very clearly shows that the memory leak in this sample only really happens with the debugger attached, as expected with s_currentActiveTasks being the culprit. I will rename the issue accordingly to accurately reflect the problem.
I think that in order to fix this, we would need to work with the Roslyn team. In the GitHub issue associated with my previously linked stack overflow question, Stephen Toub commented "Regardless of whether the implementation or debugger is doing the wrong thing here around explicitly removing the task from the debug-only s_currentActiveTasks table, we should change s_currentActiveTasks to be a ConditionalWeakTable or similar rather than a Dictionary, so that it's not rooting tasks unnecessarily." It appears that this value is still a Dictionary when I looked at the current source code, so we can follow up with that team to see if making that change would fix this issue.
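To illustrate the difference Stephen Toub is describing, here is a small conceptual sketch (illustrative only, not runtime code): a static Dictionary roots every task it holds until something explicitly removes it, whereas a ConditionalWeakTable lets the entry go once nothing else references the task:

    using System.Collections.Generic;
    using System.Runtime.CompilerServices;
    using System.Threading.Tasks;

    static class TaskTrackingSketch
    {
        // Strongly roots tasks: they stay reachable for the lifetime of the table
        // unless someone remembers to remove them.
        static readonly Dictionary<int, Task> Strong = new Dictionary<int, Task>();

        // Holds keys weakly: an entry disappears once the task is otherwise unreachable.
        static readonly ConditionalWeakTable<Task, object> Weak =
            new ConditionalWeakTable<Task, object>();

        public static void Track(Task task)
        {
            Strong[task.Id] = task;       // leaks if never removed
            Weak.Add(task, new object()); // does not extend the task's lifetime
        }
    }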
In the meantime, I'm going to remove this from the high priority milestone, because a memory leak in debugger scenarios is far less concerning than a more general memory leak.
For anyone experiencing memory issues who finds this issue, I highly recommend reading this section of our performance docs. If you find other evidence of a memory leak when a debugger is not attached, please file a new issue and we will investigate there!
I have a function app with very high memory usage: it deals in bulk transactions (lists and dictionaries of 100k+ objects) using upwards of 14 GB of memory, split across 1 orchestrator with 4 sub-orchestrators of ~5 activities each.
When I run it with the debugger attached I see upwards of 14 GB of memory usage; when the orchestrations complete, the memory slowly goes from ~14 GB to ~12 GB but never really gets GC'd.
When I run it without the debugger attached, I monitor the Visual Studio Debugger Console Application resources in Task Manager and see it oscillate between 2 GB and 4 GB, ultimately resting at ~2.5 GB after all four orchestrations have completed and show the status "Completed" in the TestHubNameInstances table.
When hosted on Azure, I notice somewhat higher usage, around ~7 GB on a Premium EP3, and it takes a long time for the memory working set to drop back from ~7 GB and settle around ~4 GB. It does seem to me like there is something going on with durable tasks - I would expect a fairly quick drop-off in the memory working set after the orchestrations / sub-orchestrations have all completed and nothing is actively executing.
Test Setup
I created a function app which fans out to 1,000 sub-orchestrations (which each do a simple hello world sequence) and had it run locally for several hours using ContinueAsNew.
Top-Level Orchestration
The top-most orchestration calls a single sub-orchestration in a ContinueAsNew loop:
Fan-Out Orchestration
This orchestration fans out to 1,000 sub-orchestrations.
Child Orchestration (simple sequence)
This is the inner-most orchestration, which is a simple sequence of activity functions.
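The code blocks from the original post are not preserved in this capture. A minimal sketch of the three orchestrations described above, using the Durable Functions 1.x API seen elsewhere in this thread (the SayHello activity and all names are illustrative, not the original sample):

    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;
    using Microsoft.Azure.WebJobs;

    public static class FanOutRepro
    {
        [FunctionName("TopLevelOrchestration")]
        public static async Task TopLevelOrchestration(
            [OrchestrationTrigger] DurableOrchestrationContext context)
        {
            // Calls a single sub-orchestration, then restarts itself.
            await context.CallSubOrchestratorAsync("FanOutOrchestration", null);
            context.ContinueAsNew(null);
        }

        [FunctionName("FanOutOrchestration")]
        public static async Task FanOutOrchestration(
            [OrchestrationTrigger] DurableOrchestrationContext context)
        {
            // Fans out to 1,000 child orchestrations in parallel.
            IEnumerable<Task> children = Enumerable.Range(0, 1000)
                .Select(i => context.CallSubOrchestratorAsync("ChildOrchestration", i));
            await Task.WhenAll(children);
        }

        [FunctionName("ChildOrchestration")]
        public static async Task ChildOrchestration(
            [OrchestrationTrigger] DurableOrchestrationContext context)
        {
            // Simple hello-world sequence of activity calls.
            await context.CallActivityAsync<string>("SayHello", "Tokyo");
            await context.CallActivityAsync<string>("SayHello", "Seattle");
            await context.CallActivityAsync<string>("SayHello", "London");
        }
    }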
Results
After a few hours, I observed several OutOfMemoryException errors, and noticed that the process working set of func.exe was growing continuously throughout the run. Here is the callstack:
This callstack seems to point to Task.s_currentActiveTasks as being unable to expand. This dictionary is responsible for tracking all currently running tasks in the app domain.
Investigation
Looking at allocations in the Visual Studio debugger, I noticed that there seem to be a large number of tasks that are not getting garbage collected. It’s my understanding that task-related memory leaks are possible when tasks are created but never completed. I believe that is exactly what TaskOrchestrationExecutor does – starts tasks, saves them in the TaskOrchestrationExecutor.result field, and never completes them. The executor objects obviously can be GC’d, but I don’t think that is sufficient to ensure the Task objects get GC’d.
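As a conceptual sketch of the pattern described above (illustrative only, not the actual TaskOrchestrationExecutor code): an async method awaits a task source that is never signaled, so the stored task never completes; with the debugger's async task tracking enabled, such tasks can stay reachable even after the object holding the field is gone:

    using System.Threading.Tasks;

    class ReplayExecutorSketch
    {
        private Task result;

        public void Execute()
        {
            // The task is captured in a field, but nothing ever completes it.
            this.result = WaitForNextEventAsync();
        }

        private static async Task WaitForNextEventAsync()
        {
            var tcs = new TaskCompletionSource<bool>();
            await tcs.Task; // never signaled, so this async method never finishes
        }
    }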
Workaround
The root cause appears to be related to tasks that are created but never completed. This only happens when doing orchestration replay. For that reason, this can be partially mitigated by enabling extended sessions, as described in the Performance & Scale documentation: https://docs.microsoft.com/en-us/azure/azure-functions/durable-functions-perf-and-scale#orchestrator-function-replay