Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions

Memory leak when debugger attached #340

Open cgillum opened 6 years ago

cgillum commented 6 years ago

Test Setup

I created a function app which fans out to 1,000 sub-orchestrations (which each do a simple hello world sequence) and had it run locally for several hours using ContinueAsNew.

Top-Level Orchestration

The top-most orchestration calls a single sub-orchestration in a ContinueAsNew loop:

  [FunctionName(nameof(RunContinuousSubOrchestratorFanOut))]
  public static async Task RunContinuousSubOrchestratorFanOut(
      [OrchestrationTrigger] DurableOrchestrationContext ctx,
      TraceWriter log)
  {
      DateTime startTime = ctx.CurrentUtcDateTime;
      await ctx.CallSubOrchestratorAsync(nameof(SubOrchestratorFanOut), 1000);

      // Wait 2 minutes before starting the next run
      if (!ctx.IsReplaying) log.Info($"Sleeping for 2 minutes.");
      await ctx.CreateTimer(ctx.CurrentUtcDateTime.AddMinutes(2), CancellationToken.None);
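      // NOTE: parallelSubOrchestrations is not defined anywhere in this snippet and
      // appears to be a leftover from an earlier version of the sample (see
      // ConnorMcMahon's comment further down, where it was removed).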
      ctx.ContinueAsNew(parallelSubOrchestrations);
  }

Fan-Out Orchestration

This orchestration fans out to 1,000 sub-orchestrations.

  [FunctionName(nameof(SubOrchestratorFanOut))]
  public static async Task SubOrchestratorFanOut([OrchestrationTrigger] DurableOrchestrationContext ctx)
  {
      int parallelSubOrchestrations = ctx.GetInput<int>();

      IEnumerable<Task> subOrchestrationTasks = Enumerable.Range(0, parallelSubOrchestrations).Select(i => 
          ctx.CallSubOrchestratorAsync(
              functionName: nameof(HelloSequenceOrchestration.HelloSequence),
              input: i));

      Task allSubOrchestrationsCompletedTask = Task.WhenAll(subOrchestrationTasks);

      using (var cts = new CancellationTokenSource())
      {
          Task timeoutTask = ctx.CreateTimer(ctx.CurrentUtcDateTime.AddHours(1), cts.Token);
          if (timeoutTask == await Task.WhenAny(allSubOrchestrationsCompletedTask, timeoutTask))
          {
              throw new TimeoutException("One or more sub-orchestrations failed to complete within an hour!");
          }
          else
          {
              cts.Cancel();
          }
      }
  }

Child Orchestration (simple sequence)

This is the inner-most orchestration, which is a simple sequence of activity functions.

[FunctionName(nameof(HelloSequence))]
public static async Task<List<string>> HelloSequence(
    [OrchestrationTrigger] DurableOrchestrationContextBase context)
{
    var outputs = new List<string>();

    outputs.Add(await context.CallActivityAsync<string>("E1_SayHello", "Tokyo"));
    outputs.Add(await context.CallActivityAsync<string>("E1_SayHello", "Seattle"));
    outputs.Add(await context.CallActivityAsync<string>("E1_SayHello", "London"));

    // returns ["Hello Tokyo!", "Hello Seattle!", "Hello London!"]
    return outputs;
}

Results

After a few hours, I observed several OutOfMemoryException errors, and noticed that the process working set of func.exe was growing continuously throughout the run.

Here is the callstack:

System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Collections.Generic.Dictionary`2.Resize(Int32 newSize, Boolean forceNewHashCodes)
   at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
   at System.Threading.Tasks.Task.AddToActiveTasks(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.OutputWaitEtwEvents(Task task, Action continuation)
   at System.Runtime.CompilerServices.TaskAwaiter.OnCompletedInternal(Task task, Action continuation, Boolean continueOnCapturedContext, Boolean flowExecutionContext)
   at System.Runtime.CompilerServices.TaskAwaiter`1.UnsafeOnCompleted(Action continuation)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AwaitUnsafeOnCompleted[TAwaiter,TStateMachine](TAwaiter& awaiter, TStateMachine& stateMachine)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.<>c.<ThrowAsync>b__6_1(Object state)
   at System.Threading.QueueUserWorkItemCallback.WaitCallback_Context(Object state)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()

This callstack seems to point to Task.s_currentActiveTasks as being unable to expand. This dictionary is responsible for tracking all currently running tasks in the app domain.

Investigation

Looking at allocations in the Visual Studio debugger, I noticed that there seem to be a large number of tasks that are not getting garbage collected. It’s my understanding that task-related memory leaks are possible when tasks are created but never completed. I believe that is exactly what TaskOrchestrationExecutor does – starts tasks, saves them in the TaskOrchestrationExecutor.result field, and never completes them. The executor objects obviously can be GC’d, but I don’t think that is sufficient to ensure the Task objects get GC’d.
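To make the suspected pattern concrete, here is a minimal sketch (assumed behavior, not the actual TaskOrchestrationExecutor code): an async method is started, suspends at an await that is never signaled, and the resulting Task is stored and then abandoned without ever completing.

  using System.Threading.Tasks;

  class AbandonedTaskSketch
  {
      // Analogous to the TaskOrchestrationExecutor.result field mentioned above.
      private Task result;

      public void StartReplay()
      {
          // The async method runs until its first await and then suspends.
          this.result = RunOrchestratorAsync();
          // Replay ends here; 'result' is stored but never completed or awaited again.
      }

      private async Task RunOrchestratorAsync()
      {
          var pending = new TaskCompletionSource<object>();
          // Nothing ever calls SetResult/SetException/SetCanceled, so this await never
          // resumes and the Task returned by RunOrchestratorAsync never completes.
          await pending.Task;
      }
  }

When a debugger is attached, the runtime's active-task tracking (the s_currentActiveTasks dictionary from the stack trace above) keeps a strong reference to such tasks, which is consistent with them never being collected.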

Workaround

The root cause appears to be related to tasks that are created but never completed. This only happens when doing orchestration replay. For that reason, this can be partially mitigated by enabling extended sessions, as described in the Performance & Scale documentation: https://docs.microsoft.com/en-us/azure/azure-functions/durable-functions-perf-and-scale#orchestrator-function-replay
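For reference, a minimal host.json sketch for enabling this with the Functions v1 / Durable 1.x layout might look like the following (the Functions v2 layout, which nests the settings under "extensions", is shown later in this thread):

{
  "durableTask": {
    "extendedSessionsEnabled": true,
    "extendedSessionIdleTimeoutInSeconds": 30
  }
}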

cgillum commented 6 years ago

This will likely impact https://github.com/Azure/azure-functions-durable-extension/issues/368.

Avalaxy commented 6 years ago

I think I ran into the same issue. I have an orchestration trigger that calls several sub-orchestration triggers, which all fire a lot of activity triggers (by a lot I mean thousands of them). It runs well for about 30 minutes until I get an OutOfMemoryException. I profiled the code and this is what happens:

image

Any ideas how to resolve it?

Btw, this is on Functions v1.0.13.

aaronhudon commented 6 years ago

Any news on this? I'm not receiving OutOfMemoryExceptions; however, in local development the environment hits 1 GB of memory and just hangs. I'm following a pattern almost identical to https://docs.microsoft.com/en-us/azure/azure-functions/durable-functions-cloud-backup#the-cloud-backup-orchestration-visual-studio-code-and-azure-portal-sample-code

TheFairey commented 6 years ago

I've noticed this too: in local debug in VS 2017 it won't let you use more than 1 GB of RAM. Not sure if there is a setting somewhere I'm missing to change this value to at least 1.5 GB to reflect the small instances on the Consumption Plan?

aaronhudon commented 6 years ago

@TheFairey the question is, why does it even need to consume 1 GB of RAM?

TheFairey commented 6 years ago

@aaronhudon Fair point, although in my app I was doing some pretty heavy image manipulation, and with 5 or so concurrent activity functions doing the same thing it frequently hit out-of-memory errors at the 1 GB limit, so temporarily I had to drop to a maximum of 3 concurrent activity functions to keep memory usage under 1 GB.

anhhnguyen206 commented 6 years ago

Is this issue fixed in release 1.6.2? I assume not, since this is still open; if that's the case, is there any update on this?

arichika commented 6 years ago

I also had a similar problem.

Please see this article: https://docs.microsoft.com/en-us/azure/azure-functions/durable-functions-perf-and-scale#orchestrator-function-replay

The latest Durable Functions extension, "Microsoft.Azure.WebJobs.Extensions.DurableTask" Version="1.6.2", seems to keep all of the history in memory. After applying the following settings, the old history was collected by the GC.

{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "extendedSessionsEnabled": true,
      "extendedSessionIdleTimeoutInSeconds": 30
    }
  }
}

This is for Azure Functions v2.

The appropriate value of extendedSessionIdleTimeoutInSeconds is environment-dependent. If it is short, the CPU time consumed by GC seems to increase; if it is long, memory usage will increase.

In my environment, this setting has kept the memory growth under control.

I hope this helps. ... Should the default setting be the other way around? Hmm...

sebb3 commented 5 years ago

I have the same problem on 1.8.1. In a sub-orchestration I'm trying to fan out to 10 activities fetching around 2.5 MB each. This is then done 5 times in sequence (I can't send more than 10 simultaneous requests to the API I'm fetching from). After about 3 iterations (10 + 10 + 10 activities) I hit 2 GB of memory in the debugger and the orchestration just stops without any exceptions. I tried arichika's solution, setting the extendedSession properties in host.json, but it did not solve the problem in my case.

larryeby commented 5 years ago

This issue seems to be exacerbated by the number of static fields you retain on your classes. For instance, if you're following the HttpClient pattern here, the amount of memory that isn't released increases and grows on every subsequent run.

I was able to reduce the retained memory from 1 GB to 700 MB by changing the way I build my HttpClients (and other usually-static clients, like DocumentClient). In fact, subsequent runs of the same durable task pipeline didn't increase the leftover memory (also using the JSON settings above from @arichika).

It's important to be able to use static HttpClients in .NET, so removing them isn't ideal for high-traffic projects. But as long as you're careful with how you build out those clients, it might help.
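As an illustration only (the exact change made above isn't described), one way to be careful with a shared client is to create it lazily rather than in an eager static initializer, so the instance is only constructed (and rooted) once it is actually needed:

  using System;
  using System.Net.Http;

  public static class SharedHttp
  {
      // Still a single shared HttpClient instance, but created on first use
      // instead of at type-initialization time.
      private static readonly Lazy<HttpClient> lazyClient =
          new Lazy<HttpClient>(() => new HttpClient());

      public static HttpClient Client => lazyClient.Value;
  }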

I'm using 1.6.2 currently until they fix the outputs in 1.8.*+.

spzSource commented 5 years ago

Hi @cgillum, I was able to reproduce the memory leak issue even when fanning out to activity functions. Please see the code sample below:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;
using MoreLinq;

namespace Company.FunctionApp
{
    public class DurableFunctionsOrchestration
    {
        [FunctionName(nameof(StartHttp))]
        public async Task<HttpResponseMessage> StartHttp(
            [HttpTrigger(AuthorizationLevel.Anonymous, "get", "post")]
            HttpRequestMessage request,
            [OrchestrationClient] DurableOrchestrationClient client)
        {
            await client.StartNewAsync(nameof(MigrateToSdmOrchestrator), null);

            return new HttpResponseMessage(HttpStatusCode.OK);
        }

        [FunctionName(nameof(MigrateToSdmOrchestrator))]
        public async Task MigrateToSdmOrchestrator([OrchestrationTrigger] DurableOrchestrationContext context)
        {
            var batches = await context.CallActivityAsync<IReadOnlyCollection<VolumesBatch>>(
                nameof(GetUploadedBatchesActivity),
                null);

            var uniqueBatchIds = batches
                .Select(m => m.OriginalBatch)
                .GroupBy(b => (b.Country, b.VeType, b.VoType))
                .Select(g => g.OrderBy(b => b.BatchId).First().BatchId)
                .ToList();

            var subBatches = batches
                .Where(m => uniqueBatchIds.Contains(m.OriginalBatch.BatchId))
                .OrderBy(b => b.OriginalBatch.BatchId)
                .ThenBy(b => b.Year)
                .ThenBy(b => b.Month);

            var routines = subBatches
                .Select(batch => context.CallActivityAsync(
                    nameof(WriteSubBatchToSdmActivity),
                    batch));

            await Task.WhenAll(routines);
        }

        [FunctionName(nameof(WriteSubBatchToSdmActivity))]
        public async Task WriteSubBatchToSdmActivity([ActivityTrigger] VolumesBatch batch)
        {
            await Console.Out.WriteLineAsync($"Processing batch: {batch.Id}");
            await Task.Delay(3000);
        }

        [FunctionName(nameof(GetUploadedBatchesActivity))]
        public Task<IReadOnlyCollection<VolumesBatch>> GetUploadedBatchesActivity(
            [ActivityTrigger] DurableActivityContext context)
        {
            return Task.FromResult<IReadOnlyCollection<VolumesBatch>>(Enumerable.Range(0, 1000)
                .Select(CreateBatch)
                .ToList()
                .AsReadOnly());
        }

        private VolumesBatch CreateBatch(int index)
        {
            return new VolumesBatch(
                Guid.NewGuid(),
                index,
                index,
                $"Region {index}",
                String.Empty,
                DateTime.UtcNow,
                DateTime.UtcNow,
                DateTime.UtcNow,
                new HeritageBatch
                {
                    Country = $"Country {index}",
                    Created = DateTime.UtcNow,
                    BatchId = index / 2,
                    Ready = DateTime.UtcNow,
                    EndDate = DateTime.UtcNow,
                    StartDate = DateTime.UtcNow,
                    VeType = $"Ve Type {index}",
                    VoType = $"Vo Type {index}"
                });
        }
    }
}

host.json file:

"extensions": {
        "durableTask": {
            "logReplayEvents": false,
            "controlQueueBatchSize": 4,
            "traceInputsAndOutputs": true,
            "maxConcurrentActivityFunctions": 2,
            "maxConcurrentOrchestratorFunctions": 2,
            "maxQueuePollingInterval": "00:00:10"
        }
    }

It seems that the list of HistoryEvent objects is retained even after the orchestration has completed.

image

Just in case, I've attached two memory snapshots to be able to do a diff: https://www.dropbox.com/s/p1idbb7lqw54062/Snapshots.zip?dl=0

andrew-vdb commented 5 years ago

@spzSource What did you use to produce that screenshot (a process dump)? I have a memory issue with a durable function at the moment; it would be nice if I could use that tool too.

lukasvosyka commented 5 years ago

Hi @cgillum ,

I just had a similar issue while running an orchestration locally, with the stack trace below. It also results in an OutOfMemoryException. Not sure though if this is something related to local testing; I will try tomorrow on Azure. The host version is 2.0.12763.0.

"Microsoft.Azure.WebJobs.Extensions.DurableTask" Version="1.8.2" "Microsoft.NET.Sdk.Functions" Version="1.0.29"

Stack trace is:

Function 'OrchestratorRebuildMaterializedProductsForStore (Orchestrator)' failed with an error. Reason: Microsoft.Azure.WebJobs.FunctionFailedException: The activity function 'ActivityRebuildMaterializedProductsForStore' failed: "Failed to deserialize exception from TaskActivity: Unhandled exception while executing task: System.AggregateException: An error occurred while writing to logger(s). (Exception of type 'System.OutOfMemoryException' was thrown.) ---> System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Text.StringBuilder.ExpandByABlock(Int32 minBlockCharCount)
   at System.Text.StringBuilder.Append(Char value, Int32 repeatCount)
   at System.Text.StringBuilder.Append(Char value)
   at System.Text.StringBuilder.AppendFormatHelper(IFormatProvider provider, String format, ParamsArray args)
   at System.String.FormatHelper(IFormatProvider provider, String format, ParamsArray args)
   at System.String.Format(IFormatProvider provider, String format, Object[] args)
   at Microsoft.Extensions.Logging.Internal.LogValuesFormatter.Format(Object[] values)
   at Microsoft.Extensions.Logging.Internal.FormattedLogValues.ToString()
   at Microsoft.Extensions.Logging.LoggerExtensions.MessageFormatter(Object state, Exception error)
   at Microsoft.Azure.WebJobs.Script.WebHost.Diagnostics.SystemLogger.Log[TState](LogLevel logLevel, EventId eventId, TState state, Exception exception, Func`3 formatter) in C:\azure-webjobs-sdk-script\src\WebJobs.Script.WebHost\Diagnostics\SystemLogger.cs:line 78
   at Microsoft.Extensions.Logging.Logger.Log[TState](LogLevel logLevel, EventId eventId, TState state, Exception exception, Func`3 formatter)
   --- End of inner exception stack trace ---
   at Microsoft.Extensions.Logging.Logger.Log[TState](LogLevel logLevel, EventId eventId, TState state, Exception exception, Func`3 formatter)
   at Microsoft.Extensions.Logging.LoggerExtensions.Log(ILogger logger, LogLevel logLevel, EventId eventId, Exception exception, String message, Object[] args)
   at Microsoft.Azure.WebJobs.Extensions.DurableTask.EndToEndTraceHelper.FunctionFailed(String hubName, String functionName, String instanceId, String reason, FunctionType functionType, Boolean isReplay) in E:\functions\azure-functions-durable-extension\src\WebJobs.Extensions.DurableTask\EndToEndTraceHelper.cs:line 307
   at Microsoft.Azure.WebJobs.Extensions.DurableTask.TaskActivityShim.RunAsync(TaskContext context, String rawInput) in E:\functions\azure-functions-durable-extension\src\WebJobs.Extensions.DurableTask\Listener\TaskActivityShim.cs:line 64
   at DurableTask.Core.TaskActivityDispatcher.<>c__DisplayClass12_1.<<OnProcessWorkItemAsync>b__1>d.MoveNext() in C:\source\durabletask\src\DurableTask.Core\TaskActivityDispatcher.cs:line 135
---> (Inner Exception #0) System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Text.StringBuilder.ExpandByABlock(Int32 minBlockCharCount)
   at System.Text.StringBuilder.Append(Char value, Int32 repeatCount)
   at System.Text.StringBuilder.Append(Char value)
   at System.Text.StringBuilder.AppendFormatHelper(IFormatProvider provider, String format, ParamsArray args)
   at System.String.FormatHelper(IFormatProvider provider, String format, ParamsArray args)
   at System.String.Format(IFormatProvider provider, String format, Object[] args)
   at Microsoft.Extensions.Logging.Internal.LogValuesFormatter.Format(Object[] values)
   at Microsoft.Extensions.Logging.Internal.FormattedLogValues.ToString()
   at Microsoft.Extensions.Logging.LoggerExtensions.MessageFormatter(Object state, Exception error)
   at Microsoft.Azure.WebJobs.Script.WebHost.Diagnostics.SystemLogger.Log[TState](LogLevel logLevel, EventId eventId, TState state, Exception exception, Func`3 formatter) in C:\azure-webjobs-sdk-script\src\WebJobs.Script.WebHost\Diagnostics\SystemLogger.cs:line 78
   at Microsoft.Extensions.Logging.Logger.Log[TState](LogLevel logLevel, EventId eventId, TState state, Exception exception, Func`3 formatter)<---

Can I help you investigate this somehow?

jschieck commented 3 years ago

> Hi @cgillum, I was able to reproduce the memory leak issue even when fanning out to activity functions. Please see the code sample below: image
>
> Just in case, I've attached two memory snapshots to be able to do a diff: https://www.dropbox.com/s/p1idbb7lqw54062/Snapshots.zip?dl=0

Before I found this issue I came to the same conclusion. DurableOrchestrationContext.History (inherited from DurableCommonContext) is keeping around its list of HistoryEvents long after orchestrations have executed.

Turning off extended sessions fixed it for now, but at the cost of throughput.

ConnorMcMahon commented 3 years ago

We will prioritize investigating this in the near future. I cannot guarantee that we will be able to fix it, because it seems to be an inherent limitation in how tasks are garbage collected when they aren't cancelled, and cancelling the tasks can have some unintended consequences.

MSDNAndi commented 3 years ago

> We will prioritize investigating this in the near future. I cannot guarantee that we will be able to fix it, because it seems to be an inherent limitation in how tasks are garbage collected when they aren't cancelled, and cancelling the tasks can have some unintended consequences.

Any news on this? For a framework of this type to have such a potential restriction (without a clear and easy workaround) seems problematic.

Avalaxy commented 3 years ago

> We will prioritize investigating this in the near future. I cannot guarantee that we will be able to fix it, because it seems to be an inherent limitation in how tasks are garbage collected when they aren't cancelled, and cancelling the tasks can have some unintended consequences.
>
> Any news on this? For a framework of this type to have such a potential restriction (without a clear and easy workaround) seems problematic.

Better to give up on the steaming pile of trash that is Azure Functions. Seriously, how can such a major bug in such a super common scenario still be open after 3 years (with, as far as I know, no acknowledgement of it in the documentation)...

ConnorMcMahon commented 3 years ago

Ok, so I have taken a closer look at Task.s_currentActiveTasks to see how we could resolve this issue. Looking at the Stack Overflow comments for this question, it appears that s_currentActiveTasks is only used by the debugger, which means this wouldn't impact production applications.
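For context, a rough sketch of the mechanism (paraphrased from the .NET Framework reference source, not a verbatim copy): awaited tasks are only added to s_currentActiveTasks when async debugging is enabled, which the debugger turns on when it attaches, and each tracked task is then rooted by the dictionary until it completes and is removed.

  using System.Collections.Generic;
  using System.Threading.Tasks;

  static class ActiveTaskTrackingSketch
  {
      // The real table lives on System.Threading.Tasks.Task as s_currentActiveTasks.
      static readonly Dictionary<int, Task> currentActiveTasks = new Dictionary<int, Task>();
      static bool asyncDebuggingEnabled; // the debugger enables the real flag when attached

      // In the real framework this happens from TaskAwaiter when a task is awaited.
      static void OnTaskAwaited(Task task)
      {
          if (asyncDebuggingEnabled)
          {
              lock (currentActiveTasks)
              {
                  // Rooted here until the task completes and is removed; a task that
                  // never completes is therefore never eligible for collection.
                  currentActiveTasks[task.Id] = task;
              }
          }
      }
  }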

I am going to try reproducing this both locally and in production to see in which scenarios a memory leak is present.

ConnorMcMahon commented 3 years ago

Ok, so I ran this with the VS 2019 Performance Profiler (memory only), using a debug build of almost exactly the code in Chris's sample, on Functions v3 and v2.5.1 of the extension. The only caveat is that I had to translate the Durable 1.x code to Durable 2.x and remove the parallelSubOrchestrations parameter from ContinueAsNew(), because I couldn't identify where it was being created or consumed.

After 2 hours, the app's memory usage had essentially remained oscillating between ~160 MB and ~210 MB, mostly hovering around 175 MB during the two-minute sleep window.

This seems to confirm that without a debugger attached, we don't experience a memory leak (or if we do, it is much more minor than the one originally reported).

I will try attaching a debugger to see if memory usage starts to grow.

ConnorMcMahon commented 3 years ago

Here was my app without a debugger attached running for two hours:

image

Here was my app with a debugger attached:

image

Note that attaching the debugger substantially slowed down the processing of sub-orchestrations, so the steady memory increase we saw occurred over far fewer ContinueAsNew iterations.

I think this very clearly shows that the memory leak in this sample only really happens with the debugger attached, as expected with s_currentActiveTasks being the culprit. I will rename the issue accordingly to accurately reflect that.

I think that in order to fix this, we would need to work with the Roslyn team. In the GitHub issue associated with my previously linked Stack Overflow question, Stephen Toub commented "Regardless of whether the implementation or debugger is doing the wrong thing here around explicitly removing the task from the debug-only s_currentActiveTasks table, we should change s_currentActiveTasks to be a ConditionalWeakTable or similar rather than a Dictionary, so that it's not rooting tasks unnecessarily." This value still appears to be a Dictionary in the current source code, so we can follow up with that team to see whether making that change would fix this issue.
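For anyone unfamiliar with that suggestion, the relevant difference is roughly this: a ConditionalWeakTable holds its keys weakly, so a tracked Task can still be collected once nothing else references it, whereas a plain Dictionary<int, Task> keeps every tracked Task alive until it is explicitly removed. A small sketch:

  using System.Collections.Generic;
  using System.Runtime.CompilerServices;
  using System.Threading.Tasks;

  static class WeakVsStrongTrackingSketch
  {
      // Keeps every tracked task alive until it is explicitly removed.
      static readonly Dictionary<int, Task> strongTable = new Dictionary<int, Task>();

      // Holds keys weakly; an entry disappears once its Task is otherwise unreachable.
      static readonly ConditionalWeakTable<Task, object> weakTable =
          new ConditionalWeakTable<Task, object>();

      static void Track(Task task)
      {
          strongTable[task.Id] = task;        // roots the task
          weakTable.Add(task, new object());  // does not root the task
      }
  }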

In the meantime, I'm going to remove this from the high priority milestone, because a memory leak in debugger scenarios is far less concerning than a more general memory leak.

For anyone experiencing memory issues that finds this issue, I highly recommend reading this section of our performance docs. If you are finding other evidence of a memory leak running when a debugger is not attached, please file a new issue and we will investigate there!

Tyler-V commented 2 years ago

I have a very-high-memory-usage function app dealing in bulk transactions, using upwards of 14 GB of memory (lists and dictionaries of 100k+ objects), split across 1 orchestrator with 4 sub-orchestrators of ~5 activities each.

When I run it with the debugger attached, I see upwards of 14 GB of memory usage; when the orchestrations complete, the memory slowly goes from ~14 GB to ~12 GB but never really gets garbage collected.

When I run it without the debugger attached, I monitor the Visual Studio Debugger Console application's resources in Task Manager and see it oscillate between 2 GB and 4 GB, ultimately resting at ~2.5 GB after all four orchestrations have completed and show the status "Completed" in the TestHubNameInstances table.

When hosted on Azure, I notice somewhat higher usage, around ~7 GB on a Premium EP3 plan, and it takes a long time for the memory working set to drop back from ~7 GB and settle around ~4 GB. It does seem to me like there is something going on with durable tasks; I would expect a fairly quick drop-off in the memory working set after the orchestrations/sub-orchestrations have all completed and nothing is actively executing.