OutOfMemoryException in MessageSorter

GuusRaaphorst commented 1 year ago

Description

We have an Azure function app that at some point started to log out of memory exceptions when trying to start an orchestration.

The function app basically does the following:

A TimerTrigger function executes every 5 minutes and checks whether the cache for a specific external system (a 'POS') is still fresh. If not, an orchestration is started to update the cache.
The orchestration will then (among other things) start a suborchestration that in turn:
- Creates a new entity ID: var entityId = new EntityId(nameof(PosEntity), posName)
- Locks it and executes a function to get information from the POS and update the cache

Note that we use a durable entity because of some peculiarities of the external system: it allows only 1 request at a time. By locking the entity, we make sure that only one call to the POS is in flight at any moment.
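A minimal sketch of the suborchestration described above, assuming the in-process C# model of Durable Functions 2.x. The function names RefreshPosCache and UpdatePosCache and the empty PosEntity class are hypothetical stand-ins, since the actual code could not be shared:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class PosCacheRefresh
{
    // Hypothetical suborchestration: receives the POS name as input.
    [FunctionName("RefreshPosCache")]
    public static async Task RefreshPosCache(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        string posName = context.GetInput<string>();
        var entityId = new EntityId(nameof(PosEntity), posName);

        // LockAsync returns an IDisposable; disposing it releases the lock,
        // so only one orchestration per POS runs this section at a time.
        using (await context.LockAsync(entityId))
        {
            await context.CallActivityAsync("UpdatePosCache", posName);
        }
    }

    // Hypothetical activity: would call the POS and update the cache.
    [FunctionName("UpdatePosCache")]
    public static Task UpdatePosCache([ActivityTrigger] string posName)
    {
        return Task.CompletedTask; // placeholder for the real POS call
    }
}

// Assumed shape of the entity: it carries no state and exists only to be locked.
public class PosEntity
{
    [FunctionName(nameof(PosEntity))]
    public static Task Run([EntityTrigger] IDurableEntityContext ctx)
        => ctx.DispatchAsync<PosEntity>();
}
```

The entity here acts purely as a named lock, one per POS, which matches the @posentity@lab instance ID seen in the logs (entity name plus key).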

Expected behavior

I would not expect out of memory exceptions from the Durable Task framework.

Actual behavior

At different points in time, some of our environments (test and acc) started to show performance degradation and high memory usage, causing alerts. The logging that we have in place showed the following log lines (the last 2 far more often than the first 2):

@posentity@lab: Function 'posentity (Entity)' failed. TraceFlags: 1Y. Details: OutOfMemoryException. HubName: FnBHub. AppName:-redacted-. SlotName: Production. ExtensionVersion: 2.10.0. SequenceNumber: 207.
@posentity@lab: Orchestration execution was aborted: Session aborted because of OutOfMemoryException, traceFlags=1Y!D
@posentity@lab: Message [EventRaised] with ID 55a247bc-7af5-4b41-be2c-acd2805fa166 has been dequeued 307 times and is now considered poison
@posentity@lab: Abandoning [EventRaised] message back to fnbhub-control-03 and setting a visibility delay of 600ms

Relevant source code snippets

I do not have code that I am allowed to share at the moment.

Known workarounds

I have tried several things that did not seem to help:

- deploying an older version of the function app
- upgrading the durable extensions packages to 2.11.1
- commenting out specific parts of our code
- using the REST API to purge the durable functions history/instances

Adding the following settings (before, we used the defaults) and adjusting the values a little does seem to help. Our test environment currently uses the following values and has been running OK for a couple of hours now.

"maxConcurrentActivityFunctions": 50,
"maxConcurrentOrchestratorFunctions": 15,
"maxEntityOperationBatchSize": 10
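
For readers trying the same workaround, these settings belong under extensions.durableTask in host.json. A minimal sketch, assuming the version 2.0 host.json schema and no other Durable Task settings in the file; the values are the ones quoted above:

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "maxConcurrentActivityFunctions": 50,
      "maxConcurrentOrchestratorFunctions": 15,
      "maxEntityOperationBatchSize": 10
    }
  }
}
```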

App Details

Azure function running on an Elastic plan (EP1, also containing other function apps)
Azure Functions runtime version : v4
Programming language used : C#, .net 6.0
Packages
- "Microsoft.Azure.Functions.Extensions" Version="1.1.0"
- "Microsoft.Azure.WebJobs" Version="3.0.37"
- "Microsoft.Azure.WebJobs.Extensions" Version="5.0.0"
- "Microsoft.Azure.WebJobs.Extensions.DurableTask" Version="2.10.0"
- "Microsoft.Azure.WebJobs.Extensions.OpenApi" Version="1.5.1"
- "Microsoft.Azure.WebJobs.Extensions.ServiceBus" Version="5.12.0"
- "Microsoft.Azure.WebJobs.Extensions.Storage" Version="5.1.3"

Screenshots

At some point I was able to create a profiler trace and a memory dump of the function app (see images). Both point to the MessageSorter and DurableTask.RequestMessage, which is why I am opening this ticket: to inform, but also in the hope of getting some guidance on what is going on here and whether I am doing something wrong.

If deployed to Azure

Timeframe issue observed : started around September 1st 14:00
Orchestration instance ID(s) : I do not see an orchestration instance involved here. It seems that the orchestration cannot start. I do have execution IDs, but I am not sure if they help:
- 12fcfef0c5ec44f987e1b9365f0e09a0, 5e9054d102fa4a69abac766ab8625174
- these IDs come from the log lines that show Orchestration execution was aborted: Session aborted because of OutOfMemoryException, traceFlags=1Y!D

[Screenshots: App-MemoryUsage, App-CpuUsage]

davidmrdavid commented 1 year ago

Thanks @GuusRaaphorst. There have been some known memory issues around the message sorter behavior of Entities, so this may be a known issue. See here for some context.

My suggestion would be to reduce the message sorter re-ordering window, which limits how big the message sorting array can get, but for some reason I'm not seeing that as a configurable value in our host.json settings. I need to look into that, so let me get back to you.

GuusRaaphorst commented 1 year ago

Thanks for your quick response @davidmrdavid !

I assume you are referring to EntityMessageReorderWindowInMinutes from here? I'll give it a try!

I also see that Netherite should solve this problem, so we might need to take a look into that too.

GuusRaaphorst commented 1 year ago

I have been playing with settings, cleaning all history of durable functions, etc. Nothing.

In the end it turned out to be one stupid little change causing havoc in all our function apps that use Durable Functions. The services that we use all share some common code. One part of this is that on startup, some general services (e.g. logging) are registered.

Some lines of code were added there to improve our automatic OpenAPI documentation generation. This code was doing this: JsonConvert.DefaultSettings = () => jsonSettings; and adding some specific JsonConverters, related to dates, enums, etc.

Removing this solved the problems. I have no idea why this caused the problems that we saw and I also have no idea how I could have seen this in logging or anything.

But it all works again. Just wanted to let you know.

davidmrdavid commented 1 year ago

Thanks for the report, @GuusRaaphorst.

Regarding this line of code: JsonConvert.DefaultSettings = () => jsonSettings;

Oh, I've seen that error before. In older versions of the Durable Extension, it was possible for user code to accidentally override the serialization settings that the DF extension itself uses by doing exactly that, which in turn breaks all sorts of low-level details. I thought I had fixed that.
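
To illustrate the general mechanism (not the exact code path inside the DF extension, which this thread does not show): in Newtonsoft.Json, JsonConvert.DefaultSettings is process-wide, so every JsonConvert call that does not pass its own settings picks it up, including calls made by libraries. A minimal sketch, where the Status enum and the class name are hypothetical and StringEnumConverter stands in for the converters mentioned above:

```csharp
using System;
using Newtonsoft.Json;
using Newtonsoft.Json.Converters;

public static class JsonSettingsScopeDemo
{
    public enum Status { Open, Closed }

    public static void Main()
    {
        var jsonSettings = new JsonSerializerSettings();
        jsonSettings.Converters.Add(new StringEnumConverter());

        // No global override yet: enums serialize as numbers.
        Console.WriteLine(JsonConvert.SerializeObject(Status.Closed));               // 1

        // Scoped use: only this call is affected.
        Console.WriteLine(JsonConvert.SerializeObject(Status.Closed, jsonSettings)); // "Closed"

        // Global override: every settings-less JsonConvert call in the process
        // now uses jsonSettings, whoever makes the call.
        JsonConvert.DefaultSettings = () => jsonSettings;
        Console.WriteLine(JsonConvert.SerializeObject(Status.Closed));               // "Closed"
    }
}
```

Passing the settings explicitly to the calls that need them (or to the specific serializer used for OpenAPI generation) keeps the change local and leaves other libraries in the process unaffected.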

@GuusRaaphorst: do you have a minimal repro showing this error with the latest DF release? That would help us greatly.

Azure / azure-functions-durable-extension