dotnet/runtime


LOH increases its size until OOM #109774

Open arekpalinski opened 1 week ago

arekpalinski commented 1 week ago

Description

I'm from the RavenDB team and would like to share a report about potentially peculiar behavior of managed memory allocations, in particular the size, utilization, and fragmentation of the Large Object Heap.

We observe the following issue with one of the servers in RavenDB Cloud. I assume it must be specific to the type of queries this instance is handling, as it's not a typical pattern that we see elsewhere. The server is crashing due to OOM; at that time we see high managed memory allocation (yellow on the graph below):

Image

Nov 05 18:19:32 vmf8905577d5 systemd[1]: ravendb.service: A process of this unit has been killed by the OOM killer.
Nov 05 18:19:35 vmf8905577d5 systemd[1]: ravendb.service: Main process exited, code=killed, status=9/KILL
...
Nov 05 18:19:36 vmf8905577d5 systemd[1]: ravendb.service: Failed with result 'oom-kill'.

The server has a steady load of ~100-150 requests/sec, 24/7. We can see the same pattern of managed allocations. We use the following GC settings:

    "configProperties": {
      "System.GC.Concurrent": true,
      "System.GC.Server": true,
      "System.GC.RetainVM": true,

Our first step was to check what takes that space, so we collected a dotMemory dump for analysis of managed memory when it was over 25GB. We found there that most of the memory is taken by the LOH, although out of the 24GB it occupied, only 136MB was used:

Image

When looking at the LOH details in dotMemory, we can see that most of the LOH consists of empty areas (no objects allocated, 0% utilization, 0% fragmentation):

Image

We do have a mechanism that forces LOH compaction from time to time during background GC (once it gets bigger than 25% of physical memory):

https://github.com/ravendb/ravendb/blob/80ba404673f502f2db71269ecba1a135e56e7c95/src/Sparrow/LowMemory/LowMemoryNotification.cs#L150-L154
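For context, a minimal sketch of the kind of trigger described above (the linked RavenDB code is the authoritative version; the class, method name, and threshold check here are illustrative only):

using System.Runtime;

static class LohCompactionTrigger
{
    // Illustrative only: once the LOH exceeds ~25% of physical memory, request
    // that the next full blocking collection also compacts the LOH. The
    // CompactOnce flag is consumed and reset by that collection.
    public static void MaybeRequestLohCompaction(long lohSizeBytes, long physicalMemoryBytes)
    {
        if (lohSizeBytes > physicalMemoryBytes / 4)
        {
            GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        }
    }
}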


As a next step, we tried experimenting with some GC settings, collecting GC info with dotnet-trace, and checking the info provided by GC.GetGCMemoryInfo().
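For reference, the per-generation numbers quoted throughout this report can be read programmatically roughly like this (a sketch assuming .NET 5+, where GenerationInfo index 3 is the LOH):

using System;

GCMemoryInfo info = GC.GetGCMemoryInfo();
// GenerationInfo indices on .NET 5+: 0 = gen0, 1 = gen1, 2 = gen2, 3 = LOH, 4 = POH.
GCGenerationInfo loh = info.GenerationInfo[3];
Console.WriteLine($"LOH size after last GC:          {loh.SizeAfterBytes / 1024.0 / 1024:F2} MB");
Console.WriteLine($"LOH fragmentation after last GC: {loh.FragmentationAfterBytes / 1024.0 / 1024:F2} MB");
Console.WriteLine($"Total committed:                 {info.TotalCommittedBytes / 1024.0 / 1024:F2} MB");
Console.WriteLine($"Memory load:                     {info.MemoryLoadBytes / 1024.0 / 1024:F2} MB");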

We found that with DOTNET_GCConserveMemory=8 (so the LOH is compacted when it has too much fragmentation) and "System.GC.RetainVM": false the system behaves stably. We no longer experience the oom-killer. The memory usage graph for the last couple of days looks as follows:

Image
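For reference, these two knobs can be expressed either as environment variables or in runtimeconfig.json; a sketch assuming .NET 6+, where System.GC.ConserveMemory is the config-file counterpart of the DOTNET_GCConserveMemory environment variable (the exact file layout here is illustrative):

{
  "configProperties": {
    "System.GC.Server": true,
    "System.GC.Concurrent": true,
    "System.GC.RetainVM": false,
    "System.GC.ConserveMemory": 8
  }
}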


However, we're still wondering whether the LOH sizes we see there are expected. We don't see an LOH of this size anywhere else. We do know that allocations on the LOH are mostly due to heavy usage of arrays by the Lucene search engine (https://github.com/ravendb/lucenenet).

We have also noticed that when the size of the LOH is very big, the fragmentation reported for it is small (based on the info returned by GC.GetGCMemoryInfo() or from a .nettrace file). For example, we could see 0% fragmentation of the LOH (induced GCs were forced intentionally during the dotnet-trace recording):

Image

But shortly after that we collected a dotMemory snapshot where, same as above, we see an almost empty LOH:

Image


We also have another dotnet-trace session where we started to collect GC info (--profile gc-verbose) when the LOH was:

"GenerationName": "Large Object Heap",
"FragmentationAfterHumane": "56.23 MBytes",
"FragmentationBeforeHumane": "56.23 MBytes",
"SizeAfterHumane": "24.44 GBytes",
"SizeBeforeHumane": "24.44 GBytes"

after 1h 20m it was:

"GenerationName": "Large Object Heap",
"FragmentationAfterHumane": "50.44 MBytes",
"FragmentationBeforeHumane": "50.44 MBytes",
"SizeAfterHumane": "28.99 GBytes",
"SizeBeforeHumane": "28.99 GBytes"

and eventually the size was reduced to 10GB (with 9GB of fragmentation):

"GenerationName": "Large Object Heap",
"FragmentationAfterHumane": "9.05 GBytes",
"FragmentationBeforeHumane": "9.05 GBytes",
"SizeAfterHumane": "10.25 GBytes",
"SizeBeforeHumane": "10.25 GBytes"

In PerfView it's seen as:

Image

Then we collected a dotMemory snapshot again and the LOH was reduced to 3.14GB (due to the GC forced by dotMemory, which compacted the LOH, as I understand it):

Image

We can provide the .nettrace and dotMemory snapshot files privately if that would be helpful.

Reproduction Steps

It's reproduced only on a specific instance. We cannot provide access, but we're able to collect debug assets.

Expected behavior

The LOH does not reach a size of >20GB of memory.

Actual behavior

The LOH reaches >20GB of memory, while fragmentation and utilization are very low.

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response

dotnet-policy-service[bot] commented 1 week ago

Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.

mangod9 commented 1 week ago

Would you be able to collect and share a gc-collect-only trace on this instance? Also, is this a regression after an app (or .NET) update?

arekpalinski commented 1 week ago

Please find the output of dotnet-trace for gc-collect profile: https://drive.google.com/file/d/1qanKgND8fvBt9WTLgOaiumfJugMQuZIn/view?usp=sharing

The trace collection started right after restarting the RavenDB server. As before, DOTNET_GCConserveMemory=8 and "System.GC.RetainVM": false were applied.

arekpalinski commented 1 week ago

Please find another trace of gc-collect after a few hours: https://drive.google.com/file/d/16WcNFXf5fSNehT7zc_AZaSasFBHFSQGL/view?usp=sharing

Once the collection started, we had the following LOH stats (note the machine has 32GB of memory):

"GenerationName": "Large Object Heap",
"FragmentationAfterHumane": "85.67 MBytes",
"FragmentationBeforeHumane": "85.67 MBytes",
"SizeAfterHumane": "31.14 GBytes",
"SizeBeforeHumane": "31.14 GBytes"

...

"HeapSizeHumane": "34.73 GBytes",
"HighMemoryLoadThresholdHumane": "28.21 GBytes",
"Index": 497,
"MemoryLoadHumane": "24.76 GBytes",
"PauseDurations": [
"00:00:00.0172980",
"00:00:00"
],
"PauseTimePercentage": 0.13,
"PinnedObjectsCount": 2,
"PromotedHumane": "32.04 MBytes",
"TotalAvailableMemoryHumane": "31.34 GBytes",
"TotalCommittedHumane": "35.47 GBytes"

after 45 minutes we had:

"GenerationName": "Large Object Heap",
"FragmentationAfterHumane": "103.57 MBytes",
"FragmentationBeforeHumane": "103.57 MBytes",
"SizeAfterHumane": "32.61 GBytes",
"SizeBeforeHumane": "32.61 GBytes"

...

"HeapSizeHumane": "36.35 GBytes",
"HighMemoryLoadThresholdHumane": "28.21 GBytes",
"Index": 623,
"MemoryLoadHumane": "26.01 GBytes",
"PauseDurations": [
  "00:00:00.0488880",
  "00:00:00"
],
"PauseTimePercentage": 0.13,
"PinnedObjectsCount": 2,
"PromotedHumane": "58.72 MBytes",
"TotalAvailableMemoryHumane": "31.34 GBytes",
"TotalCommittedHumane": "37.47 GBytes"

Then I forced a GC twice (with GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce and then GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced)):

{"BeforeGC":"36.254 GBytes","AfterGC":"36.025 GBytes","Freed":"233.86 MBytes","DurationInMs":81,"PinnedObjectsCount":6,"FinalizationPendingCount":409,"PauseTimePercentage":"0.16%"}
{"BeforeGC":"37.045 GBytes","AfterGC":"1.157 GBytes","Freed":"35.887 GBytes","DurationInMs":194,"PinnedObjectsCount":4,"FinalizationPendingCount":550,"PauseTimePercentage":"0.16%"}

which resulted in the following LOH stats:

"GenerationName": "Large Object Heap",
"FragmentationAfterHumane": "66.03 KBytes",
"FragmentationBeforeHumane": "66.03 KBytes",
"SizeAfterHumane": "1.28 GBytes",
"SizeBeforeHumane": "1.28 GBytes"

...

"HeapSizeHumane": "1.67 GBytes",
"HighMemoryLoadThresholdHumane": "28.21 GBytes",
"Index": 638,
"MemoryLoadBytes": 26921294233,
"MemoryLoadHumane": "25.07 GBytes",
"PauseDurations": [
  "00:00:00.0588390",
  "00:00:00"
],
"PauseTimePercentage": 0.16,
"PinnedObjectsCount": 2,
"PromotedHumane": "46.65 MBytes",
"TotalAvailableMemoryHumane": "31.34 GBytes",
"TotalCommittedHumane": "8.15 GBytes"

Image


mangod9 commented 1 week ago

Thanks for sharing the trace. Did the issue start occurring after moving to .NET 8, or is it related to some other change?

arekpalinski commented 1 week ago

The instance in question has had some memory issues for quite a long time (there are three nodes and we have a failover mechanism, so an occasional restart of one node due to OOM doesn't affect us much). It was happening on .NET 7 as well (I don't know if it was related to the LOH in the past; we didn't inspect that). I'd say it is not related to the .NET 8 update. Also, we don't see any correlation with our own changes.

More likely it's happening more often recently due to changes in the user's data in the database. Queries might return a different set of results. It might be some allocation pattern for specific data and specific queries.

markples commented 6 days ago

In the first trace, the LOH is growing over time, and BGC (background GC) is finding 99% of the LOH to be live, including at GC #85. Near the end of that trace, at #92 (another BGC; the intervening ones are all gen0 and gen1), 90% of the LOH is found to be dead. This shrinks the LOH size from 26GB to 11GB, but there is 76% fragmentation. Then at GC #96, a blocking (i.e. non-concurrent) GC is done, which reduces the size to 5GB and eliminates the fragmentation.

I wanted to double-check that no intervention was done here - in other words that the reclamation of memory worked fairly normally. (except possibly the large drop in memory - more on that later)

In the second trace, LOH is again growing over time, and it looks like nothing interesting happens until GC is forced. There are 3 induced GCs in the log. The first two do very little, and then the third reclaims most of the memory. LOH drops from 35GB to 1GB, and no fragmentation is left behind. This seems both normal and peculiar. The GC is trying to collect but finds everything live, and there isn't any fragmentation to clean up. But then all of a sudden, the state of the heap changes, and reclamation works well. It seems that either something is happening in the program, or something is different about the full blocking GC, which I don't think should usually happen.

So one question about that - when you force GC, do you do any other kind of cleanup?

(mainly for @mangod9: this is memory reported in use by the GC. I do not see growing freelist or decommit list sizes here.)

markples commented 6 days ago

There is one oddity that I noticed in the traces. There are a lot of handles (over 200k) at the end of the second trace. I also see a MarkWithType GC event listing basically the full heap under dependent handles. Handle processing uses different algorithms for BGC vs. non-BGC, though they shouldn't scan differently (and this wouldn't explain why it takes 3 forced GCs).

However, dependent handles are associated with ConditionalWeakTable, which can be used directly or indirectly (through places such as System.Transactions). CWT uses a two-pass finalization system for cleanup, which can have problems (see #50683 if interested in the details). Could you please do a memory profile and see if you have CWT in the system, and specifically its internal Container class? This can take multiple gen2s to clean up, and modifications to the CWT can delay the process.
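To illustrate the pattern being asked about (a hypothetical sketch, not the Lucene.Net code): values added to a ConditionalWeakTable are kept alive through dependent handles for as long as their keys are alive, and per the two-pass finalization behavior described above, the table's internal Container storage is only reclaimed lazily, which is why it can take multiple gen2 collections to clean up.

using System.Runtime.CompilerServices;

// Hypothetical illustration; the names below are made up and only stand in
// for e.g. Lucene.Net's field-cache usage.
class Reader { }                                              // key, e.g. an index reader
class CacheEntry { public byte[] Data = new byte[200_000]; }  // large value, lands on the LOH

static class CwtCacheSketch
{
    static readonly ConditionalWeakTable<Reader, CacheEntry> Cache = new();

    // The CacheEntry stays reachable (via a dependent handle) as long as the
    // Reader key is alive; dropping the key makes the pair collectible, but the
    // CWT's internal Container storage is cleaned up lazily via finalization.
    public static CacheEntry GetOrAdd(Reader reader) =>
        Cache.GetValue(reader, _ => new CacheEntry());
}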

arekpalinski commented 5 days ago

I wanted to double-check that no intervention was done here - in other words that the reclamation of memory worked fairly normally. (except possibly the large drop in memory - more on that later)

No external intervention was done there.

So one question about that - when you force GC, do you do any other kind of cleanup?

No, I forced it manually (with a dedicated HTTP endpoint - /admin/memory/gc?loh=true). https://github.com/ravendb/ravendb/blob/fc318e40649881a130450611c98d1417a5eaf019/src/Raven.Server/Documents/Handlers/Admin/AdminMemoryHandler.cs#L14-L22 No other cleanup was done at that time.


Regarding ConditionalWeakTable usage: yes, we do have usages of it via the search engine we use, Lucene.Net. It uses CWT for internal caching purposes quite heavily: https://github.com/search?q=repo%3Aravendb%2Flucenenet%20ConditionalWeakTable&type=code

Once I opened the dotMemory snapshot that was collected for the initial report of this issue (note that, as I already mentioned, we can provide the .nettrace and dotMemory snapshot files privately if that would be helpful - I can share it with you), I see:

Image

After checking the instances of the first entry (System.Runtime.CompilerServices.ConditionalWeakTable+Container<Object, ConcurrentDictionary<FieldCacheImpl+Entry, Object>>)

Image

which points to:

https://github.com/ravendb/lucenenet/blob/e791ebbb944e7d67814e189e9875a2ca515b8e9f/src/Lucene.Net/Search/FieldCacheImpl.cs#L49-L58

markples commented 2 days ago

Apologies for the delay - I'm still recovering from the windstorm in the NW US. Could you please clarify which of the reports you show are from a memory dump and which are from dotMemory snapshots? I'm interested in this for two reasons: (1) https://www.jetbrains.com/help/dotmemory/Analyzing_GC_Roots.html#f-reachable-queue says we won't be able to see anything on the finalization list, and (2) if the induced GC is collecting a lot, then we won't be able to see what was there before.

I'm happy to look at a memory dump (if you are willing and able to share one) through https://developercommunity.visualstudio.com/dotnet. I'd say that ConditionalWeakTable is still just a hunch. The last info that you shared of course doesn't point to a lot of memory, so it could only be CWT if the reporting of dead/finalizable/already-collected objects is off, but since CWT's double finalization pass is a way to see this "multiple gen2s in order to reclaim memory" behavior, I'm not ready to rule it out either. Thank you!