arekpalinski opened this issue 1 week ago
Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.
Would you be able to collect and share a gc-collect only trace on this instance? Also is this a regression after an app (or .NET) update?
Please find the output of dotnet-trace for the gc-collect profile:
https://drive.google.com/file/d/1qanKgND8fvBt9WTLgOaiumfJugMQuZIn/view?usp=sharing
The trace started to be collected right after restarting the RavenDB server. The same settings as previously, DOTNET_GCConserveMemory=8 and "System.GC.RetainVM": false, were applied.
Please find another gc-collect trace, collected after a few hours:
https://drive.google.com/file/d/16WcNFXf5fSNehT7zc_AZaSasFBHFSQGL/view?usp=sharing
When the trace collection started we had the following LOH stats (note the machine has 32 GB of memory):
"GenerationName": "Large Object Heap",
"FragmentationAfterHumane": "85.67 MBytes",
"FragmentationBeforeHumane": "85.67 MBytes",
"SizeAfterHumane": "31.14 GBytes",
"SizeBeforeHumane": "31.14 GBytes"
...
"HeapSizeHumane": "34.73 GBytes",
"HighMemoryLoadThresholdHumane": "28.21 GBytes",
"Index": 497,
"MemoryLoadHumane": "24.76 GBytes",
"PauseDurations": [
"00:00:00.0172980",
"00:00:00"
],
"PauseTimePercentage": 0.13,
"PinnedObjectsCount": 2,
"PromotedHumane": "32.04 MBytes",
"TotalAvailableMemoryHumane": "31.34 GBytes",
"TotalCommittedHumane": "35.47 GBytes"
after 45 minutes we had:
"GenerationName": "Large Object Heap",
"FragmentationAfterHumane": "103.57 MBytes",
"FragmentationBeforeHumane": "103.57 MBytes",
"SizeAfterHumane": "32.61 GBytes",
"SizeBeforeHumane": "32.61 GBytes"
...
"HeapSizeHumane": "36.35 GBytes",
"HighMemoryLoadThresholdHumane": "28.21 GBytes",
"Index": 623,
"MemoryLoadHumane": "26.01 GBytes",
"PauseDurations": [
"00:00:00.0488880",
"00:00:00"
],
"PauseTimePercentage": 0.13,
"PinnedObjectsCount": 2,
"PromotedHumane": "58.72 MBytes",
"TotalAvailableMemoryHumane": "31.34 GBytes",
"TotalCommittedHumane": "37.47 GBytes"
Then I forced a GC two times (by setting GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce and then calling GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced)).
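For reference, this is essentially the sequence, shown with an illustrative before/after measurement around it (just the two framework calls named above, not our actual endpoint code):

```csharp
using System;
using System.Runtime;

static class ForceLohCompactingGc
{
    // Sketch: request LOH compaction for the next blocking gen2 GC, force the
    // collection, and report an approximate before/after managed heap size.
    public static void Run()
    {
        long before = GC.GetTotalMemory(forceFullCollection: false);

        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);

        long after = GC.GetTotalMemory(forceFullCollection: false);
        Console.WriteLine($"Before: {before:N0} bytes, After: {after:N0} bytes, Freed: {before - after:N0} bytes");
    }
}
```

The two forced collections produced the following log entries: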
{"BeforeGC":"36.254 GBytes","AfterGC":"36.025 GBytes","Freed":"233.86 MBytes","DurationInMs":81,"PinnedObjectsCount":6,"FinalizationPendingCount":409,"PauseTimePercentage":"0.16%"}
{"BeforeGC":"37.045 GBytes","AfterGC":"1.157 GBytes","Freed":"35.887 GBytes","DurationInMs":194,"PinnedObjectsCount":4,"FinalizationPendingCount":550,"PauseTimePercentage":"0.16%"}
which resulted in the following LOH stats:
"GenerationName": "Large Object Heap",
"FragmentationAfterHumane": "66.03 KBytes",
"FragmentationBeforeHumane": "66.03 KBytes",
"SizeAfterHumane": "1.28 GBytes",
"SizeBeforeHumane": "1.28 GBytes"
...
"HeapSizeHumane": "1.67 GBytes",
"HighMemoryLoadThresholdHumane": "28.21 GBytes",
"Index": 638,
"MemoryLoadBytes": 26921294233,
"MemoryLoadHumane": "25.07 GBytes",
"PauseDurations": [
"00:00:00.0588390",
"00:00:00"
],
"PauseTimePercentage": 0.16,
"PinnedObjectsCount": 2,
"PromotedHumane": "46.65 MBytes",
"TotalAvailableMemoryHumane": "31.34 GBytes",
"TotalCommittedHumane": "8.15 GBytes"
Thanks for sharing the trace. Did the issue start occurring after moving to .NET 8, or is it related to some other change?
The instance in question has had memory issues for quite a long time (there are three nodes and we have a failover mechanism, so an occasional restart of one node due to OOM doesn't affect us much). It was already happening on .NET 7 (I don't know whether it was related to the LOH back then; we didn't inspect that). I'd say it is not related to the .NET 8 update, and we don't see any correlation with our own changes.
More likely it's happening more often recently due to changes in the user's data in the database. Queries might return a different set of results, so it might be some allocation pattern specific to particular data and queries.
In the first trace, the LOH is growing over time, and BGC (background GC) is finding 99% of the LOH to be live, including at GC #85. Near the end of that trace, at #92 (another BGC; the intervening ones are all gen0 and gen1), 90% of the LOH is found to be dead. This shrinks the LOH size from 26GB to 11GB, but there is 76% fragmentation. Then at GC #96, a blocking (non-concurrent) GC is done, which reduces the size to 5GB and eliminates the fragmentation.
I wanted to double-check that no intervention was done here - in other words that the reclamation of memory worked fairly normally. (except possibly the large drop in memory - more on that later)
In the second trace, LOH is again growing over time, and it looks like nothing interesting happens until GC is forced. There are 3 induced GCs in the log. The first two do very little, and then the third reclaims most of the memory. LOH drops from 35GB to 1GB, and no fragmentation is left behind. This seems both normal and peculiar. The GC is trying to collect but finds everything live, and there isn't any fragmentation to clean up. But then all of a sudden, the state of the heap changes, and reclamation works well. It seems that either something is happening in the program, or something is different about the full blocking GC, which I don't think should usually happen.
So one question about that - when you force GC, do you do any other kind of cleanup?
(mainly for @mangod9: this is memory reported in use by the GC. I do not see growing freelist or decommit list sizes here.)
There is one oddity that I noticed in the traces. There are a lot of handles (over 200k) at the end of the second trace. I also see a MarkWithType GC event listing basically the full heap under dependent handles. Handle processing has different algorithms for BGC vs not, though they shouldn't scan differently (and this wouldn't explain why it takes 3 forced GCs).
However, dependent handles are associated with ConditionalWeakTable, which can be used directly or indirectly (through places such as System.Transactions). CWT uses a two-pass finalization system for cleanup, which can have problems (see #50683 if interested in details). Could you please do a memory profile and see if you have CWT in the system, and specifically its internal Container class? This can take multiple gen2s to clean, and modifications to the CWT can delay the process.
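To make that concrete, here is a small self-contained sketch (my own illustration, not runtime or RavenDB code) that drops a populated ConditionalWeakTable and prints the managed heap size after each forced full GC plus finalization pass; because the Container is cleaned up via finalization, the memory typically isn't fully reclaimed on the first pass:

```csharp
using System;
using System.Runtime.CompilerServices;

class CwtCleanupDemo
{
    static void Main()
    {
        // Keys stay alive for the whole run; only the table itself is dropped.
        object[] keys = new object[1000];
        for (int i = 0; i < keys.Length; i++) keys[i] = new object();

        var table = new ConditionalWeakTable<object, byte[]>();
        foreach (var key in keys)
            table.Add(key, new byte[100_000]);   // LOH-sized values held via dependent handles

        Console.WriteLine($"Populated: {GC.GetTotalMemory(false):N0} bytes");

        table = null;   // drop the table; the keys remain reachable

        // Observe how many full GC + finalization cycles it takes before the
        // values actually go away (the Container is released via finalization).
        for (int pass = 1; pass <= 4; pass++)
        {
            GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);
            GC.WaitForPendingFinalizers();
            Console.WriteLine($"After pass {pass}: {GC.GetTotalMemory(false):N0} bytes");
        }

        GC.KeepAlive(keys);
    }
}
```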
I wanted to double-check that no intervention was done here - in other words that the reclamation of memory worked fairly normally. (except possibly the large drop in memory - more on that later)
No external intervention was done there.
So one question about that - when you force GC, do you do any other kind of cleanup?
No, I forced it manually (with a dedicated HTTP endpoint: /admin/memory/gc?loh=true).
https://github.com/ravendb/ravendb/blob/fc318e40649881a130450611c98d1417a5eaf019/src/Raven.Server/Documents/Handlers/Admin/AdminMemoryHandler.cs#L14-L22
No other cleanup was done at that time.
Regarding ConditionalWeakTable usage: yes, we do use it via the search engine we use, Lucene.Net. It uses CWT quite heavily for internal caching purposes:
https://github.com/search?q=repo%3Aravendb%2Flucenenet%20ConditionalWeakTable&type=code
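For illustration only (this is a rough paraphrase of the pattern, not Lucene.Net's actual code), that kind of CWT-based cache looks roughly like this, with one cache dictionary kept alive per reader instance for as long as the reader itself is alive:

```csharp
using System;
using System.Collections.Concurrent;
using System.Runtime.CompilerServices;

// Rough shape of a CWT-backed per-reader cache; the names are illustrative,
// not the real Lucene.Net types.
class FieldValueCache
{
    // Key: the index reader (or similar owner). Value: cached data for that reader.
    // The cached dictionaries live exactly as long as their reader is reachable.
    private readonly ConditionalWeakTable<object, ConcurrentDictionary<string, long[]>> _cache = new();

    public long[] GetOrLoad(object reader, string field, Func<long[]> load)
    {
        var perReader = _cache.GetOrCreateValue(reader);   // creates an empty dictionary on first use
        return perReader.GetOrAdd(field, _ => load());
    }
}
```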
Once I opened the dotMemory snapshot that was collected for the initial report of this issue (as I already mentioned, we can provide the .nettrace and dotMemory snapshot files privately if that would be helpful; I can share them with you), I see:
After checking the instances of the first entry (System.Runtime.CompilerServices.ConditionalWeakTable+Container<Object, ConcurrentDictionary<FieldCacheImpl+Entry, Object>>), which points to:
Apologies for the delay - I'm still recovering from the windstorm in the NW US. Could you please clarify which of the reports that you show are from a memory dump and which are from snapshots? I'm interested in this for two reasons: (1) https://www.jetbrains.com/help/dotmemory/Analyzing_GC_Roots.html#f-reachable-queue says we won't be able to see anything on the finalization list, and (2) if the induced GC is collecting a lot, then we won't be able to see what was there before.
I'm happy to look at a memory dump (if you are willing and able to share that) shared through https://developercommunity.visualstudio.com/dotnet. I'd say that ConditionalWeakTable is still just a hunch. The last info that you shared of course doesn't point to a lot of memory, so it could only be CWT if the reporting of dead/finalizable/already-collected objects is off, but since CWT's double finalization pass is a way to see this "multiple gen2s in order to reclaim memory" behavior, I'm not ready to rule it out either. Thank you!
Description
I'm from the RavenDB team and would like to share a report about potentially peculiar behavior of managed memory allocations, in particular the size, utilization, and fragmentation of the Large Object Heap.
We observe the following issue on one of the servers in RavenDB Cloud. I assume it must be specific to the type of queries this instance is handling, as it's not a typical pattern that we see anywhere else. The server is crashing due to OOM; at that time we see high managed memory allocation (yellow on the graph below):
The server has a steady load of ~100-150 requests/sec, 24/7, and we see the same pattern of managed allocations. We use the following GC settings:
Our first step was to check what takes that space, so we collected a dotMemory dump for analysis of managed memory when it was over 25 GB. We found there that most of the memory is taken by the LOH, although out of the 24 GB it occupied, only 136 MB was used.
When looking at the LOH details in dotMemory we can see that most of the LOH consists of empty areas (no objects allocated, 0% Utilization, 0% Fragmentation).
We do have a mechanism where we force LOH compaction from time to time during background GC (once it gets bigger than 25% of physical memory):
https://github.com/ravendb/ravendb/blob/80ba404673f502f2db71269ecba1a135e56e7c95/src/Sparrow/LowMemory/LowMemoryNotification.cs#L150-L154
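Roughly, that mechanism does something like this (a simplified sketch; the real logic is at the link above, and the threshold check and names here are approximations):

```csharp
using System;
using System.Runtime;

static class LohCompactionScheduler
{
    // Simplified illustration: if the LOH has grown past ~25% of physical memory,
    // ask the GC to compact it on the next blocking gen2 collection.
    public static void MaybeScheduleLohCompaction(long totalPhysicalMemoryBytes)
    {
        GCMemoryInfo info = GC.GetGCMemoryInfo();
        long lohSize = info.GenerationInfo[3].SizeAfterBytes;   // index 3 = LOH

        if (lohSize > totalPhysicalMemoryBytes / 4)
        {
            GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        }
    }
}
```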
As a next step we tried to experiment with some GC settings, collect GC info with the help of dotnet-trace, and check the info provided by GC.GetGCMemoryInfo().
We found that with DOTNET_GCConserveMemory=8 (so the LOH is compacted if it has too much fragmentation) and "System.GC.RetainVM": false, the system behaves stably. We no longer experience the oom-killer. The memory usage graph looks as follows for the last couple of days:
We're still wondering, though, whether the LOH sizes we see there are expected. We don't see an LOH of this size anywhere else. We do know that allocations on the LOH are mostly due to heavy usage of arrays by the Lucene search engine (https://github.com/ravendb/lucenenet).
We have also noticed that when the size of the LOH is very big, the fragmentation reported for it is small (based on the info returned by GC.GetGCMemoryInfo() or from the .nettrace file). For example, we could see 0% fragmentation of the LOH (the induced GCs were forced intentionally during the dotnet-trace recording):
But shortly after that we collected a dotMemory snapshot where, same as above, we see an almost empty LOH.
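As an aside on how we read that number, a small sketch, assuming the fragmentation percentage is simply the reported LOH fragmentation bytes divided by the LOH size (same GC.GetGCMemoryInfo() data as above):

```csharp
using System;

static class LohFragmentation
{
    // Assumption: "LOH fragmentation %" = free space inside the LOH divided by its size.
    public static double GetLohFragmentationPercent()
    {
        GCGenerationInfo loh = GC.GetGCMemoryInfo().GenerationInfo[3];   // index 3 = LOH
        return loh.SizeAfterBytes == 0
            ? 0
            : 100.0 * loh.FragmentationAfterBytes / loh.SizeAfterBytes;
    }
}
```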
We also have another dotnet-trace session where we started to collect GC info (--profile gc-verbose) when the LOH was:
After 1h 20m it was:
Eventually the size got reduced to 10 GB (and 9 GB of fragmentation):
In PerfView it's seen as:
Then we collected a dotMemory snapshot again and the LOH got reduced to 3.14 GB (due to the GC forced by dotMemory, so the LOH got compacted, as I understand):
We can provide the .nettrace and dotMemory snapshot files privately if that would be helpful.
Reproduction Steps
It's reproduced only on a specific instance. We cannot provide access, but we're able to collect debug assets.
Expected behavior
The LOH will not reach a size of >20 GB of memory.
Actual behavior
The LOH reaches >20 GB of memory, while fragmentation and utilization are very low.
Regression?
No response
Known Workarounds
No response
Configuration
No response
Other information
No response