baal2000 opened 5 months ago
Tagging subscribers to this area: @dotnet/gc. See info in area-owners.md if you want to be subscribed.
Thanks for reporting the issue @baal2000. Would you be able to capture GCCollectOnly traces for both .NET 8 regions and with clrgc (segments) so we could investigate the differences?
Also some clarifications: @mangod9, which dotnet-trace command with parameters (including for how long) would you like us to run for both .NET 8 regions and with clrgc (segments)? CC: @cshung
The dotnet-trace command is available here:
I am not sure I can specify a duration for you. The duration needs to be long enough to observe the growth over time, but not so long that it stresses your disk space. In general, GCCollect traces are supposed to be lightweight, logging just before and after GCs, so you should be able to turn it on for hours without issues.
I think an hour would be fine here. We'd like to see several GCs during the increase under regions and several of the oscillations under segments, so something like the command below.
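The PID and output path are placeholders to fill in for the target process:
dotnet trace collect -p <pid> -o <outputpath with .nettrace extension> --profile gc-collect --duration 01:00:00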
Additionally, since you have a dump, we could also use a few values from that.
?? coreclr!SVR::gc_heap::global_regions_to_decommit[0].size_committed_in_free_regions
?? coreclr!SVR::gc_heap::global_regions_to_decommit[1].size_committed_in_free_regions
?? coreclr!SVR::gc_heap::global_regions_to_decommit[2].size_committed_in_free_regions
?? coreclr!SVR::gc_heap::global_free_huge_regions.size_committed_in_free_regions
// for a sampling of x -- they should be similar though outliers would be interesting
?? coreclr!SVR::g_heaps[x]->free_regions[0].size_committed_in_free_regions
?? coreclr!SVR::g_heaps[x]->free_regions[1].size_committed_in_free_regions
?? coreclr!SVR::g_heaps[x]->free_regions[2].size_committed_in_free_regions
I believe you'll need to replace the ?? with p and drop the coreclr! prefix. Thank you!
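In other words, in lldb (or a similar debugger) the queries above would presumably look something like this, a sketch assuming the SVR symbols resolve once the coreclr! module prefix is dropped:
p SVR::gc_heap::global_regions_to_decommit[0].size_committed_in_free_regions
p SVR::gc_heap::global_regions_to_decommit[1].size_committed_in_free_regions
p SVR::gc_heap::global_regions_to_decommit[2].size_committed_in_free_regions
p SVR::gc_heap::global_free_huge_regions.size_committed_in_free_regions
// for a sampling of heap index x
p SVR::g_heaps[x]->free_regions[0].size_committed_in_free_regions
p SVR::g_heaps[x]->free_regions[1].size_committed_in_free_regions
p SVR::g_heaps[x]->free_regions[2].size_committed_in_free_regions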
Description
After upgrading a production environment, comprising a pool of large-memory-footprint processes running in Docker containers (one container per Ubuntu 20.04 VM host, ranging from ~30 GB up to 1 TB RAM), from .NET 7 with the segment-based libclrgc.so GC heap (due to https://github.com/dotnet/runtime/issues/86183) to a default .NET 8 configuration, some of the processes experienced sporadic out-of-memory crashes.
Analysis
A typical pattern before was the process cycling between hitting the available memory limit, then a deep gen 2 GC, then rising to the limit again, then another gen 2 GC, and so on. Under .NET 8 (standard region-based GC) the pattern changed to an almost straight line ending in out of memory. Reverting to .NET 7 with the segment-based libclrgc.so GC heap reversed the pattern and the process became stable.
The heap sizes looked similar under both scenarios:
After taking a full memory dump for both scenarios, the top .NET native object usage also looked similar:
.NET 7 (segment-based GC heap)
.NET 8 (region-based GC heap)
Scaling the machine up to 1.5x+ its original size eliminated the OOM crashes, but required 1.5x+ more CPU cores, which in turn cost 1.5x+ in dollars for a more expensive cloud infrastructure.
To test the theory that the difference was due to the GC heap mode and not to the framework version change from .NET 7 to .NET 8, we tried switching from .NET 7 libclrgc.so -> .NET 8 default -> .NET 8 libclrgc.so -> .NET 8 DOTNET_GCDynamicAdaptationMode on another server that had experienced a similar issue. The test confirmed that the pattern depended only on the segment-based (libclrgc.so) vs. region-based heap. Switching to DATAS for the region-based heap did not affect the pattern (the settings for each mode are sketched under Configuration below).
Configuration
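We understand the GC modes above map to environment settings along these lines (a sketch based on the standard .NET knobs, shown for clarity rather than copied verbatim from our deployment; DOTNET_GCName is the usual way to load the standalone GC):
Segment-based standalone GC (.NET 7 and .NET 8 tests): DOTNET_GCName=libclrgc.so
.NET 8 default region-based GC: no GC-related environment variables set
.NET 8 region-based GC with DATAS enabled: DOTNET_GCDynamicAdaptationMode=1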
Regression?
Feels like one. It could be triggered when a process is already running close to its maximum available memory limit with no space to spare. The region-based GC heap might be optimizing its activity for other factors, for instance minimizing GC pauses, or be busy preserving its pools of memory regions, without realizing that there is a bigger issue of insufficient memory at hand that needs to be dealt with urgently.
Note that this is the second critical issue, after https://github.com/dotnet/runtime/issues/97316, that we have experienced with the region-based GC heap mode; it needs to be addressed by the team for the new GC mode to deliver on its promise of better performance.