dotnet/runtime

Process memory exhaustion under region-based GC heap mode #103582

Open baal2000 opened 3 months ago

baal2000 commented 3 months ago

Description

After upgrading a production environment from .NET 7 with the segment-based libclrgc.so GC heap (used because of https://github.com/dotnet/runtime/issues/86183) to a default .NET 8 configuration, some of the processes experienced sporadic out-of-memory crashes. The environment is a pool of large-memory-footprint processes running in Docker containers, one container per Ubuntu 20.04 VM host, with hosts ranging from ~30 GB up to 1 TB of RAM.
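
For context, the segment-based GC is selected by loading the standalone GC library that ships next to the runtime. A minimal sketch of the relevant setting (the exact way it is wired into our container images may differ):

    # environment variable form: load the standalone segment-based GC
    DOTNET_GCName=libclrgc.so

    # runtimeconfig.json equivalent
    # "configProperties": { "System.GC.Name": "libclrgc.so" }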

Analysis

A typical pattern before the upgrade was the process cycling: memory usage rising to the available limit, a deep gen 2 GC bringing it down, then memory rising to the limit again, another gen 2 GC, and so on. Under .NET 8 (standard region-based GC) the pattern changed to an almost straight line ending in out-of-memory. Reverting to .NET 7 with the segment-based libclrgc.so GC heap restored the original pattern and the process became stable.
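
As a side note, a comparable live view of this memory pattern can be captured with the built-in runtime counters, for example:

    # watch GC heap size and working set for a running process (PID is a placeholder)
    dotnet-counters monitor --process-id <pid> --counters System.Runtime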

The heap sizes looked similar under both scenarios. After taking a full memory dump for both scenarios, the top .NET object usage also looked similar:

.NET 7 (segment-based GC heap)

| Kilobytes  | Object count | Type |
|-----------:|-------------:|------|
| 23,402,813 | 272,323,647  | Class1 |
| 11,726,220 | 63,078       | Class2[] |
| 5,024,993  | 117,549      | Class1[] |
| 3,877,060  | 173          | $.ValueTuple<Class2,Class1>[] |
| 3,214,924  | 22,861,688   | Class3 |
| 850,206    | 13,871       | $.Collections.Generic.HashSet+Entry<Class4>[] |
| 698,673    | 1,527,108    | $.Int32[] |
| 664,435    | 28,349,262   | Class5 |
| 664,193    | 11,920,171   | $.String |
| 608,850    | 3,247,205    | Class6 |
| ...        | ...          | ... |
| 9,165      | 233,684      | Free |
| 60,011,021 | 453,466,459  | TOTAL |

.NET 8 (region-based GC heap)

| Kilobytes  | Object count | Type |
|-----------:|-------------:|------|
| 26,227,185 | 305,189,069  | Class1 |
| 12,293,589 | 57,454       | Class2[] |
| 3,988,836  | 80,069       | Class1[] |
| 3,074,227  | 21,861,170   | Class3 |
| 1,613,392  | 91           | $.ValueTuple<Class2,Class1>[] |
| 1,500,896  | 12,276       | $.Collections.Generic.HashSet+Entry<Class4>[] |
| 633,966    | 27,049,234   | Class5 |
| 615,378    | 3,282,018    | Class6 |
| 539,491    | 758,663      | $.Int32[] |
| ...        | ...          | ... |
| 29,343     | 23,047       | Free |
| 57,974,870 | 451,380,242  | TOTAL |
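
For reference, a per-type breakdown similar to the tables above can also be produced with dotnet-dump; a minimal sketch (PID and path are placeholders):

    # collect a full dump of the running process
    dotnet-dump collect -p <pid> --type Full -o /tmp/app_full.dmp

    # open the dump and print managed heap usage grouped by type
    dotnet-dump analyze /tmp/app_full.dmp
    > dumpheap -stat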

Scaling the machine up to 1.5x+ its original size eliminated the OOM crashes, but the larger size also came with 1.5x+ the CPU cores, which in turn cost 1.5x+ in dollars for the more expensive cloud infrastructure.

To test the theory that the difference was due to the GC heap mode rather than the framework version change from .NET 7 to .NET 8, we switched another server that had experienced a similar issue through the following configurations: .NET 7 libclrgc.so -> .NET 8 default -> .NET 8 libclrgc.so -> .NET 8 with DOTNET_GCDynamicAdaptationMode. The test confirmed that the pattern depended only on the segment-based (libclrgc.so) vs. region-based heap. Switching on DATAS for the region-based heap did not affect the pattern.
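
For reference, a minimal sketch of the DATAS variant in that matrix, assuming the environment-variable form (the runtimeconfig.json counterpart is System.GC.DynamicAdaptationMode):

    # enable DATAS (Dynamic Adaptation To Application Sizes) on the region-based GC
    DOTNET_GCDynamicAdaptationMode=1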

Configuration

Regression?

Feels like one. It could be triggered when a process is already running close to its maximum available memory limit with no space to spare. The region-based GC heap might be optimizing its activity for other factors, for instance minimizing GC pauses or preserving its pools of free memory regions, without recognizing that there is a bigger, more urgent problem of insufficient memory that needs to be dealt with.

Note that this is the second critical issue (after https://github.com/dotnet/runtime/issues/97316) that we have experienced with the region-based GC heap mode, and it needs to be addressed by the team for the new GC mode to deliver on its promise of better performance.

dotnet-policy-service[bot] commented 3 months ago

Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.

mangod9 commented 3 months ago

thanks for reporting the issue @baal2000. Would you be able to capture GCCollectOnly traces for both .NET 8 regions and with clrgc (segments) so we could investigate the differences?

Also some clarifications:

  1. Are the OOMs expected since the working set is getting close to the limits with Regions?
  2. Are you observing any latency differences, or is the performance (other than the memory concerns) comparable?
baal2000 commented 3 months ago

@mangod9

baal2000 commented 3 months ago

CC: @cshung

cshung commented 3 months ago

The dotnet-trace command is available here:

https://github.com/Maoni0/mem-doc/blob/master/doc/.NETMemoryPerformanceAnalysis.md#how-to-collect-top-level-gc-metrics

I am not sure I can specify a duration for you. The duration needs to be long enough to look at the growth over time, but not so long that it stresses your disk space. In general, GCCollect traces are supposed to be lightweight, logging just before and after GCs, so you should be able to keep them on for hours without issues.

markples commented 3 months ago

I think an hour would be fine here. We'd like to see several GCs of the increase under regions and several of the oscillations under segments, so:

    dotnet trace collect -p <pid> -o <outputpath with .nettrace extension> --profile gc-collect --duration 01:00:00

Additionally, since you have a dump, we could also use a few values from that.

?? coreclr!SVR::gc_heap::global_regions_to_decommit[0].size_committed_in_free_regions
?? coreclr!SVR::gc_heap::global_regions_to_decommit[1].size_committed_in_free_regions
?? coreclr!SVR::gc_heap::global_regions_to_decommit[2].size_committed_in_free_regions
?? coreclr!SVR::gc_heap::global_free_huge_regions.size_committed_in_free_regions

// for a sampling of x --  they should be similar though outliers would be interesting
?? coreclr!SVR::g_heaps[x]->free_regions[0].size_committed_in_free_regions
?? coreclr!SVR::g_heaps[x]->free_regions[1].size_committed_in_free_regions
?? coreclr!SVR::g_heaps[x]->free_regions[2].size_committed_in_free_regions

I believe you'll need to replace ?? with p and drop the coreclr! prefix. Thank you!
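
A sketch of what that substitution would look like in an lldb session opened against the dump (showing heap index 0; iterate the index over the heaps for the sampling):

    p SVR::gc_heap::global_regions_to_decommit[0].size_committed_in_free_regions
    p SVR::gc_heap::global_regions_to_decommit[1].size_committed_in_free_regions
    p SVR::gc_heap::global_regions_to_decommit[2].size_committed_in_free_regions
    p SVR::gc_heap::global_free_huge_regions.size_committed_in_free_regions

    p SVR::g_heaps[0]->free_regions[0].size_committed_in_free_regions
    p SVR::g_heaps[0]->free_regions[1].size_committed_in_free_regions
    p SVR::g_heaps[0]->free_regions[2].size_committed_in_free_regions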