dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

[please help] Unmanaged memory only increases but does not decrease #6556

Closed: lfzm closed this issue 2 years ago

lfzm commented 4 years ago

Orleans version: v3.1.7, .NET Core: v3.1

[screenshot: memory usage]

I can provide a dotMemory snapshot.

ReubenBond commented 4 years ago

Perhaps the GC is not releasing that memory back to Windows. What are your GC settings? Is ServerGC on?
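
For anyone checking their settings: in .NET Core 3.1 the GC flavour is fixed at startup, typically via the app's runtimeconfig.json (or the equivalent ServerGarbageCollection MSBuild property). A minimal sketch with placeholder values, not a recommendation:

  {
    "runtimeOptions": {
      "configProperties": {
        "System.GC.Server": true,
        "System.GC.Concurrent": true
      }
    }
  }

`System.GC.Server` set to `true` enables Server GC, which trades higher memory usage (one heap per logical core by default) for throughput.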

lfzm commented 4 years ago

@ReubenBond Yes, ServerGC is turned on.

HermesNew commented 4 years ago

Turning off ServerGC may solve the issue.

lfzm commented 4 years ago

It can be resolved by turning off ServerGC. Does Orleans reserve memory space in advance for performance?

ReubenBond commented 4 years ago

Orleans does not perform any special unmanaged memory allocations in order to reserve space. There are some buffer pools, but those are managed and grow dynamically.

pipermatt commented 4 years ago

I may be having this same issue, though I am on Orleans v3.1.6. I had upgraded from Orleans v3.1.0 ...

[graph: memory over time, showing the deploy and the rollback]

I'll let you guess where the deploy and the rollback were... 😂

sergeybykov commented 4 years ago

@pipermatt What's the deployment environment here - Windows/Linux, .NET Core/Full Framework, which version?

I don't see anything in the fixes between 3.1.0 and 3.1.6 that could obviously change memory allocation profile. Did you upgrade anything else at the same time by chance?

Have you tried taking and analyzing memory dumps to see what the memory is used by?

pipermatt commented 4 years ago

Linux, .NET Core 3.1... On average memory utilization seems to increase at a rate of about 15MB/hr... and there's just SO much allocated it's difficult to wade through it all via command line in Linux. I'm not an expert in the profiling tools, that's for sure.

I seem to be able to reproduce the behavior locally on my MBP as well... but dotnet-dump doesn't support Mac OS X. 😏 So I've been ssh'ing into a test Linux instance to try to diagnose. Tomorrow I may grab a Windows machine so I have the full benefit of PerfView, dotTrace, etc... but first, since I can reproduce locally, I'm methodically stripping down our configuration to as barebones as possible one feature at a time.

We did upgrade several other libraries that are called by our grain code, but the memory leak is apparent on an idle silo that isn't taking any traffic and doesn't have any of our grains instantiated yet.

We'll get it figured out... and will report back. 👍

pipermatt commented 4 years ago

After stripping my silo of features until it was about as basic as possible, I came to the conclusion that what I was seeing locally was a red herring and not indicative of the problem I saw in production. On a whim, I rolled forward to the release that deployed just before the available memory tanked...

[graph: memory per silo node, with a dip at the deploy]

You can see a dip where the deploy happened for each silo node, but it is humming along just fine. So now, without a real reproduction case, I'm going to have to shelve this investigation unless it rears its head again.

sergeybykov commented 4 years ago

Interesting. Did you also upgrade to 3.1.7?

pipermatt commented 4 years ago

I have not yet, though I think I also spoke too soon... available memory is going down again steadily, which matches the rate it did before (the first graph was zoomed out to a much larger time range)...

[graph: memory declining again]

sergeybykov commented 4 years ago

If memory does indeed leak over time, we need to look at memory dumps or GC profiles. @ReubenBond might have a suggestion for how to do that in a non-invasive manner.

pipermatt commented 4 years ago

Yeah, I'm working that angle now, though not on the production servers (yet). I think I'm seeing the exact same behavior with this build in my TEST environment, so I'm working on memory dumps there.

pipermatt commented 4 years ago

Update: there was another difference discovered. ;)

The version that appeared to be leaking memory had its LogLevel set to Debug... we had been running at LogLevel.Information previously. We weren't actually seeing a memory leak... we were seeing Linux allocate more and more memory to disk caching to buffer the writes to the system journal. This memory was always reclaimed when the Silo needed it, though this process itself was slow enough that we would see a spike of errors while it happened.
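
For context, the level in question is the standard Microsoft.Extensions.Logging setting, usually configured in appsettings.json (or in code); a minimal sketch, assuming the default "Logging" section layout:

  {
    "Logging": {
      "LogLevel": {
        "Default": "Information"
      }
    }
  }

Running with "Default" set to "Debug" instead of "Information" is the difference described above; Debug produces far more log output, which is what drove the heavy journal writes.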

The tidbit that we didn't understand was why, on redeploy, not ALL of the memory that had been used was freed. Now it makes perfect sense: the silo process wasn't the one using it at all; Linux itself was. Eventually the OS decreased the cache allocation after we rolled back to the version with LogLevel.Information, since it no longer needed so much memory for caching to keep up with the journal writes.

Mystery solved!

sergeybykov commented 4 years ago

Thank you for the update, @pipermatt! Makes perfect sense. This reminds me again how often misconfigured logging may cause non-obvious issues.

@lfzm Have you resolved your problem? Can we close this issue now?

HermesNew commented 4 years ago

@pipermatt Haha, the mystery has not been solved for me. My LogLevel is Warn. Before turning off ServerGC, memory usage was between 1.1 GB and 1.5 GB; after turning it off, it is between 320 MB and 350 MB. At present, I work around the high memory consumption by turning off ServerGC. This problem has been around for a long time.

ReubenBond commented 4 years ago

@HermesNew I believe this is most likely a ServerGC (.NET Core) concern, rather than something specific to Orleans. It might be worth looking at the various GC settings in the documentation here: https://docs.microsoft.com/en-us/dotnet/core/run-time-config/garbage-collector#systemgcretainvmcomplus_gcretainvm. In particular, RetainVM might be of interest.
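
For reference, RetainVM is exposed as `System.GC.RetainVM` in runtimeconfig.json (there is also a `RetainVMGarbageCollection` MSBuild property); a minimal sketch, not an Orleans-specific recommendation:

  {
    "runtimeOptions": {
      "configProperties": {
        "System.GC.RetainVM": true
      }
    }
  }

When set to `true`, segments the GC would otherwise release are kept on a standby list for reuse rather than returned to the OS, so process memory can stay high even though the heap has shrunk.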

lfzm commented 4 years ago

@ReubenBond If you start a simple Orleans silo and profile it with JetBrains dotMemory, you can see unmanaged memory being used. That is why I suspect an Orleans problem.

ReubenBond commented 4 years ago

Not necessarily. The GC deals with unmanaged memory; Orleans does not.

HermesNew commented 4 years ago

@ReubenBond Maybe this is the best GC setting: server GC `false`, concurrent GC `true`.

ReubenBond commented 4 years ago

I don't recommend it. I recommend keeping server GC enabled if you are running in production. Are you running in a Linux container? You can set a limit on the maximum amount of memory used if you want. Note that ServerGC uses one heap per core by default, but you can reduce that using another setting.
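
As a concrete sketch of the two settings mentioned above (names are from the .NET Core 3.x GC configuration docs; the values are placeholders, not recommendations):

  {
    "runtimeOptions": {
      "configProperties": {
        "System.GC.HeapCount": 4,
        "System.GC.HeapHardLimit": 209715200
      }
    }
  }

`System.GC.HeapCount` caps the number of Server GC heaps (the default is one per logical core), and `System.GC.HeapHardLimit` is a hard limit, in bytes, on total GC heap usage (200 MB in this placeholder).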

SebastianStehle commented 4 years ago

With .NET Core 3.0, the runtime should just respect the cgroup limits.

ReubenBond commented 4 years ago

Yep, by default it will allow up to 75% of the cgroup memory limit. CPU limits also play a part in determining the number of heaps. In this case, I think it's probably running on Windows, but I'm not sure.
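
If 75% of the container limit is more than you want the GC heap to use, it can be overridden; a minimal sketch using the documented percent-based setting (50 is just an example value):

  {
    "runtimeOptions": {
      "configProperties": {
        "System.GC.HeapHardLimitPercent": 50
      }
    }
  }

For example, with a 2 GiB cgroup memory limit the default caps the GC heap at roughly 1.5 GiB, while the value above would cap it at about 1 GiB.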

HermesNew commented 4 years ago

@ReubenBond It is in production, running on Windows Server 2012 R2. It is working very well now, and memory usage is well controlled after turning off ServerGC.

BTW: Orleans version: v3.1.7, .NET Core: v3.1

HermesNew commented 4 years ago

@ReubenBond I am now preparing to migrate to Linux containers, so I want to know the best settings. Based on current practice, this setting is optimal.

ReubenBond commented 4 years ago

Is that unmanaged memory causing the application to terminate? Does it grow forever, or just for a few hours? I would imagine that things hit a steady state rather quickly?

HermesNew commented 4 years ago

The greater the load, the greater the memory consumption, and the memory will not decrease. It eventually causes the application to terminate with an OOM exception.

ReubenBond commented 4 years ago

Are you saying that you are seeing OOM exceptions?

HermesNew commented 4 years ago

When the application terminates, it throws an OOM exception. I have analyzed the dump file; it is mainly unmanaged memory. My program has no memory leak, so this is what puzzles me.

HermesNew commented 4 years ago

@ReubenBond As @lfzm described above: start a simple Orleans silo and profile it with JetBrains dotMemory, and you will see the unmanaged memory.

Following @lfzm's method, this problem can be reproduced.

ReubenBond commented 4 years ago

Can you share the crash dump?

Cloud33 commented 4 years ago

I saw this here: https://blog.markvincze.com/troubleshooting-high-memory-usage-with-asp-net-core-on-kubernetes/

It seems that because dotnet mis-detects the number of CPUs in Docker, Server GC can consume a lot of memory, so in Docker it is recommended to turn off Server GC and use Workstation GC:

  <PropertyGroup> 
    <ServerGarbageCollection>false</ServerGarbageCollection>
  </PropertyGroup>

Have you heard about this problem with CPU-count detection? 😉 https://github.com/dotnet/runtime/issues/11933

HermesNew commented 4 years ago

The dump file is large. I turned off ServerGC; at present, there is no problem with excessive memory usage.

ReubenBond commented 4 years ago

@Cloud33 that advice no longer applies. The GC recognises CPU limits present in the container and adjusts heap count accordingly. Additionally, you can set the memory limit (and it's also detected from the container's cgroup).

@HermesNew You can set a memory limit if you want. If you do, do you still see OOM exceptions? How long does the application run for before crashing with an OOM?

Cloud33 commented 4 years ago

@ReubenBond Ok

srollinet commented 3 years ago

We are experiencing memory issues in production on 2 linux servers running an Orleans cluster. For now, we don't know if it is related to Orleans or not.

EDIT

oops, my bad, it wasn't the processes I thought that were eating the memory... I should learn how to read ps results in Linux :P, sorry for the post...

ghost commented 2 years ago

We are marking this issue as stale due to the lack of activity in the past six months. If there is no further activity within two weeks, this issue will be closed. You can always create a new issue based on the guidelines provided in our pinned announcement.

ghost commented 2 years ago

This issue has been marked stale for the past 30 days and is being closed due to lack of activity.