Closed: Kiechlus closed this issue 1 year ago.
Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.
| Author: | Kiechlus |
|---|---|
| Assignees: | - |
| Labels: | `area-GC-coreclr`, `untriaged` |
| Milestone: | - |
Hey @Kiechlus, do you observe that it eventually gets collected, or doesn't unless there is memory pressure on the K8s cluster?
@mangod9 When I issue another download of a 1 GB file, the memory does not rise, so it must have been collected. But without such pressure it just stays as is. On very few occasions we even got an OOM exception in such a scenario, but cannot reproduce it (pod memory limit is 2 Gi). Unfortunately I haven't found an easy way so far to log gen2 collection events.
do any of the options listed here work for you?
@Maoni0 will those traces help to analyse the issue? In this case we can try to get them. We were playing around with https://docs.microsoft.com/de-de/dotnet/core/diagnostics/dotnet-trace but did not really know what to do with the outcome.
@Kiechlus yes, this is always the first step for diagnosing memory perf problems.
Hi @Maoni0, please find attached the trace from our cluster: trace-cluster2.zip
Hi, my team is struggling with a similar issue; we're currently working on gathering some traces from our app. The premise, however, is exactly the same - we're uploading a large file into Azure Blob Storage and, while on local dev env everything works fine and after some time there's full GC invoked, on our k8s cluster we get frequent OOMs. We set workstation GC mode for this app and tried to tinker with LatencyMode and LOH Compaction Mode but, alas, with no luck. Currently I'm planning on investigating our code since I suspect the issue originates there, but maybe you have some insights. @Kiechlus can you share if you managed to fix the issue?
Can somebody provide some sample code for what the download or upload looks like? There are some known issues with ASP.NET Core's memory pool not releasing memory that might be the case here, but it's possible that the code could be tweaked to avoid the memory bloat in the first place.
Hi, @L-Dogg we are still facing this issue. @davidfowl, this is internal code in one service and some libs, I cannot just share it. Could you give us some hints what to look for? I can then share relevant parts of it for sure.
Thanks for your replies. We're trying to prepare a minimal example, I just hope such an example will be enough to reproduce this behaviour.
@Kiechlus which version of Azure.Storage.Blobs do you use?
@Kiechlus can you collect a gc-verbose trace to see where your allocations are coming from? Are the connections HTTPS connections?
@Kiechlus somehow I missed this issue...sorry about that. I just took a look at the trace you collected. It does release memory; if you open the trace in PerfView and open the GCStats view you'll see this
At GC#10, the memory usage went from 831 MB to 9.8 MB, but there are allocations on the LOH again which made the memory go up again. What would be your desired behavior? You are not under high memory pressure, so the GC doesn't need to aggressively shrink the heap size.
@Kiechlus It seems like you're churning the LOH, why is that? Are you using Streams or are you allocating large buffers?
@L-Dogg
we're uploading a large file into Azure Blob Storage and, while on local dev env everything works fine and after some time there's full GC invoked, on our k8s cluster we get frequent OOMs.
Are you using streams or are you allocating big arrays? Also are you using IFormFile or are you using the MultipartReader?
@davidfowl We are allocating the stream like this: var targetStream = new MemoryStream(fileLength);, where fileLength can be several GB. We found that if we create the stream without an initial capacity, it will in the end allocate much more memory than the actual file size.
@Maoni0 You are right, we ran into OOM only on very few occasions. So memory is freed under pressure. But what we would need is that it is freed immediately after the controller returns and the stream is deallocated.
Because the needs in a Kubernetes cluster are different. There are e.g. three big machines and Kubernetes schedules many pods on them. Based on different metrics it creates or destroys pods (horizontal pod autoscaling) @egorchabala.
If now some pod does not free memory even though it could, that means Kubernetes cannot use that memory for scheduling other pods and the autoscaling does not work. Also the memory monitoring is more difficult.
Is there any possibility to make the GC release memory immediately, as soon as it is possible, even though there is no high pressure? Do you still need a different trace or something of the like?
@L-Dogg we are currently using version 11.2.2.
@davidfowl We are allocating the stream like this: var targetStream = new MemoryStream(fileLength);, where fileLength can be several GB. We found that if we create the stream without an initial capacity, it will in the end allocate much more memory than the actual file size.
Don't do this. This is the source of your problems. Don't buffer a gig in memory. Why aren't you streaming?
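For context, a small sketch (not code from this thread) of why an unsized MemoryStream ends up costing more than the data written to it: when a download arrives in chunks, the backing array grows by doubling, so the final Capacity overshoots the payload and every resize allocates and copies a new large buffer, most of them on the LOH.

```csharp
using System;
using System.IO;

var chunk = new byte[81920];             // typical copy-buffer size for a streamed download
long total = 300L * 1024 * 1024;         // pretend ~300 MB arrives in chunks

using var unsized = new MemoryStream();
for (long written = 0; written < total; written += chunk.Length)
    unsized.Write(chunk, 0, chunk.Length);
// Capacity ends above the data size (~320 MB here for 300 MB of data), and the doublings
// along the way allocated and discarded a chain of intermediate LOH buffers that add up
// to roughly the data size again.
Console.WriteLine($"unsized capacity:  {unsized.Capacity / (1024 * 1024)} MB");

using var presized = new MemoryStream((int)total);
for (long written = 0; written < total; written += chunk.Length)
    presized.Write(chunk, 0, chunk.Length);
// Pre-sizing allocates exactly once, but it is still one huge LOH buffer per request.
Console.WriteLine($"presized capacity: {presized.Capacity / (1024 * 1024)} MB");
```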
@davidfowl We are using this client-side encryption: https://docs.microsoft.com/de-de/azure/storage/common/storage-client-side-encryption?tabs=dotnet. We never figured out how to stream the HTTP response to the (browser) client. Is it possible? Does it need adaptations on the browser clients? They are not controlled by us.
We never figured out how to stream the HTTP response to the (browser) client. Is it possible? Does it need adaptations on the browser clients? They are not controlled by us.
Without seeing any code snippets, it's obviously harder to recommend something, but I would imagine you have something like this:
See this last step:
// Download and decrypt the encrypted contents from the blob.
MemoryStream targetStream = new MemoryStream(fileLength);
blob.DownloadTo(targetStream);
Then something is copying the targetStream to the HttpResponse? If so, avoid the temporary stream and just copy it to the response directly.
Hi @davidfowl, thanks for your reply! Yes, we have exactly the code you described in a blobstore-related library.
This is consumed by the service, goes through some layers, and in the end in the Controller it is:
var result = new FileStreamResult(serviceResponse.Content.DownloadStream, System.Net.Mime.MediaTypeNames.Application.Octet);
result.FileDownloadName = serviceResponse.Content.Filename;
return result;
I'm still not sure how to avoid writing it to some temporary stream, but it would be great if we could solve it.
Where is the temporary stream doing all the buffering?
@davidfowl Do you mean this?
BlobRequestOptions optionsWithRetryPolicy = new BlobRequestOptions
{
EncryptionPolicy = encryptionPolicy,
RetryPolicy = new Microsoft.Azure.Storage.RetryPolicies.LinearRetry(
TimeSpan.FromSeconds(errorRetryTime),
errorRetryAttempts),
StoreBlobContentMD5 = true,
};
StorageCredentials credentials = new StorageCredentials(accountName, accountKey);
BlobClient = new CloudBlobClientWrapper(new Uri(storageUrl), credentials, optionsWithRetryPolicy);
...
var container = BlobClient.GetContainerReference(...);
CloudBlockBlob destBlob = container.GetBlockBlobReference(blobId);
var targetStream = new MemoryStream(fileLength);
var downloadTask = destBlob.DownloadToStreamAsync(
target: targetStream,
accessCondition: null,
options: null,
operationContext: new OperationContext(),
cancellationToken: cancellationToken);
return await ReadWriteRetryPolicy.ExecuteAsync(
(context, token) => downloadTask,
pollyContext,
cancellationToken);
Why isn't this code taking in the target Stream?
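One way to read that suggestion, sketched under the assumption that the surrounding members (BlobClient, ReadWriteRetryPolicy, pollyContext) are the ones from the snippet above and that ContainerName is a placeholder: let the caller hand in the destination stream instead of the library creating a MemoryStream, and create the download task inside the retry delegate so a retry actually restarts the download.

```csharp
// Sketch only, not the author's library code.
public async Task DownloadToAsync(string blobId, Stream target, CancellationToken cancellationToken)
{
    var container = BlobClient.GetContainerReference(ContainerName);
    CloudBlockBlob destBlob = container.GetBlockBlobReference(blobId);

    await ReadWriteRetryPolicy.ExecuteAsync(
        (context, token) => destBlob.DownloadToStreamAsync(
            target: target,                      // caller-provided stream, e.g. Response.Body
            accessCondition: null,
            options: null,
            operationContext: new OperationContext(),
            cancellationToken: token),
        pollyContext,
        cancellationToken);
}
```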
@davidfowl Do you mean in the Controller we should do something like this?
MemoryStream targetStream = new MemoryStream(fileLength);
blobLib.DownloadTo(targetStream);
return new FileStreamResult(targetStream, System.Net.Mime.MediaTypeNames.Application.Octet);
If this makes a difference we will for sure try.
The controller should look like this:
Response.ContentLength = fileLength;
await blobLib.DownloadToAsync(Response.Body);
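Spelled out a little more, the streaming controller action might look like the sketch below; names such as _blobLib, GetMetadataAsync and DownloadToAsync are placeholders rather than the actual API in this thread. Nothing here buffers the whole file.

```csharp
[HttpGet("{blobId}")]
public async Task DownloadAsync(string blobId, CancellationToken cancellationToken)
{
    // Hypothetical helper returning the stored length and file name of the blob.
    var (fileLength, fileName) = await _blobLib.GetMetadataAsync(blobId, cancellationToken);

    Response.ContentType = System.Net.Mime.MediaTypeNames.Application.Octet;
    Response.ContentLength = fileLength;
    Response.Headers["Content-Disposition"] = $"attachment; filename=\"{fileName}\"";

    // Decrypts and writes directly into the response stream instead of a MemoryStream.
    await _blobLib.DownloadToAsync(blobId, Response.Body, cancellationToken);
}
```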
Thanks for your help, we will try this out.
If now some pod does not free memory even though it could, that means Kubernetes cannot use that memory for scheduling other pods and the autoscaling does not work. Also the memory monitoring is more difficult.
this is the part I don't understand, perhaps you could help me. Do you know how much memory each pod is supposed to get? Imagine the GC did kick in right after your large memory usage and Kubernetes packed more pods onto the same machine, but now your process needs to allocate a similar amount of memory again and it can't. Is this the desired behavior? Is it an opportunistic thing? If you get an OOM in one of your processes, do you treat it as a normal problem and just restart it?
Hi @Maoni0 I am not an expert with this but I can try to summarize the issues our Ops team reported.
In Kubernetes each pod defines a CPU/memory request and a CPU/memory limit [1]. When scheduling a pod, the scheduler reserves the requested resources.
However, a pod can consume up to its limit before it sees an OOM. If now many pods do not free memory, as we have seen in our case, the scheduler has fewer total resources across all machines to schedule new pods that are e.g. needed for horizontal pod autoscaling.
Additionally, the monitored memory does not reflect the memory actually needed, as much of it could be freed by e.g. a full garbage collection.
@davidfowl This is to confirm that your solution works like a charm. Many thanks again.
[1] https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
Great!
@Kiechlus thanks for your explanation, I was aware of the request/limit concepts and was really wondering how it plays out in practice. My question was around the existing processes getting OOMs. Imagine this scenario: each pod specifies a request of 100 MB and a limit of 200 MB. You have 1 GB of memory in total and 5 pods already running, each currently using 100 MB. The scheduler schedules another 5 pods on this machine. Now pod#0 needs to use 150 MB and it gets an OOM, even though it has not reached its limit. How do you handle this scenario?
@Maoni0 The difference between request and limit is effectively the degree to which you are willing to allow overscheduling.
In your example, pod#0 would get evicted/OOM-killed for having any value > 100 MB once the machine gets oversubscribed.
If your application can't handle working correctly (if perhaps more slowly) with only 100 MB, then it needs to have a higher 'request'.
Request == guaranteed; Limit == optimistic/overcommitted.
Pods that are > request are always at risk of being killed and restarted for out of memory.
@Kiechlus can you clarify what problems you are seeing with horizontal pod scheduling? The scheduler should only be considering sum(requested resources), not sum(limit resources) when scheduling.
If you are using Vertical AutoScaling, then I can understand the issue, since the vertical autoscaler might dynamically reduce the 'request' value for a Pod, which would make room for additional pods to schedule.
But if you're not using vertical auto-scaling then request and limit are fixed values, and only request (which is guaranteed) should affect scheduling.
@brendandburns that makes sense. thanks!
Kindly forwarding to @egorchabala
@brendandburns not sure I understand this correctly:
@Kiechlus can you clarify what problems you are seeing with horizontal pod scheduling? The scheduler should only be considering sum(requested resources), not sum(limit resources) when scheduling.
The issue we have with the Kubernetes HPA is that if we are not freeing memory, the HPA will try to keep the number of our pods at the maximum level and will never scale them down.
We're also seeing similar issues with Linux pods holding on to memory but haven't had a chance to dig in very much. We have a very bursty workload where memory requirements spike up for pods for short periods of time and then are idle or lower most of the time. We have the k8s memory request set to a normal usage ceiling and the limit as the upper boundary for spikes. The pods hold on to 200-300% of their request memory after the spikes, even though when running locally and profiling in VS the GC reduces this ~90% after the peak (e.g. the k8s pod sits at ~800 MB used when idle, locally at <200 MB idle). This could be some sort of Linux-specific memory leak (having issues getting dumps from k8s), but subsequent workloads don't increase peak memory use, so it would have to be some sort of startup leak.
I've played with some of the GC env vars, which don't seem to have much effect and say there could be performance impacts. Ideally we wouldn't want any perf hits while under CPU load; I just want memory to return to baseline after the peak like it does locally, so that Kubernetes can make smart decisions regarding pod allocations etc.
We have a very bursty workload where memory requirements spike up for pods for short periods of time and then are idle or lower most of the time.
I am working on an API proposal that might help with this situation. If you know you are going into idle mode, you can call that API and have the GC release the memory for you.
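The API is not named in this thread; as an assumption, on .NET 7 and later a comparable effect can be approximated today with an aggressive, compacting gen2 collection triggered at a known idle point:

```csharp
using System;
using System.Runtime;

static class IdleMemoryTrimmer
{
    // Call when the service knows a burst has ended. GCCollectionMode.Aggressive (.NET 7+)
    // asks the GC to decommit as much memory as possible and return it to the OS.
    // It is expensive, so only use it at known idle points, never on a timer under load.
    public static void TrimAfterBurst()
    {
        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect(GC.MaxGeneration, GCCollectionMode.Aggressive, blocking: true, compacting: true);
    }
}
```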
@plaisted We are facing a similar issue. After spikes the memory stays high, so the HPA doesn't scale down the pods. Did you manage to get the memory returned to the OS, either with the GC env vars or with GC workstation mode? Did any of these work?
@dotnet/gc
I am returning to this one because we have issues in our k8s cluster and we are wasting a ton of resources. We have a simple API allocating 6 strings of 10 KB each. GC is in workstation mode.
var s1 = new string('z', 10 * 1024);
We hit the API with 600 rpm for 2 minutes. Memory goes up to 230 MB from 50 MB initially and stays there forever.
When I run it locally, the GC is called and memory is released fine, as you can see in the dotMemory screenshot. I am trying to figure out if it is an issue with the GC or something else.
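For reference, a minimal reconstruction of the kind of endpoint being described (an assumption, not the actual code):

```csharp
// Minimal ASP.NET Core app whose endpoint allocates six ~10 KB strings per request.
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.MapGet("/allocate", () =>
{
    var strings = new string[6];
    for (var i = 0; i < strings.Length; i++)
        strings[i] = new string('z', 10 * 1024); // 10,240 chars ≈ 20 KB each (UTF-16), well below the LOH threshold

    return Results.Ok(strings[0].Length);
});

app.Run();
```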
We have the same issue on Kubernetes. Our apps don't release memory, and they can't scale down. It is not an issue on Windows; it only happens in Kubernetes with Linux containers. I saw other posts about OOM and memory issues with .NET 7, and as far as I saw some fixes were released in .NET 7.0.3. I've tried .NET 7.0.3; it helped a little, not much. So I've decided to implement a Prometheus metric which tracks the percentage of heap over the physical memory limit of the pod, so we can feed this metric to the HPA for scale up/down scenarios as a workaround. Now we can scale up/down by this percentage metric, but all of our apps still use lots of resources. For instance, an app has a 400 MB heap and it still uses 11 GB of memory in Kubernetes, and it doesn't release the memory, and the dump shows a large amount of 'Free' memory. I've tried many things described here (https://github.com/dotnet/runtime/discussions/80854), but didn't find any permanent solution.
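A sketch of how such a gauge could be computed (an assumption, not oruchreis's actual implementation): GC.GetGCMemoryInfo() exposes both the managed heap size and the memory limit the GC has derived for the container.

```csharp
using System;

static class GcHeapMetrics
{
    // Percentage of the container memory limit occupied by the managed heap as of the
    // last GC; suitable for exporting as a gauge (e.g. to Prometheus) and feeding an HPA.
    public static double HeapOverLimitPercent()
    {
        GCMemoryInfo info = GC.GetGCMemoryInfo();
        // TotalAvailableMemoryBytes reflects the limit the GC sees (container memory limit
        // or an explicit GCHeapHardLimit setting, when one is in effect).
        return 100.0 * info.HeapSizeBytes / info.TotalAvailableMemoryBytes;
    }
}
```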
hi @PanosKof, when you said
When I run it locally, the GC is called and memory is released fine, as you can see in the dotMemory screenshot. I am trying to figure out if it is an issue with the GC or something else.
by "the GC is called" did you mean you are inducing GCs yourself or GCs are happening on its own?
hi @oruchreis, are you observing this issue with .NET 7 but not with older versions of .NET?
@PanosKof and @oruchreis, would it be possible to capture a top-level GC trace to help with the diagnosis? This is the very first step in diagnosing a memory problem. It's described here. It's very low overhead, so you can keep it on for a long time. If this problem shows up pretty quickly, you could start capturing right before the process is started and terminate tracing when it has exhibited the "memory not being released and the heap size is too large" behavior.
However, the method I described in the doc may not be applicable in your environment. If you cannot actually execute the dotnet trace command, you'll need to use dotnet monitor to capture such a trace. This page describes how to enable it with a sidecar container so you can get a trace. If you expand the .NET Monitor 7+ arrow you will see the container spec for monitoring. The pertinent part is
- name: monitor
image: mcr.microsoft.com/dotnet/monitor
# DO NOT use the --no-auth argument for deployments in production; this argument is used for demonstration
# purposes only in this example. Please continue reading after this example for further details.
args: [ "collect", "--no-auth" ]
imagePullPolicy: Always
env:
- name: DOTNETMONITOR_DiagnosticPort__ConnectionMode
value: Listen
- name: DOTNETMONITOR_Storage__DefaultSharedPath
value: /diag
# ALWAYS use the HTTPS form of the URL for deployments in production; the removal of HTTPS is done for
# demonstration purposes only in this example. Please continue reading after this example for further details.
- name: DOTNETMONITOR_Urls
value: http://localhost:52323
you can change the port if you need to.
and then you can issue a post command to get a trace, here's an example (obviously you'd need to change the Authorization part) -
POST /trace?pid=21632&durationSeconds=60 HTTP/1.1
Host: localhost:52323
Authorization: Bearer fffffffffffffffffffffffffffffffffffffffffff=
Content-Type: application/json
{
"Providers": [{
"Name": "Microsoft-Windows-DotNETRuntime",
"EventLevel": "Informational",
"Keywords": "0x1"
},{
"Name": "Microsoft-Windows-DotNETRuntimePrivate",
"EventLevel": "Informational",
"Keywords": "0x1"
}],
"BufferSizeInMB": 1024
}
more detail is at https://github.com/dotnet/dotnet-monitor/blob/main/documentation/api/trace-custom.md#examples.
as you can see, this is not exactly straightforward to set up. @hoyosjs helped me to contact the dotnet monitor team to hopefully make this easier.
if you have dumps you could share, I'd be happy to take a look. Usually dumps aren't great for this kind of analysis, but if that happens to be all that's available, we'll try to get the most out of it. We are aware that there's a known issue in .NET 7 if you are basically only doing BGCs, but since I don't know what kind of GCs you are doing (that's one of the many pieces of info that a trace will give us), I can't say if that's what you are hitting. I know that in @oruchreis's case, since they already mentioned they are seeing a lot of objects with the type Free (I think this is what they meant, let me know if I interpreted it wrong), they are not hitting that issue.
Hello @Maoni0, thanks for looking into this. As you mentioned in the document, I've collected 1800 seconds of trace from the start of the application (here). I don't know if it matters, but the application is a CoreWCF project, and every 10 minutes if the application is idle, and sometimes when the memory peaks, it manually calls GC.Collect in Aggressive mode. Although manually triggering GC.Collect doesn't seem to work well in Kubernetes environments because of this issue, in Windows environments it helps to reduce the overall memory usage. When the trace collection finished, the memory size of the application was showing about 20 GB on the Kubernetes dashboard and the total heap size was about 500 MB.
hi @oruchreis, thank you very much for the trace. I just took a look. This does not show the symptom you described above -
the dump shows a large amount of 'Free' memory.
if you captured a dump, most of the 20 GB would not be in "Free" memory (with the same assumption that by that you meant objects whose type is marked as Free), because the max heap size is only 1.9 GB.
so there are 2 possibilities -
1) most of the memory is indeed retained by the GC but not in use, i.e., it does not belong to any generation and therefore does not show up in GCStats. It would definitely be nice if we exposed this info in GCStats, but before we do that, you could diagnose this by looking at the dump you captured. There are symbols you can look at in libcoreclr that give you this info. Is this something doable? I usually look at Linux dumps in windbg on Windows. This comment shows how to dump one of the sources of this retention. If this is doable, I can give you a full list of symbols to look at.
2) if 1) shows there's not much memory retained by the GC, it means the memory is used by something else.
Hi @Maoni0, I've taken a lot of dumps and analyzed them before, and likewise the amount of Free objects is huge, but I would be happy to take another dump and look at the symbols you give me to solve this problem. Generally what I see in k8s Linux dumps is that if the memory is 20 GB, the total heap size is about 400-500 MB and the amount of Free objects is about 19 GB. But I don't see this in Windows dumps with the same code. If the heap is 400-500 MB, the memory usage is at most 2-3 GB. This is the behavior of two different environments with the same code. We also see the same behavior in our other applications, but especially in this application I sent the trace of, we see this difference more, because it uses a lot of Roslyn CSharpScript and a lot of dynamic assemblies are created because of the CSharpScripts. I checked them with the help of this document and they are unloading properly, and there aren't any LoaderAllocators connected to these scripts. Also there are native interop operations common to all applications, and we have native libraries that use at most 300-400 MB of memory.
As I said, the interesting part is the significant difference in memory usage between the application with the same code in the Windows environment and in the Linux Kubernetes container.
As a side note, when I use mimalloc, jemalloc or tcmalloc with LD_PRELOAD, the memory usage drops by half, from 20 GB to about 10-11 GB.
If you send me the symbols I would be willing to look at them in the dump. I can even share the dump file privately if you request.
Thanks again.
if the memory is 20 GB, the total heap size is about 400-500 MB and the amount of Free objects is about 19 GB.
could you please clarify exactly what you mean by this? A Free object is part of the heap, so if the Free objects add up to about 19 GB, your heap size will be at least that. When I say Free object, I mean objects whose type is Free (in other words, if you do !dumpheap -type Free you will see those objects). But that's clearly not what you meant by Free objects.
This issue has been marked needs-author-action and may be missing some important information.
Hi @Maoni0, yes, my mistake: by "Free objects" I meant the amount that is not actually used in memory and needs to be released by the GC. For example, the last dump I got is about 18 GB, and the heap size is only about 400 MB. When I look at it with WinDbg, the command !dumpheap -type Free gives the following output:
Count TotalSize Class Name
345354 8841424 Free
So if I'm not mistaken, it shows only about 8 MB of Free objects. As far as I know, the GC does not release all the unused space back to the operating system. So in this case, while the private memory is normally about 1.5 GB for this heap size in the Windows environment, the memory used by the pod in the Kubernetes Linux container environment seems to be 18 GB. The interesting thing is, as I said before, if I use a memory manager like mimalloc with LD_PRELOAD, this amount is halved. But in any case I don't know why the memory usage is so high. This only happens in Kubernetes in a Linux container environment; we don't have this problem on Windows.
If there was a memory leak, wouldn't the heap size be closer to 18 GB? And wouldn't that show up in all the dumps I get? But as far as I can see, I see a maximum heap size of 1.5 GB in the dumps. Somehow the GC is not releasing the unused amount back to the OS. If I set the pod's memory limit to 20 GB, after a certain time the memory used by the pod reaches 20 GB, but the heap size is 2 or 3% of that. The size of the heap increases during load but then decreases again when the load is gone. We don't encounter OOM at this stage, but the total memory size doesn't decrease either and stays close to the pod's memory limit, in this example at 20 GB. This prevents us from scaling down our application. It also affects resource utilization very badly.
Our prod applications are hosted on Azure AKS, but we have custom k8s installations in our dev and test environments and the situation is the same there.
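One quick way to narrow down where the missing gigabytes live (a sketch assuming .NET 6+ for GCMemoryInfo.TotalCommittedBytes): compare the managed heap, what the GC has committed, and the process working set.

```csharp
using System;

GCMemoryInfo info = GC.GetGCMemoryInfo();
Console.WriteLine($"Managed heap (last GC): {info.HeapSizeBytes / (1024 * 1024)} MB");
Console.WriteLine($"GC committed bytes:     {info.TotalCommittedBytes / (1024 * 1024)} MB");
Console.WriteLine($"Process working set:    {Environment.WorkingSet / (1024 * 1024)} MB");
// If the working set is far above the GC committed bytes, the extra memory lives outside
// the GC heap (native allocations, interop libraries, allocator caches), which would also
// explain why swapping the native allocator via LD_PRELOAD changes the number so much.
```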
I never see pod memory going down either; it simply keeps increasing. My node memory working set percentage is around 120%.
Monitoring the GC needs to be easier. Can we not simply log this: "GC ran and successfully freed xxx bytes of memory"?
hi @oruchreis, is it possible to share a dump? If not, do you think you could look at the symbol I mentioned in this comment to see if the memory is used by the GC but not on the heap? There's another data structure that you can check; let me know if you can do this at all and I can give you the name of the other one. We are putting in diagnostic info for this so it will be a lot easier to get this info.
The fact that changing the native allocator changes your memory usage by so much makes it very unlikely the memory is used by the GC - the GC certainly does not use malloc to allocate memory for the GC heap. We do use new here and there to allocate some data structures, but they can't amount to so much memory unless something is seriously wrong (and this would be the first time it happened).
@Maoni0 Sorry for my late response. The GC was called on its own in the local environment. I realized that we weren't defining memory limits in the pods, so the 75% heap size limit wasn't applied. After defining memory limits, the memory in the pods doesn't go crazy and stays within the limits with no performance degradation. We have also defined autoscaling at 85% to allow scale-down. However, even when there is only a small number of requests, the pod memory stays stuck close to the limits. I will try to capture a top-level GC trace.
hi @PanosKof, a top-level GC trace would be great. However, I should point out that currently we do not have logic to read the request value that you specify (so we don't react to it - this is something we're working on to address). But I'd still be interested to see a trace so I can perhaps suggest something you can do in the meantime.
Description
After a generation 2 GC collection, the smaller generations do not release memory.
Configuration
htop inside container: (screenshot attached)
Other information