dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Running multiple .net core containers on a single node #100521

Open Vandersteen opened 5 months ago

Vandersteen commented 5 months ago

Hello all,

I'm looking for guidance on best practices for running multiple .NET Core apps on a single node.

Base setup:

We have multiple .NET Core apps running on AKS. Most, if not all, of these apps have low, bursty traffic (from zero to a little). However, we are encountering a whole range of timeout-related issues.

All of these are very intermittent and hard to reproduce. No unusual load can be detected in the Azure portal (DTU / CPU / ...) for the Azure resources. We've opened multiple tickets in the past to have Azure support investigate, without any clear explanation of why these timeouts happen.

We've put retry policies in place to alleviate these timeouts as much as possible, but I'm starting to think that our setup might be the reason this is happening.

As most of these apps have very low traffic, we run these instances with low resource requests / limits.

We are using nodes with 8Gi of memory and 2 CPUs, which means we generally have ~10 containers running on a single node.

After reading this here:

If you're running hundreds of instances of an application, consider using workstation garbage collection with concurrent garbage collection disabled. This will result in less context switching, which can improve performance.

Server garbage collection can be resource-intensive. For example, imagine that there are 12 processes that use server GC running on a computer that has four logical CPUs. If all the processes happen to collect garbage at the same time, they would interfere with each other, as there would be 12 threads scheduled on the same logical CPU. If the processes are active, it's not a good idea to have them all use server GC.

I'm starting to wonder if the timeouts could be related. Could it be that we are hitting an issue similar to the one described above, and that we should run our containers with workstation GC and concurrent garbage collection disabled? Could the GC of one app affect other apps running on the same node (and cause timeouts / blocking / ...)?
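
For reference, the change being considered could be sketched as the following runtimeconfig.template.json fragment, placed next to the project file. This is a minimal sketch that assumes the documented `System.GC.Server` and `System.GC.Concurrent` runtime settings. Whether it actually reduces the timeouts described here is untested.

```json
{
  "configProperties": {
    "System.GC.Server": false,
    "System.GC.Concurrent": false
  }
}
```

The same settings can also be supplied per container through the `DOTNET_gcServer=0` and `DOTNET_gcConcurrent=0` environment variables (the `COMPlus_` prefix on older runtimes), which avoids rebuilding the image to try the change on a single deployment.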

mangod9 commented 5 months ago

Hello @Vandersteen, have you correlated the timeouts to memory utilization? Also, by "single node" do you mean the same container instance or the same pod? Usually the recommendation would be that each service runs in its own container with some resource constraints.

Vandersteen commented 5 months ago

We have many different applications

Each of these applications runs in a separate pod (1 container per pod)

There are multiple instances of each pod spread across multiple nodes

Generally there are ~10 different pods (applications) per node (we have constraints to avoid putting the same application's pods on the same node)

Vandersteen commented 5 months ago

Have you correlated the timeouts to memory utilization?

Not yet, no. I'll try and see if I can find something.

We've been having these kinds of issues since .NET Core 2.2, and over the last few years we have baked in 'retry policies' and accepted our fate on a lot of these timeouts.