Running multiple .net core containers on a single node

Vandersteen commented 5 months ago

Hello all,

I'm looking for guidance on best practices for running multiple .net core apps on a single node.

Base setup:

We have multiple .net core apps running on AKS. Most, if not all, of these apps have rather low, bursty traffic (from 0 to a little). However, we are encountering a whole range of timeout-related issues:

Azure SQL server timeouts
Azure MySQL server timeouts
Azure Blob Storage timeouts
Azure Service Bus timeouts
HttpClient timeouts
...

All of these are very intermittent and hard to reproduce. No unusual "large load" can be detected in the Azure portal (DTU / CPU / ...) for the Azure resources. We've opened multiple tickets in the past to have Azure support investigate, but no clear reason for the timeouts was ever found.

We've put retry policies in place to alleviate these timeouts as much as possible, but I'm starting to think that our setup might be the reason this is happening.

As most of these apps have very low "traffic", we run them with low requests / limits:

Memory: 256M requests & limits
CPU:
- Due to the "bursty" nature, no limits have been set up, so the apps can use all of the node's CPUs when needed. This is to avoid CPU throttling on the AKS / k8s side, as throttling makes the timeouts much worse.
- In our monitoring, we can see that most apps never exceed 100m CPU usage on average. The nodes average around 400m CPU usage, with some spikes to 1 CPU. They rarely, if ever, use the full 2 CPUs available.

We are using nodes with 8Gi of memory and 2 CPUs, which generally means we have ~10 containers running on a single node.
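
For reference, the resource settings described above would look roughly like the following in a pod spec. This is only a sketch: the container name, image, and the CPU request value are assumptions for illustration and are not taken from the actual deployment.

```yaml
# Sketch of one container's resources. Name, image, and CPU request are hypothetical.
containers:
  - name: example-app                                # hypothetical name
    image: example.azurecr.io/example-app:latest     # hypothetical image
    resources:
      requests:
        memory: "256M"
        cpu: "100m"        # assumed value; the CPU request is not stated above
      limits:
        memory: "256M"     # memory request and limit are equal
        # No cpu limit is set, so the container can burst up to all CPUs of the node.
```

One thing to keep in mind with this shape: with no CPU limit, each container is not restricted to a share of the node, so all ~10 runtimes on the node compete for the same 2 CPUs when they burst at the same time.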

After reading this here:

If you're running hundreds of instances of an application, consider using workstation garbage collection with concurrent garbage collection disabled. This will result in less context switching, which can improve performance.

Server garbage collection can be resource-intensive. For example, imagine that there are 12 processes that use server GC running on a computer that has four logical CPUs. If all the processes happen to collect garbage at the same time, they would interfere with each other, as there would be 12 threads scheduled on the same logical CPU. If the processes are active, it's not a good idea to have them all use server GC.

I'm starting to wonder whether the timeouts could be related. Could we be hitting an issue similar to the one described above, and should we run our containers with workstation GC and concurrent garbage collection disabled? Could the GC of one app affect other apps running on the same node (and cause timeouts / blocking / ...)?
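
To try this out, the two GC settings in question can be set through environment variables on the container. A minimal sketch, assuming .NET 5 or later (earlier .NET Core versions use the `COMPlus_` prefix instead of `DOTNET_`); the container name is hypothetical:

```yaml
# Sketch: switch one container to workstation, non-concurrent GC.
containers:
  - name: example-app          # hypothetical name
    env:
      - name: DOTNET_gcServer
        value: "0"             # 0 = workstation GC, 1 = server GC
      - name: DOTNET_gcConcurrent
        value: "0"             # 0 = disable concurrent (background) GC
```

The same settings can alternatively be baked into the app through `System.GC.Server` and `System.GC.Concurrent` in `runtimeconfig.json`. Using environment variables keeps the experiment limited to the deployment manifest, so it can be rolled out to one application at a time and reverted without a rebuild.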

mangod9 commented 5 months ago

Hello @Vandersteen, have you correlated the timeouts to memory utilization? Also, by "single node" do you mean the same container instance or the same pod? Usually the recommendation would be that each service runs in its own container with some resource constraints.

Vandersteen commented 5 months ago

We have many different applications

Each of these applications is run in a separate pod (1 container per pod)

There are multiple instances of each pod spread across multiple nodes

Generally there are ~10 different pods (applications) per node (we have constraints to avoid putting the same application's pods on the same node)

Vandersteen commented 5 months ago

Have you correlated the timeouts to memory utilization?

Not yet no, I'll try and see if I can find something.

We've been having these kinds of issues since .net core 2.2, and over the last few years we have baked in 'retry policies' and accepted our fate on a lot of these timeouts.

dotnet / runtime

Running multiple .net core containers on a single node #100521