Open Vandersteen opened 5 months ago
Hello @Vandersteen, have you correlated the timeouts with memory utilization? Also, by "single node" do you mean the same container instance or the same pod? The usual recommendation is that each service runs in its own container with resource constraints.
We have many different applications
Each of these applications runs in a separate pod (1 container per pod)
There are multiple instances of each pod spread across multiple nodes
Generally there are ~10 different pods (applications) per node (we have constraints to avoid putting the same application's pods on the same node)
Have you correlated the timeouts to memory utilization?
Not yet, no. I'll try to see if I can find something.
We've been having these kinds of issues since .NET Core 2.2, and over the last few years we have baked in 'retry policies' and accepted our fate on a lot of these timeouts.
Hello all,
I'm looking for guidance on best practices when running multiple .NET Core apps on a single node.
Base setup:
We have multiple .NET Core apps running on AKS. Most, if not all, of these apps have rather low, bursty traffic (from 0 to a little). However, we are encountering a whole range of timeout-related issues.
All of these are very intermittent and hard to reproduce. No unusually large load can be detected in the Azure portal (DTU / CPU / ...) for the Azure resources. We've opened multiple tickets in the past to have Azure support investigate, without finding any clear reason why these happen.
We've put retry policies in place to alleviate these timeouts as much as possible, but I'm starting to think that our setup might be the reason this is happening.
As most of these apps have very low "traffic", we run these instances with low requests / limits.
We are using nodes with 8Gi of memory and 2 CPUs, which means we generally have ~10 containers running on a single node.
After reading this here:
I'm starting to wonder if the timeouts could be related. Could it be that we are hitting a similar issue to the one described above, and that we should run our containers with Workstation GC and disable concurrent garbage collection? Could the GC of one app affect other apps running on the same node (and cause timeouts / blocking / ...)?
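For reference, this is a minimal sketch of what that change might look like, assuming the settings are applied per app through a `runtimeconfig.template.json` file in the project directory. Whether this actually helps with the timeouts is exactly the open question, so treat it as an experiment rather than a recommendation.

```json
{
  "configProperties": {
    "System.GC.Server": false,
    "System.GC.Concurrent": false
  }
}
```

The same two settings can also be set without rebuilding, through the environment variables `DOTNET_gcServer=0` and `DOTNET_gcConcurrent=0` on the container (on runtimes older than .NET 5 the prefix is `COMPlus_` instead of `DOTNET_`).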