@shreyas-s-rao commented out-of-band on Slack (I contacted him first and created this BLI proposal afterwards):
Isn’t HVPA being dropped for etcd?
Yes, the plan is to generally drop it, but when? Until then, we could run more efficiently.
Bringing minAllowed down for the etcds is ok, but only as long as it doesn’t cause too many restarts when the etcd pods start up.
Yes, I know that this is one reason why we have it, but the inefficiencies are high. I checked the usage and my recommendations are based on it.
Having said that, when the pods need to grow vertically through multiple OOMs, it will take longer, but then again, that's the same for GKE and others. Furthermore, even though we have bin-packing, there is often more than enough headroom on a node (after all, we have no pod limits/guaranteed QoS, so the node is the limit) and OOMs won't happen. Especially not with the safe-to-evict annotations, but even without them I doubt this will be an issue.
Hibernated shoots won't be affected either.
The backup-restore container does more than backup and restore at the moment - it does data validation as well.
Yes, I know that there is the initial handshake and validation, but my thinking is:
- How were 23m and 128Mi chosen? How are these numbers for CPU and memory required and sufficient (also for large clusters)?
- 23m and 128Mi are magic numbers that won't hold true in all cases (we do need a lot less and will probably need a lot more for larger clusters), but the validation load happens only at startup and not continuously. These are tiny amounts, CPU is even compressible, and nodes will have the headroom for that.
It makes sense from a cost perspective to actually run all operations within the etcd container itself, so that the same resources that are used for DB validation and restoration can then continue to be used for the etcd process as well, thereby saving costs on the backup-restore container.
Yes, long-term that would be great. We didn't go there because of build-time reasons only, if I remember correctly. However, these are larger changes for later, I guess.
So coming back to your question, it isn’t currently easy to reduce requests for the backup sidecar.
I wouldn't agree - do we have tickets indicating side-car container issues? We are just careful and would like "to avoid touching a running system", but are these numbers not waste and how are they sufficient for large clusters then?
If it needs the resources, it needs the resources.
But that's not how it's set up. It's all static - is validation and restoration really static at 23m and 128Mi? And as I said, the buffer is the node.
And currently we have a problem that if a full DB validation or restoration is required, then the backup sidecar needs more resources, so it would frequently restart as the VPA recommendation slowly increases for it, and finally stabilizes at some point when restoration finally succeeds.
So, you are saying that 23m and 128Mi are sufficient for large ETCDs? How did you pick these numbers?
Furthermore, isn't it the case that while the validation happens, ETCD itself is on hold (when the pod has just started, it hasn't even consumed any relevant resources yet)? That means the cgroup isn't burdened by the usage of the main container. The main container is always idle in this phase, so CPU is plentiful/must be available, because the main container got it and isn't using it during validation at all. The sidecar can have 0 CPU and will still get enough of it, because it's in the very same cgroup as the main container, which must have everything required for normal operation anyway (but idles). As for memory, it depends on when the validation runs. As far as I remember, the validation runs at the start of the pod and then never again, but at the start of the pod, the main ETCD cannot have consumed relevant memory resources yet, so the same argument will probably also hold for memory?
This consideration is about the events ETCD, but I am also wondering about the main ETCD. If that holds true there as well, the main ETCD sidecar needs only the resources for the backup process, as the restoration is similar to the validation phase. Nothing else needs to be considered (even 0 CPU and 0 memory would be fine, as the main container has the allotted resources and obviously isn't using them yet if a restoration is necessary). Of course, it shouldn't be 0 CPU and 0 memory because of the parallel backup process, but that's really all that needs to be calculated/considered; no validation and no restoration has to be factored in, because at those times the main container has not used and is not using any relevant resources yet. WDYT?
The sidecar can have 0 CPU and will still get enough of it, because it's in the very same cgroup as the main container, which must have everything required for normal operation anyway
All containers within a pod will be under the same cgroup hierarchy, but each container will get its own cgroup (k8s already supports cgroups v2) via a cgroup namespace. But since there are no limits defined at the container level in the etcd pod, the etcdBR container will be able to use more resources than requested and, as you stated above, if the main etcd container is not using its requested resources, those will then be made available to the etcdBR container temporarily if it needs more than requested at the container level.
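To make this concrete, here is a minimal, purely illustrative pod sketch (not the manifest rendered by etcd-druid; images and values are placeholders): with requests but no limits anywhere, the pod has Burstable QoS, so the backup-restore container may temporarily use the capacity that the idle etcd container requested but does not consume during initialization.

apiVersion: v1
kind: Pod
metadata:
  name: etcd-events-0
spec:
  containers:
  - name: etcd
    image: example/etcd              # placeholder image
    resources:
      requests:                      # requests only, no limits -> Burstable QoS
        cpu: 300m                    # idle during validation/restoration, so this
        memory: 1Gi                  # headroom is effectively free for the sidecar
  - name: backup-restore
    image: example/etcdbrctl         # placeholder image
    resources:
      requests:                      # can stay tiny: without limits the container may
        cpu: 23m                     # burst above its request as long as the node has capacity
        memory: 128Mi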
Meeting minutes from an out-of-band discussion between myself, @vlerenc, @unmarshall, @ishan16696, @renormalize, @seshachalam-yv, @ashwani2k:
- Reduce minAllowed for the etcd container for both etcd-main and etcd-events, as recommended by @vlerenc, but to 500Mi and 100Mi respectively (to be on the safer side for etcd-main); an illustrative VPA excerpt follows after this list.
- For the backup-restore container, it would be safe to reduce to 5m and 5Mi for etcd-events, as recommended by @vlerenc, since the etcd process runs only in one of the two containers at any given point in time, and the requests set aside for the etcd container can be "re-used" by the backup-restore container during initialization, so the total requests for the backup-restore container itself can be minimal; they are mainly needed for the general functioning of the backup sidecar (snapshotting when backups are enabled, and also maintenance (defragmentation)).
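As a purely illustrative aid (not the actual resource created by gardenlet, which may differ in structure), the decided minAllowed values would look roughly like this in a standard VPA resourcePolicy:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: etcd-main
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: etcd-main
  resourcePolicy:
    containerPolicies:
    - containerName: etcd
      minAllowed:
        memory: 500Mi                # decided for etcd-main; 100Mi for etcd-events
                                     # (no CPU minAllowed is set)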
Action items:
- backup-restore container for etcd-main (backups enabled): check how memory is used, especially during initialization (DB validation + possible restoration).

Here are some sample stats on the 4 ETCD containers:
test_landscape_foo:
etcd-main:
etcd:
metrics:
allocatable:
cpu: '8421'
memory: 31.1T
efficiency:
used_requested_cpu: 27 %
used_requested_memory: 39 %
requests:
allocatable_cpu: '674'
allocatable_max_cpu: 1836m
allocatable_max_memory: 28G
allocatable_memory: 3.1T
containers: 2592
pods: 2592
usage:
allocatable_cpu: '182'
allocatable_max_cpu: 1474m
allocatable_max_memory: 5.8G
allocatable_memory: 1.2T
allocatable_upper_percentile_cpu: 148m
allocatable_upper_percentile_memory: 918.6M
samples: 2588
pod_issues:
ContainerRestarted: 10
Error: 4
OOMKilled: 1
Unhealthy: 89
Unknown: 5
backup-restore:
metrics:
allocatable:
cpu: '8421'
memory: 31.1T
efficiency:
used_requested_cpu: 51 %
used_requested_memory: 49 %
requests:
allocatable_cpu: '60'
allocatable_max_cpu: 23m
allocatable_max_memory: 128M
allocatable_memory: 324G
containers: 2592
pods: 2592
usage:
allocatable_cpu: '31'
allocatable_max_cpu: 104m
allocatable_max_memory: 541.7M
allocatable_memory: 158.3G
allocatable_upper_percentile_cpu: 21m
allocatable_upper_percentile_memory: 103.2M
samples: 2588
pod_issues:
ContainerRestarted: 22
Error: 13
Unknown: 4
etcd-events:
etcd:
metrics:
allocatable:
cpu: '8421'
memory: 31.1T
efficiency:
used_requested_cpu: 14 %
used_requested_memory: 26 %
requests:
allocatable_cpu: '309'
allocatable_max_cpu: 300m
allocatable_max_memory: 2.1G
allocatable_memory: 1.3T
containers: 2592
pods: 2592
usage:
allocatable_cpu: '43'
allocatable_max_cpu: 199m
allocatable_max_memory: 2.5G
allocatable_memory: 330.7G
allocatable_upper_percentile_cpu: 38m
allocatable_upper_percentile_memory: 289.8M
samples: 2588
pod_issues:
ContainerRestarted: 17
Unhealthy: 69
Unknown: 17
backup-restore:
metrics:
allocatable:
cpu: '8421'
memory: 31.1T
efficiency:
used_requested_cpu: 35 %
used_requested_memory: 26 %
requests:
allocatable_cpu: '60'
allocatable_max_cpu: 23m
allocatable_max_memory: 128M
allocatable_memory: 324G
containers: 2592
pods: 2592
usage:
allocatable_cpu: '21'
allocatable_max_cpu: 23m
allocatable_max_memory: 101.8M
allocatable_memory: 84.7G
allocatable_upper_percentile_cpu: 13m
allocatable_upper_percentile_memory: 40.1M
samples: 2588
pod_issues:
ContainerRestarted: 22
Error: 5
Unknown: 17
test_landscape_bar:
etcd-main:
etcd:
metrics:
efficiency:
used_requested_cpu: 21 %
used_requested_memory: 40 %
requests:
allocatable_cpu: '1945'
allocatable_max_cpu: 4000m
allocatable_max_memory: 30G
allocatable_memory: 8.3T
containers: 6684
pods: 6684
usage:
allocatable_cpu: '415'
allocatable_max_cpu: 1256m
allocatable_max_memory: 4.4G
allocatable_memory: 3.3T
allocatable_upper_percentile_cpu: 112m
allocatable_upper_percentile_memory: 945M
samples: 6683
pod_issues:
ContainerRestarted: 26
Error: 1
OOMKilled: 5
Unhealthy: 69
Unknown: 15
backup-restore:
metrics:
efficiency:
used_requested_cpu: 37 %
used_requested_memory: 41 %
requests:
allocatable_cpu: '154'
allocatable_max_cpu: 23m
allocatable_max_memory: 128M
allocatable_memory: 835.5G
containers: 6684
pods: 6684
usage:
allocatable_cpu: '57'
allocatable_max_cpu: 146m
allocatable_max_memory: 828M
allocatable_memory: 341.3G
allocatable_upper_percentile_cpu: 15m
allocatable_upper_percentile_memory: 81M
samples: 6684
pod_issues:
ContainerRestarted: 33
Error: 16
Unknown: 13
etcd-events:
etcd:
metrics:
efficiency:
used_requested_cpu: 40 %
used_requested_memory: 55 %
requests:
allocatable_cpu: '307'
allocatable_max_cpu: 300m
allocatable_max_memory: 4.3G
allocatable_memory: 1.8T
containers: 6684
pods: 6684
usage:
allocatable_cpu: '124'
allocatable_max_cpu: 331m
allocatable_max_memory: 2G
allocatable_memory: 1022.4G
allocatable_upper_percentile_cpu: 32m
allocatable_upper_percentile_memory: 282.9M
samples: 6684
pod_issues:
ContainerRestarted: 86
Unhealthy: 38
Unknown: 86
backup-restore:
metrics:
efficiency:
used_requested_cpu: 26 %
used_requested_memory: 29 %
requests:
allocatable_cpu: '154'
allocatable_max_cpu: 23m
allocatable_max_memory: 128M
allocatable_memory: 835.5G
containers: 6684
pods: 6684
usage:
allocatable_cpu: '41'
allocatable_max_cpu: 19m
allocatable_max_memory: 136M
allocatable_memory: 240G
allocatable_upper_percentile_cpu: 10m
allocatable_upper_percentile_memory: 41.8M
samples: 6684
pod_issues:
ContainerRestarted: 100
Error: 14
Unknown: 82
Unless we want to put the backup-restore sidecar under VPA, the 95th percentiles for CPU and memory are ~20m and ~100M (called upper_percentile above). If we went for the maximum instead, it would be almost 10x as much and far beyond the current requests.
In summary, we already decided to set smaller requests for the etcd-events sidecar container: from 23m and 128Mi to 5m and 5Mi.
And now we probably can also lower the requests for the etcd-main sidecar container slightly, if you feel comfortable with the 95th percentile: from 23m and 128Mi to 20m and 100Mi.
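Condensed into one overview (not an actual manifest), and assuming we go with the 95th percentile for etcd-main, that would mean:

etcd-events:
  backup-restore:
    requests:
      cpu: 5m                        # was 23m
      memory: 5Mi                    # was 128Mi
etcd-main:
  backup-restore:
    requests:
      cpu: 20m                       # was 23m, ~95th percentile of observed usage
      memory: 100Mi                  # was 128Mi, ~95th percentile of observed usage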
FYI: I updated the ticket main description based on the above numbers.
@shreyas-s-rao I hope we can get the first set of changes in quickly (the sidecar changes only when we change the StatefulSet anyway)? Those are relatively straightforward, with minimal risk (if any at all), and are all data-driven. Whatever is more complicated can be done later, but let's get those "trivial" number changes in. That would be great.
Yes @vlerenc . Changes from this issue as well as #767 are to be made in g/g codebase, after we make the next druid release and update that in g/g (to avoid multiple rollouts of etcd). The next druid release will include the large PR #777 , hence it might take some more time for it to be reviewed and merged before we can make a druid release. Hope the delay is ok.
If it's still urgent, and since it technically doesn't depend on the new druid changes, and will help in immediate cost savings, we can make the changes immediately in g/g, which will go in the next g/g release. WDYT?
It depends on what the chances are for the review @shreyas-s-rao. :-) It's not urgent and we'd like to bundle our changes to roll less often. If the review process takes years (like with GEP-23), I'd rather get it in before that. If it's weeks, then don't rush it. If it's months, let's talk how many.
Haha sure. It's definitely just weeks, so we could possibly wait.
Hello, do we have an ETA on reducing the ETCD requests? This task overlaps with an ongoing, less focused initiative to optimise the overall resource configuration for the garden control plane. Ideally, the ETCD part should be covered here, so it can benefit from expert knowledge.
Hi @shreyas-s-rao, is there any update on the first 2 bullet points (I dropped the last bullet points because they are tracked elsewhere now), i.e. right-sizing minAllowed and the initial requests?
@vlerenc as mentioned earlier, this change needs to be made on g/g, and ideally clubbed with an etcd rollout, like the upcoming druid:v0.23.0 release. I will make the change to the requests for backup sidecar when integrating druid v0.23.0 into g/g. Hope that's ok with you. If not, then we can still make the change immediately on g/g, which will cause a second rollout of the etcd pods.
Hi @shreyas-s-rao. Sure, we can club it together. I am only after getting it done - nothing has to be done immediately, but eventually would be nice (there are so many small items everywhere and this one here is from April).
BTW, you said requests for the sidecars (second part), but the first part is about minAllowed for the main containers. I am fine if you do them together with the requests changes, but a.) you didn't mention the minAllowed changes in your reply and b.) they will not roll anything. Just for transparency (in this ticket, so that I know when to look again): do you intend to do them immediately or together with the other changes?
@shreyas-s-rao @ashwani2k @unmarshall Will this come with 0.23? I see the values are still unchanged and 0.23 is already in.
Not changing the values as recommended in this ticket skews the data and the actual requests on all our landscapes. I see in the data that we have an accumulation at 700M resp. 200M, and through that and through the initial requests we waste money.
Worse, it skews the CPU-to-memory ratio and blocks machine type optimizations.
I know I said that it's not critical, and it wasn't, but considering that this ticket has now been open for almost half a year, please make it a priority to change the values now. Not in one month, but right now.
We have hundreds of components that can be optimised, but because ETCD is a large one - with a dedicated pool even - I opened a ticket just for this component for you, with new values extracted from data. It actually matters how many resources it consumes, that the actual CPU-to-memory requests are not skewed by a wrong minAllowed, and that overall the usage drops to reasonable values through the right minAllowed and initial requests. It's time, if you haven't done it already. And if you have, apologies.
Hey @vlerenc. The plan is (and always was) to change these values in the g/g integration PR which upgrades druid to v0.23.0. Apologies if I previously implied that the changes you recommended would be done to the default values in druid itself.
Since g/g is a consumer of druid, and the consumption is not simply a statefulset with specific initial resource requests but also a VPA with minAllowed values - with the VPA resource deployed by gardenlet - it makes sense to make the recommended changes from this ticket in the gardenlet code that creates the Etcd and VPA resources, rather than in druid's defaults (ideally gardener (or any code, for that matter) should not rely on default values from upstream components).
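For orientation, a rough sketch of where these values surface (field names are quoted from memory of the druid.gardener.cloud/v1alpha1 API and may not be exact; values are illustrative): gardenlet fills in the Etcd resource, from which druid renders the StatefulSet, and deploys the VPA alongside it.

apiVersion: druid.gardener.cloud/v1alpha1
kind: Etcd
metadata:
  name: etcd-main
spec:
  etcd:
    resources:
      requests:                      # set by gardenlet, not taken from druid defaults
        cpu: 300m                    # illustrative
        memory: 1Gi                  # illustrative
  backup:
    resources:
      requests:                      # the sidecar requests this ticket proposes to lower
        cpu: 23m
        memory: 128Mi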
Hope this clears the air. Apologies for any previous miscommunication on my part.
@shreyas-s-rao Yes, you explained that, but I do not see a PR - all I see is https://github.com/gardener/gardener/pull/10506.
Can you please point me to it? I pulled the code prior to writing my comments and still see the old values, e.g. in pkg/component/etcd/etcd/etcd.go, and no PR that would change them. Am I looking in the wrong place?
Yet to raise the PR on the repo. Running some final tests. But rest assured, the changes from this ticket will be part of the PR.
Thank you @shreyas-s-rao and apologies for not asking first out-of-band.
How to categorize this issue?
/kind enhancement
Stories
What would you like to be added
- Reduce minAllowed for memory (CPU has none, so nothing to change), as we see landscapes with a very high churn rate and many thousands of very small clusters (Gardener makes it extra easy to get a new cluster within minutes, only to keep it for a day of development or even just some minutes to run a test suite):
  - etcd-main from 700M down to 300M
  - etcd-events from 200M down to 60M
- Reduce the requests for the static sidecar containers:
  - etcd-main/backup-restore from 23m and 128Mi to 20m and 80Mi
  - etcd-events/backup-restore from 23m and 128Mi to 10m and 40Mi (or drop it from etcd-events, i.e. integrate the validation into the main container, but that seemed difficult in the past without rebuilding the image)
Motivation (Why is this needed?)
We see that the requested CPU and memory are not optimally used:
- etcd-main uses ~30% of its requested CPU and ~20% of its requested memory (atypical, normally CPU is less well used)
- etcd-events uses ~10% of its requested CPU and ~5% of its requested memory (atypical, normally CPU is less well used)