consideRatio opened 1 year ago
Thanks for writing this, @consideRatio. This pretty much matches my intention, with one addition:
> Determine what we consider available for user pods based on the formula `a - b - c`, where a, b, and c are:
>
> a) node allocatable
> b) a global or cloud provider specific maximum of the non-user pods' requests per node
> c) a global or cloud provider specific safety margin for the non-user pods' requests per node
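As a minimal sketch of the quoted formula (all names and numbers here are invented for illustration, not measured values from the cluster):

```python
# Hypothetical sketch of the proposed a - b - c formula; the numbers
# below are invented placeholders, not measured values.

def available_for_user_pods(allocatable, non_user_requests, safety_margin):
    """Resources (e.g. memory in GiB) considered available for user pods.

    a = allocatable:       node allocatable, varies per instance type/size
    b = non_user_requests: max of the non-user pods' requests per node
    c = safety_margin:     safety margin for the non-user pods' requests
    """
    return allocatable - non_user_requests - safety_margin

# Made-up memory numbers (GiB) for a nominal 32 GiB node:
a = 28.0  # allocatable, after kubelet etc. reservations
b = 1.5   # non-user pods' requests
c = 0.5   # safety margin
print(available_for_user_pods(a, b, c))  # 26.0
```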
That b and c are also instance size specific, in addition to being cloud provider specific.
I think that we should adjust what we consider available for user pods based on instance type and size, where `a` captures that variation in the formula `a - b - c`. But I don't see us needing `b` and `c` to be functions of instance type/size. Do you?
> Determine what we consider available for user pods based on the formula `a - b - c`, where a, b, and c are:
If you do, does that mean that you think this assumption doesn't hold? In what situation wouldn't it hold?
> - I assume non-user pods' requests per node will not change based on instance type (EDIT: I meant type or size here) (unless it's an instance with a GPU driver installer daemonset)
> But I don't see us needing `b` and `c` to be functions of instance type/size. Do you?
Well, I just quickly measured this for r5.xlarge and r5.4xlarge and explicitly put that out in https://github.com/2i2c-org/infrastructure/pull/3030/files#diff-d9e835d51f5dadaa1c3c81a686a30c3472fed0695b1449414f4cd148bf8b1d0a. You'll notice that the `measured_overhead` is in fact different for these two instances. So yes, they do need to vary per instance.
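To make the measurement concrete, here is a hypothetical sketch of how per-node overhead could be computed: sum the resource requests of every non-user pod scheduled on the node. The pod data below is invented; a real script would read it from the Kubernetes API, and the `is_user_pod` flag stands in for however user pods are identified:

```python
# Hypothetical sketch: compute a node's overhead by summing the memory
# requests of all non-user pods on it. Pod data is invented; a real
# script would fetch it from the Kubernetes API.

def parse_memory(quantity):
    """Parse a Kubernetes memory quantity like '64Mi' or '1Gi' to bytes."""
    units = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count

def measured_overhead(pods):
    """Total memory (bytes) requested by non-user pods on a node."""
    return sum(
        parse_memory(c["requests"]["memory"])
        for pod in pods
        if not pod["is_user_pod"]
        for c in pod["containers"]
    )

# Invented daemonset-like pods scheduled on a single node:
pods = [
    {"is_user_pod": False, "containers": [{"requests": {"memory": "64Mi"}}]},
    {"is_user_pod": False, "containers": [{"requests": {"memory": "128Mi"}}]},
    {"is_user_pod": True, "containers": [{"requests": {"memory": "1Gi"}}]},
]
print(measured_overhead(pods) // 2**20)  # 192 (MiB)
```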
But your script said `# just pick a node`, not a dedicated user node, which won't have the misc stuff core nodes have, right? I forgot to mention that I only consider user nodes (where users can run) as the nodes of relevance here.
If you have an example of non-user pods on a user node eating up more cpu/memory because they scheduled on a r5.xlarge rather than a r5.4xlarge, that would break my assumption.
Alright, let me try to validate that the script did indeed pick a user node and not a core node.
Re-running the test, I actually caught a different problem that invalidated the last test - I hadn't waited for the node to become ready! Re-running it again.
@consideRatio you were right that the r5.xlarge and r5.4xlarge had the same overhead! However, this wasn't because one was a core node vs not - I was just too eager running the test, so I didn't actually wait for all the daemonsets to land there. I've now updated the script to both make sure that we are only looking at user nodes (which it was already doing, but accidentally!) and to wait a reasonable amount of time after the node is created, so pods get a chance to schedule.
I wanted to see if this is the same across different classes of nodes, so I tried it between an n1-highmem-4 and n2-highmem-32, and found those are different. A really quick investigation, however, shows that this is not due to the different families, but because cryptnono isn't actually running in our dedicated node pools (probably due to a lack of tolerations)! Should fix that.
So I'll try to validate that the overhead is in fact the same across instance types a little more thoroughly, fixing up the discrepancies I've found. And once I confirm that these numbers are the same across instance types on cloud providers, I'll simplify the generator part of the code to not have this depend on instance types. I specifically want to see what GPU nodes are doing.
But overall, unless things drastically change, I think you're right and I'll take instance type out of the equation!
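The simplification discussed above can be sketched as follows: once the overhead is confirmed to be identical across instance types, it only needs to vary per cloud provider, not per instance type. The provider names are real, but the numbers are invented placeholders, not the measured values from #3030:

```python
# Hypothetical sketch of taking instance type out of the equation:
# overhead becomes a per-cloud-provider constant. Numbers are invented
# placeholders, not measured values.

OVERHEAD_PER_PROVIDER_MIB = {
    "aws": 200,
    "gcp": 250,
    "azure": 300,
}

def overhead_mib(provider):
    """Overhead in MiB, independent of instance type/size."""
    return OVERHEAD_PER_PROVIDER_MIB[provider]

print(overhead_mib("gcp"))  # 250
```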
Thank you for digging into this, @consideRatio!
@consideRatio do you consider this resolved now? Can we close this issue?
I chatted with Yuvi about node sharing setups, and I'd like to summarize what I think could be a good strategy to use in https://github.com/2i2c-org/infrastructure/pull/3030.
This issue is just a way to document my proposed strategy for #3030, and it doesn't have an action point to track by itself. Let's close it once this has been considered and 3030 has been merged.
## Terminology

- **capacity**: the total resources of a node, as seen when using `kubectl describe node ...`. Note that while `r5.xlarge` and `n2-highmem-4` are both listed as 4 CPU 32GB, they have different capacity in memory.
- **allocatable**: capacity minus what is reserved for `kubelet` etc running on the node.

## Assumptions

- I assume non-user pods' requests per node will not change based on instance type or size (unless it's an instance with a GPU driver installer daemonset).

## Strategy idea

Determine what we consider available for user pods based on the formula `a - b - c`, where a, b, and c are:

a) node allocatable
b) a global or cloud provider specific maximum of the non-user pods' requests per node
c) a global or cloud provider specific safety margin for the non-user pods' requests per node

## Motivation
We can avoid a lot of complexity with this simple formula, and avoiding the complexity is considered more important than the gains from optimizing beyond this simple strategy.
With reduced complexity:

Drawbacks of the strategy seem limited to: