consideRatio opened 1 year ago
Thanks for writing this, @consideRatio. This pretty much matches my intention, with one addition:
> Determine what we consider available for user pods based on the formula `a - b - c`, where a, b, and c are:
>
> a) node allocatable
> b) a global or cloud provider specific maximum of the non-user pods' requests per node
> c) a global or cloud provider specific safety margin for the non-user pods' requests per node
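As a minimal sketch of the quoted formula (all names and numbers here are invented for illustration, not measured values from the cluster):

```python
# Hypothetical sketch of the proposed a - b - c formula; the numbers
# below are invented placeholders, not measured values.

def available_for_user_pods(allocatable, non_user_requests, safety_margin):
    """Resources (e.g. memory in GiB) considered available for user pods.

    a = allocatable:       node allocatable, varies per instance type/size
    b = non_user_requests: max of the non-user pods' requests per node
    c = safety_margin:     safety margin for the non-user pods' requests
    """
    return allocatable - non_user_requests - safety_margin

# Made-up memory numbers (GiB) for a nominal 32 GiB node:
a = 28.0  # allocatable, after kubelet etc. reservations
b = 1.5   # non-user pods' requests
c = 0.5   # safety margin
print(available_for_user_pods(a, b, c))  # 26.0
```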
That b and c are also instance size specific, in addition to being cloud provider specific.
I think that we should adjust what we consider available for user pods based on instance type and size, where `a` captures that variation in the formula `a - b - c`. But I don't see us needing `b` and `c` to be functions of instance type/size. Do you?
> Determine what we consider available for user pods based on the formula `a - b - c`, where a, b, and c are:
If you do, does that mean that you think this assumption doesn't hold? In what situation wouldn't it hold?
> - I assume non-user pods' requests per node will not change based on instance type (EDIT: I meant type or size here) (unless it's an instance with a GPU driver installer daemonset)
> But I don't see us needing `b` and `c` to be functions of instance type/size. Do you?
Well, I just quickly measured this for r5.xlarge and r5.4xlarge and explicitly put that out in https://github.com/2i2c-org/infrastructure/pull/3030/files#diff-d9e835d51f5dadaa1c3c81a686a30c3472fed0695b1449414f4cd148bf8b1d0a. You'll notice that the `measured_overhead` is in fact different for these two instances. So yes, they do need to vary per instance.
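To make the measurement concrete, here is a hypothetical sketch of how per-node overhead could be computed: sum the resource requests of every non-user pod scheduled on the node. The pod data below is invented; a real script would read it from the Kubernetes API, and the `is_user_pod` flag stands in for however user pods are identified:

```python
# Hypothetical sketch: compute a node's overhead by summing the memory
# requests of all non-user pods on it. Pod data is invented; a real
# script would fetch it from the Kubernetes API.

def parse_memory(quantity):
    """Parse a Kubernetes memory quantity like '64Mi' or '1Gi' to bytes."""
    units = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count

def measured_overhead(pods):
    """Total memory (bytes) requested by non-user pods on a node."""
    return sum(
        parse_memory(c["requests"]["memory"])
        for pod in pods
        if not pod["is_user_pod"]
        for c in pod["containers"]
    )

# Invented daemonset-like pods scheduled on a single node:
pods = [
    {"is_user_pod": False, "containers": [{"requests": {"memory": "64Mi"}}]},
    {"is_user_pod": False, "containers": [{"requests": {"memory": "128Mi"}}]},
    {"is_user_pod": True, "containers": [{"requests": {"memory": "1Gi"}}]},
]
print(measured_overhead(pods) // 2**20)  # 192 (MiB)
```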
But your script said `# just pick a node`, not a dedicated user node, which won't have the misc stuff core nodes have, right? I forgot to mention that I only consider user nodes (where users can run) as the nodes of relevance here.
If you have an example of non-user pods on a user node eating up more cpu/memory because they scheduled on a r5.xlarge rather than a r5.4xlarge, that would break my assumption.
Alright, let me try to validate that the script did indeed pick a user node and not a core node.
Re-running the test, I actually caught a different problem that invalidated the last test - I hadn't waited for the node to become ready! Re-running it again.
@consideRatio you were right that the r5.xlarge and r5.4xlarge had the same overhead! However, this wasn't because one was a core node vs not - I was just too eager running the test, so I didn't actually wait for all the daemonsets to land there. I've now updated the script to both make sure that we are only looking at user nodes (which it was already doing, but accidentally!) and to wait a reasonable amount of time after the node is created, so pods get a chance to schedule.
I wanted to see if this is the same across different classes of nodes, so I tried it between an n1-highmem-4 and n2-highmem-32, and found those are different. A really quick investigation, however, shows that this is not due to the different families, but because cryptnono isn't actually running in our dedicated node pools (probably due to a lack of tolerations)! Should fix that.
So I'll try to validate that the overhead is in fact the same across instance types a little more thoroughly, fixing up the discrepancies I've found. And once I confirm that these numbers are the same across instance types on cloud providers, I'll simplify the generator part of the code to not have this depend on instance types. I specifically want to see what GPU nodes are doing.
But overall, unless things drastically change, I think you're right and I'll take instance type out of the equation!
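The simplification discussed above can be sketched as follows: once the overhead is confirmed to be identical across instance types, it only needs to vary per cloud provider, not per instance type. The provider names are real, but the numbers are invented placeholders, not the measured values from #3030:

```python
# Hypothetical sketch of taking instance type out of the equation:
# overhead becomes a per-cloud-provider constant. Numbers are invented
# placeholders, not measured values.

OVERHEAD_PER_PROVIDER_MIB = {
    "aws": 200,
    "gcp": 250,
    "azure": 300,
}

def overhead_mib(provider):
    """Overhead in MiB, independent of instance type/size."""
    return OVERHEAD_PER_PROVIDER_MIB[provider]

print(overhead_mib("gcp"))  # 250
```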
Thank you for digging into this, @consideRatio!
@consideRatio do you consider this resolved now? Can we close this issue?
I chatted with Yuvi about node sharing setups, and I'd like to summarize what I think could be a good strategy to use in https://github.com/2i2c-org/infrastructure/pull/3030.
This issue is just a way to document my proposed strategy for #3030, and it doesn't have an action point to track by itself. Let's close it once this has been considered and 3030 has been merged.
## Terminology

- **capacity**: the total resources of a node, as seen when using `kubectl describe node ...`. Note that while `r5.xlarge` and `n2-highmem-4` are both listed as 4 CPU 32GB, they have different capacity in memory.
- **allocatable**: capacity minus what is reserved for `kubelet` etc running on the node.

## Assumptions

- I assume non-user pods' requests per node will not change based on instance type or size (unless it's an instance with a GPU driver installer daemonset).

## Strategy idea

Determine what we consider available for user pods based on the formula `a - b - c`, where a, b, and c are:

a) node allocatable
b) a global or cloud provider specific maximum of the non-user pods' requests per node
c) a global or cloud provider specific safety margin for the non-user pods' requests per node

## Motivation
We can avoid a lot of complexity with this simple formula, and avoiding the complexity is considered more important than the gains from optimizing beyond this simple strategy.
With reduced complexity:

Drawbacks of the strategy seem limited to: