2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

Discuss how to calculate overhead when setting resource allocation choices #3132

Open consideRatio opened 1 year ago

consideRatio commented 1 year ago

I chatted with Yuvi about node sharing setups, and I'd like to summarize what I think could be a good strategy to use in https://github.com/2i2c-org/infrastructure/pull/3030.

This issue is just a way to document my proposed strategy for #3030, and it doesn't have an action point to track by itself. Let's close it once this has been considered and #3030 has been merged.

Terminology

Assumptions

Strategy idea

  1. Collect data about node capacity and node allocatable from each cluster, verify that capacity isn't changing, and learn if allocatable changes notably between clusters of different k8s versions etc.
  2. Collect data about non-user pods' requests per node from each cluster in a table, then summarize the max value found among EKS, GKE, and AKS clusters, and then also the max value across all of them.
  3. Based on the variation in the observed non-user pods' requests per node, come up with a per-node safety margin for the non-user pods' requests.
  4. Determine what we consider available for user pods based on the formula a - b - c (see the sketch after this list), where a, b, and c are:
    • a) node allocatable
    • b) global or cloud provider specific maximum of the non-user pods' requests per node
    • c) global or cloud provider specific non-user pods' requests per node safety margin
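
To make the arithmetic concrete, here is a minimal sketch of the a - b - c formula in Python. The variable names and numbers are made up for illustration and are not taken from #3030:

```python
# Minimal sketch of the a - b - c formula. The numbers are illustrative only,
# not measurements from any of our clusters.

GiB = 2**30
MiB = 2**20

# a) node allocatable, as reported by k8s for a given instance type/size
node_allocatable_memory = 64 * GiB

# b) global or cloud provider specific maximum of the non-user pods'
#    requests per node, as observed in step 2
max_non_user_pods_requests_memory = 384 * MiB

# c) safety margin on top of b, based on the variation observed in step 3
non_user_pods_requests_safety_margin_memory = 128 * MiB

# What we consider available for user pods on this node
available_for_user_pods_memory = (
    node_allocatable_memory
    - max_non_user_pods_requests_memory
    - non_user_pods_requests_safety_margin_memory
)
```

The same calculation would apply to CPU.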

Motivation

This simple formula avoids a lot of complexity, and avoiding that complexity is considered more important than the gains from optimizing beyond this simple strategy.

With reduced complexity:

The drawbacks of the strategy seem limited to:

yuvipanda commented 1 year ago

Thanks for writing this, @consideRatio. This pretty much matches my intention, with one addition:

Determine what we consider available for user pods based on the formula a - b - c, where a, b, and c are:

  • a) node allocatable
  • b) global or cloud provider specific maximum of the non-user pods' requests per node
  • c) global or cloud provider specific non-user pods' requests per node safety margin

The addition is that b and c are also instance size specific, in addition to being cloud provider specific.
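
To make the difference concrete, here is a hedged sketch of the two parametrizations being discussed; the keys and numbers are invented for illustration only:

```python
MiB = 2**20

# As proposed above: b and c vary only by cloud provider (or are global)
overhead_per_provider = {
    "gke": {"b_memory": 300 * MiB, "c_memory": 128 * MiB},
    "eks": {"b_memory": 384 * MiB, "c_memory": 128 * MiB},
}

# The suggested addition: b and c also vary by instance size
overhead_per_provider_and_instance = {
    ("eks", "r5.xlarge"): {"b_memory": 384 * MiB, "c_memory": 128 * MiB},
    ("eks", "r5.4xlarge"): {"b_memory": 416 * MiB, "c_memory": 128 * MiB},
}
```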

consideRatio commented 1 year ago

I think that we should adjust what we consider available for user pods based on instance type and size, and a captures that variation in the formula a - b - c. But I don't see b and c needing to be functions of instance type/size. Do you?

Determine what we consider available for user pods based on the formula a - b - c, where a, b, and c are:

If you do, does that mean that you think this assumption doesn't hold? In what situation wouldn't it hold?

  • I assume non-user pods' requests per node will not change based on instance type (EDIT: I meant type or size here) (unless it's an instance with a GPU driver installer daemonset)
yuvipanda commented 1 year ago

But I don't see b and c needing to be functions of instance type/size. Do you?

Well, I just quickly measured this for r5.xlarge and r5.4xlarge and explicitly put that out in https://github.com/2i2c-org/infrastructure/pull/3030/files#diff-d9e835d51f5dadaa1c3c81a686a30c3472fed0695b1449414f4cd148bf8b1d0a. You'll notice that the measured_overhead is in fact different for these two instances. So yes, they do need to vary per instance.
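
For reference, here is a rough sketch of how such a per-node measurement could be done with the Python kubernetes client. This is not the actual script from #3030, and the assumption that user pods can be identified by a `jupyter-` name prefix is mine:

```python
# Hedged sketch: sum the memory requests of all non-user pods scheduled on a
# given node. Illustrative only, not the script referenced in #3030.
from kubernetes import client, config


def parse_memory_quantity(quantity: str) -> int:
    """Convert a k8s memory quantity like '128Mi' or '1Gi' to bytes."""
    units = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "K": 10**3, "M": 10**6, "G": 10**9}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)


def non_user_pod_memory_requests(node_name: str) -> int:
    """Sum memory requests (bytes) of pods on node_name, excluding user pods.

    Assumes user pods are recognizable by a 'jupyter-' name prefix, which is
    an assumption made for this illustration.
    """
    config.load_kube_config()
    v1 = client.CoreV1Api()
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node_name}"
    )
    total = 0
    for pod in pods.items:
        if pod.metadata.name.startswith("jupyter-"):
            continue  # skip user pods
        for container in pod.spec.containers:
            requests = container.resources.requests or {}
            total += parse_memory_quantity(requests.get("memory", "0"))
    return total
```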

consideRatio commented 1 year ago

But your script said "# just pick a node" - not specifically a dedicated user node, which doesn't have the misc stuff core nodes will have - right? I forgot to mention that I only consider the user nodes, where users can run, as the nodes of relevance to think about here.

If you have an example of non-user pods on a user node eating up more cpu/memory because they scheduled on an r5.xlarge rather than an r5.4xlarge, that would break my assumption.

yuvipanda commented 1 year ago

Alright, let me try to validate that the script did indeed pick a user node and not a core node.

yuvipanda commented 1 year ago

Re-running the test, I actually caught a different problem that invalidated the last test - I hadn't waited for the node to become ready! Re-running it again.

yuvipanda commented 1 year ago

@consideRatio you were right that the r5.xlarge and r5.4xlarge had the same overhead! However, this wasn't because one was a core node and the other wasn't - I was just too eager running the test, so I didn't actually wait for all the daemonsets to land there. I've now updated the script to both make sure that we are only looking at user nodes (which it was already doing, but accidentally!) and to wait a reasonable amount of time from the node being created, to make sure pods get a chance to schedule.
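
As a minimal sketch of the kind of wait described here, assuming the Python kubernetes client (not the actual script from #3030):

```python
# Hedged sketch: don't measure overhead until the node reports Ready and the
# daemonset pods have had some time to land. Illustrative only.
import time

from kubernetes import client, config


def wait_for_node_ready(node_name: str, timeout: int = 300, settle: int = 60) -> None:
    config.load_kube_config()
    v1 = client.CoreV1Api()
    deadline = time.time() + timeout
    while time.time() < deadline:
        node = v1.read_node(node_name)
        ready = any(
            c.type == "Ready" and c.status == "True"
            for c in (node.status.conditions or [])
        )
        if ready:
            # Give daemonset pods some extra time to get scheduled on the node
            time.sleep(settle)
            return
        time.sleep(5)
    raise TimeoutError(f"Node {node_name} did not become Ready within {timeout}s")
```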

I wanted to see if this is the same across different classes of nodes, so I tried it between an n1-highmem-4 and an n2-highmem-32, and found that those are different. A quick investigation, however, shows that this is not due to the different families, but because cryptnono isn't actually running in our dedicated node pools (probably from a lack of tolerations)! Should fix that.

So I'll try to validate a little more thoroughly that the overhead is in fact the same across instance types, fixing up the discrepancies I've found. And once I confirm that these numbers are the same across instance types on each cloud provider, I'll simplify the generator part of the code so it doesn't depend on instance types. I specifically want to see what GPU nodes are doing.
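
As a sketch of what a simplified generator could look like once b and c no longer depend on instance type, assuming choices are offered as halving shares of a node (an assumption for illustration, not necessarily what #3030 does):

```python
def allocation_choices(available_memory_bytes: int, available_cpu: float, n_choices: int = 5):
    """Offer 1/1, 1/2, 1/4, ... shares of what is available for user pods."""
    choices = []
    for i in range(n_choices):
        divisor = 2**i
        choices.append(
            {
                "mem_guarantee": available_memory_bytes // divisor,
                "cpu_guarantee": available_cpu / divisor,
            }
        )
    return choices
```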

But overall, unless things drastically change, I think you're right and I'll take instance type out of the equation!

Thank you for digging into this, @consideRatio!

yuvipanda commented 11 months ago

@consideRatio do you consider this resolved now? Can we close this issue?