New default machine types and profile list options - sharing nodes is great!

consideRatio commented 1 year ago

Our clusters ships with a default set of machine types, and we provide a KubeSpawner profile_list to start pods on each type where there is a 1:1 relation between nodes:pods. For each node, there is only one user pod.

I thinnk a 1:n nodes:pods relationship should be the default, not 1:1! With 1:n, I think we will provide a far better experience for users and for ourselves since it will reduce the amount of support tickets we will get I think. I think I can motivate this thoroughly, but to not write a long post, let me propose new defaults instead.

Proposed new defaults

High memory machine types - Intel or AMD UPDATE: Now tracked via #2511, #3210 Let's provide high memory machine types instead of normal machine types as memory is the limitation when sharing nodes.
- I suggest the either the n2-highmem-4 / e2-highmem-4 on GCP and r5.xlarge / r5a.xlarge as the smallest node types for GCP / AWS.
- Note that n1-highmem machines doesn't have 1:8 ratio between CPU:Memory and shouldn't be considered, the n2 machines are also having more performant CPUs. Only when GPU's are needed, we must use n1 as compared to n2.
Intel or AMD UPDATE: we stick with intel The difference between n2/e2 and r5/r5a on GCP and AWS is between Intel and AMD processors, where the AMD processors are ~10% and ~30% cheaper on GCP and AWS respectively. I suggest we default to AMD unless we foresee common issues. EDIT: It was GCP that was 30% cheaper, and AWS 10% cheaper.
4 / 16 / 64 CPU machines - no in-between choices UPDATE: Tracked via #3256 Provide machine types that increment 4x in size to simplify a lot, where we start at 4 CPU, then 16 CPU, then 64 CPU.
1:n request of a machine, where n is 1 / 4 / 16 UPDATE: tracked via #3030 We default to provide a profile list that include three choices representing the three machine types, but for each, we allow the user to specify the share they want. Do they want a dedicated server, a 4th of a server, or a 16th of a server?
About how requests/limits are set UPDATE: tracked via #3030 The CPU requests/limits and Memory requests/limits should become as below, where n is 1 for dedicated, 4 for a 4th of a server etc.
- CPU requests 1/n, CPU limits 1
- Memory requests 1/n, Memory limits 1/n When calculating this, we should account for some important pods need to run on each node, and save small space for them.

Please review with :+1: or comment

I'd love to be actionable about this as I think its a super important change, but for me to feel actionable I'd like to see agreement on the direction. @2i2c-org/engineering could you give a :+1: to this post if you are positive to the changes suggested or leave a comment describing what you think?

If possible, please also opine if we should make use of AMD servers by default on GCP and AWS respectively. I'm not sure at all, intel is more tested, but 30% savings on AWS is a big deal.

References

AWS instance types and pricing, ~10% lower with AMD, r5a.xlarge / r5a.4xlarge / r5a.16xlarge
GCP machine families and pricing, ~30% lower with AMD, e2-highmem-4 / e2-highmem-16 / n2-highmem-64
- Note that GCP doesn't have e2-highmem-64 as an option.
Azure instance types and pricing, 10% lower with AMD, Standard_E4a_v4 / Standard_E16a_v4 / Standard_E64a_v4
I recall mybinder.org-deploy has provided ~100 users per n1-highmem-16 machine or similar in the past, and such sharing has been absolutely essential for a good cost optimization.

Motivation

:arrow_down: cloud cost, average time waited for a node to startup and image to be pulled, cloud quota increase requests, overall user experience, fewer node pools and maintenance time when upgrading k8s etc
:+1: when a user is asking if there is a way for the user to pre-warm nodes, I'd like to answer yes - you can ask users to start 1/16 of a 16 CPU server and start one like that yourself. But, if node sharing isn't an option for this hub yet, then the answer is no.
:+1: we can push the boundaries of participants if users are used to not overprovisioning capacity. In this request we have 17 GB nodes for 150-300 users. Maybe if we provided shared options by default, this request could have been different and easier to handle within cloud quotas. I'm proposed to use a 64 CPU / 512 GB memory machine, shared by 32 users, each granted 2 CPU and 16 GB of memory each at least to be safe, but it would be reasonable to have 100 users on a machine like this for most events.
:+1: providing a user placeholder pod sized like ~1/4 of a small node could be a great user experience improvement for hubs if users appreciate sharing the small node as well, which I think many will be - 4 CPU and 16 GB memory is more than you need almost always. If only dedicated nodes are used, the benefit of user placeholder pods go down significantly.
:+1: I think this could become a good blog post after its implemented and feedback has arrived. Note that this is made possible by our ability to invest time to deliberate about machine type choices and Yuvi's contribution to KubeSpawner allowing for the ability to choose among options.
:+1: I've gained experience of this for the JMTE hub, and providing 4/16/64 and sharing getting dedicated/4th/16th of a server has has worked out great overall
:+1: When using node sharing we get fewer nodes, and with fewer nodes we reduce issues like public ip quota limitations and prometheus-server overloaded with data from nodes CPU/Memory use

:+1: in this ticket a user ended up wanting to try using a 64 CPU node, but we only had 16 CPU nodes setup by default. If we had 4 / 16 / 64 CPU nodes by default instead of 2 / 4 / 8 / 16 the hub users would have a bit more flexibility.

jmunroe commented 1 year ago

Could we discuss this at our next prod/eng meeting? Your experience tells me this will be a good default for many cases but I am not completely clear on who is making the choice of machine type.

For some hubs, giving this discretion to the users makes sense but for others (such as educational settings or events) i am not as clear.

Is it possible for hub-admin to have any additional control over machine types that users wouldn't be able to see?

Regarding the cost/value, I think there is lots of potential work to do in this area to translate from what the cloud providers offer to what our end-users see. That's a different issue but one I think we can work in on behalf of our community partners. With so many different machine types, I think we need to be able to provide advice and recommendations.

consideRatio commented 1 year ago

Thank you @jmunroe for reading and considering this. Damián has scheduled this for discussion on the next prod/eng meet!

I am not completely clear on who is making the choice of machine type.

Currently, we are making the choice of what machine types the cloud should be able to start on demand, and what server options end users are presented with - often mapped to separate machine types capacity in CPU/RAM. Currently we could get a request for a specific machine type, but we are not actively asking for input about machine types.

I'd like our default provided machine types and the options to start servers against them to be so good that users don't find themselves constrained and need to request something extra. At least if it is only a matter of CPU and Memory as compared to a need for attached GPUs.

Is it possible for hub-admin to have any additional control over machine types that users wouldn't be able to see?

I'll provide a short response, but I want us to not linger on this topic as I think its out of scope for this issue.

Yes - it is possible to provide all kinds of logic to make certain users see certain options, as long as we can access state about this. The key issue is that it is tricky to provide this configuration in a standardized way across hubs using different authenticators etc. I've implemented this elsewhere in, making the choice of GPU servers show up only to a few people for example. How I did it is documented in this forum post.

With so many different machine types, I think we need to be able to provide advice and recommendations.

I agree we should write about this to some degree. I think we should settle for clarifying why we made our choice. After that, if its seen as pressing, we could also provide guidance for users to make their own choices, but I don't think we should start out with that. They are better of not needing to learn about this. If they already have learned and have opinions we can adjust to them still though!

choldgraf commented 1 year ago

I'm not technically knowledgeable enough to assess your technical suggestions, but they seem reasonable to me and I'll defer to the judgment of the @2i2c-org/engineering team 👍

For the decision in general, it sounds good to me as long as we can assume a baseline level of technical competence from users. This feels like a good feature for research-focused hubs, but not for educational hubs where users have no concept of a "node", "how much CPU they need", "how much RAM they need", etc. For those users, I think they just need to click "launch" and they get a user session without any choices at all. Is it possible to support both this more complex workflow @consideRatio and a simplified one for communities that have less-technical users?

GeorgianaElena commented 1 year ago

Thank you so much for putting so much time and effort into this @consideRatio ❤️

Hope you don't mind that I will ask some questions to make sure I fully understand it

Questions

1.

a KubeSpawner profile_list to start pods on each type where there is a 1:1 relation between nodes:pods. For each node, there is only one user pod.

@consideRatio this 1:1 mapping is actually enforced by setting guarantees (specifically mem_guarantee, not sure if there's a cpu guarantee also?) very close the higher bound of the node's capacity, inkubespawner_override, right?

If the above is true, then technically, to achieve a 1:n relationship, we could relax the guarantees to fit more pods, right? But this wouldn't necesarly be useful in practice, because the current node sizes are small, or rather normal sized?

And this is the motivation behind changing the machine types available to some that are bigger and better fit to be shared in a high memory usage scenario.

2.

New default machine types and profile list options - sharing nodes is great!

This proposal is mainly for allowing/leveraging sharing nodes when using profile lists, right? When there's no options for the users to choose from, pods share do share nodes, right?

2.

We default to provide a profile list that include three choices representing the three machine types, but for each, we allow the user to specify the share they want. Do they want a dedicated server, a 4th of a server, or a 16th of a server?

I fear this might be confusing for some? and maybe it would be better to replace it with a combination:

the actual resources they would get (i.e. what resources a 4th of server means)?
ok or not with sharing a server?

Also, because you can get 4GB by requesting 4th of a 16GB server, but also by requesting half of an 8GB server, then maybe by not strictly linking a pod to a specific machine type, we can achieve better packing of pods on nodes.

consideRatio commented 1 year ago

Thanks @GeorgianaElena for thinking about this!!

1.

You got it! We would enforce how many pods should be put on the same node by CPU/Memory requests. I tried to clarify what the requests/limits I propose to accomplish this in point 5.

Since memory is harder to share than CPU, I also proposed machine types with high memory in relation to the CPU.

2.

New default machine types and profile list options - sharing nodes is great!

This proposal is mainly for allowing/leveraging sharing nodes when using profile lists, right? When there's no options for the users to choose from, pods share do share nodes, right?

Technically, pods are always scheduled to share nodes if possible, but they end up being isolated in dedicated nodes if they request so much CPU or memory that they won't fit next to each other. So, no matter if a profile_list is used and only one machine type is assumed - node sharing will be dictated by if one or more pods can fit on the same node given the node capacity and the pods requested CPU/Memory.

Practically I'd argue that hubs only presented with one machine option should also default to a degree of node sharing on a highmem node as well. If only one option is provided, it should be tailored to the amount of users etc. The more users, the larger nodes and the larger degree of sharing I think.

3.

I fear this might be confusing for some? and maybe it would be better to replace it with a combination:

the actual resources they would get (i.e. what resources a 4th of server means)?

If I understand you correctly, I agree! We should avoid presenting options to users as "one forth" or "25%", but instead present what they at least get in CPU and Memory.

We should still make sure they also know what machine type is used, because at least for 4 out of 4 available CPU is different from at least 4 out of 16 available CPU.

yuvipanda commented 1 year ago

BIG +1 here, for everything that has a profileList. I think educational hubs should not have profileLists by default, as the users don't know what these are. But everything that does have one - plus one on implementing this.

For history, we stole the current default from https://github.com/pangeo-data/pangeo-cloud-federation, which was mostly just 'there'. I think the proposed setup here is a clear unconditional positive, and we should accompany it with documentation as well so users can understand why these are the way they are too.

yuvipanda commented 1 year ago

I want to pick out one as a specific item here that would be uncontroversial - I think that's switching to AMD nodes by default, especially on AWS.

yuvipanda commented 1 year ago

Another would be to offer a small 'shared' option as default for all our research hubs, with a small pod that lands on a smallish node

consideRatio commented 1 year ago

For reference, I've written about this topic in our product and engineering meeting notes for the meeting 2023-02-14.

consideRatio commented 1 year ago

Work in progress notes!

I wanted to get started writing down some preliminary notes on cpu/memory requests/limits for user pods that could make sense when using node sharing. I think its a very complicated topic where we want to find a balance between many things to optimize for. The text below is a brief start on writing down ideas, but I don't want to prioritize writing more about this atm since there is many topics I'd like to work on.

Planning user pods' requests / limits

I suggest we provide a profile list with options to choose from pre-defined cpu and memory requests/limits for efficient node sharing, like piloted in openscapes and linked-earth currently, but what are good cpu/memory requests/limits to set?

Scheduling

When a pod is scheduled requests are very important, but limits isn't. If the combined requests from pods fit on a node, it can be scheduled there. We have our user pods schedule on user dedicated nodes, that though also needs to run some additional pods per node such as kube-proxy.

Pods like kube-proxy adds an overhead of cpu and memory requests for each node, indepented of the node's capacity. So if we use for a 2 core node and a 16 core node, we may get 1.8 and 15.8 CPU available to request. The smaller node received a smaller capacity of available cpu to request than the larger node per cpu on the node.

Parts of available capacity

Ideas on a fractional cpu request

Fail to schedule by RAM requests If we want to fit 10 users on a 10 GB node, the idea is to let all users requests 10 GB minus the overhead and some margin divided by 10.
Never fail to schedule based on CPU request - use a fixed fraction If we want to fit 10 users on a 10 CPU node, the idea is to not let all users requests 1 CPU each, but instead 0.1 CPU each. Like this:
- let system pods CPU needs get prioritized if there is a CPU shortage, as their requests are not fractionalized
- we can disregard the system pods' request overhead and not risk failing to schedule one user per node's CPU, which reduces complexity
- we aren't known to run into trouble from software self-imposed limitations, see #2228
Keep an eye out for software imposed throttling While a pod's container can make use of as much CPU as the node has or the container is limited to, its possible for it to choose to make use of less. This was explored in #2228, where I conclude that distributed doesn't choose to make use of less than whats available just because we provide a low CPU request.

gcp: # as explored on GKE 1.25
  # The e2-highmem options are 99.99% like the n2-highmem options
  n2-highmem-4: # 29085928Ki, 27.738Gi, 29.783G
    allocatable:
      cpu: 3920m
      ephemeral-storage: "47060071478"
      memory: 29085928Ki
    capacity:
      cpu: "4"
      ephemeral-storage: 98831908Ki
      memory: 32880872Ki
  n2-highmem-16: # 122210676Ki, 116.549Gi, 125.143G
    allocatable:
      cpu: 15890m         
      ephemeral-storage: "47060071478"
      memory: 122210676Ki
    capacity:
      cpu: "16"
      ephemeral-storage: 98831908Ki
      memory: 131919220Ki
  n2-highmem-64: # 510603916Ki, 486.949Gi, 522.858G
    allocatable:
      cpu: 63770m
      ephemeral-storage: "47060071478"
      memory: 510603916Ki
    capacity:
      cpu: "64"
      ephemeral-storage: 98831908Ki
      memory: 528365196Ki
  pods_overhead: |
    Namespace                   Name                                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
    ---------                   ----                                                  ------------  ----------  ---------------  -------------  ---
    kube-system                 calico-node-vbvg8                                     100m (0%)     0 (0%)      0 (0%)           0 (0%)         2m34s
    kube-system                 fluentbit-gke-rb4n8                                   100m (0%)     0 (0%)      200Mi (0%)       500Mi (0%)     2m34s
    kube-system                 gke-metadata-server-d8nnp                             100m (0%)     100m (0%)   100Mi (0%)       100Mi (0%)     2m34s
    kube-system                 gke-metrics-agent-7lnb2                               6m (0%)       0 (0%)      100Mi (0%)       100Mi (0%)     2m34s
    kube-system                 ip-masq-agent-6sk7j                                   10m (0%)      0 (0%)      16Mi (0%)        0 (0%)         2m34s
    kube-system                 kube-proxy-gke-leap-cluster-nb-large-d1625cd8-0gc5    100m (0%)     0 (0%)      0 (0%)           0 (0%)         2m33s
    kube-system                 netd-ntd7r                                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         2m34s
    kube-system                 pdcsi-node-x6lvw                                      10m (0%)      0 (0%)      20Mi (0%)        100Mi (0%)     2m34s
    support                     support-cryptnono-sz7tp                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         2m14s
    support                     support-prometheus-node-exporter-8t74g                0 (0%)        0 (0%)      0 (0%)           0 (0%)         2m14s

    Resource                   Requests    Limits
    --------                   --------    ------
    cpu                        426m (0%)   100m (0%)
    memory                     436Mi (0%)  800Mi (0%)
aws: # as explored on EKS 1.24
  r5.xlarge: # 31391968Ki, 29.937Gi, 32.145G
    allocatable:
      cpu: 3920m
      ephemeral-storage: "76224326324"
      memory: 31391968Ki
    capacity:
      cpu: "4"
      ephemeral-storage: 83873772Ki
      memory: 32408800Ki
  r5.4xlarge: # 127415760Ki, 121.513Gi, 130.473G
    allocatable:
      cpu: 15890m
      ephemeral-storage: "76224326324"
      memory: 127415760Ki
    capacity:
      cpu: "16"
      ephemeral-storage: 83873772Ki
      memory: 130415056Ki
  r5.16xlarge: # 513938668Ki, 490.130Gi, 526.273G
    allocatable:
      cpu: 63770m
      ephemeral-storage: "76224326324"
      memory: 513938668Ki
    capacity:
      cpu: "64"
      ephemeral-storage: 83873772Ki
      memory: 522603756Ki
  pods_overhead: |
    Namespace                   Name                                      CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
    ---------                   ----                                      ------------  ----------  ---------------  -------------  ---
    amazon-cloudwatch           fluent-bit-w464b                          500m (3%)     0 (0%)      100Mi (0%)       200Mi (0%)     7m29s
    kube-system                 aws-node-cf989                            25m (0%)      0 (0%)      0 (0%)           0 (0%)         7m29s
    kube-system                 ebs-csi-node-csht5                        30m (0%)      300m (1%)   120Mi (0%)       768Mi (0%)     7m28s
    kube-system                 kube-proxy-2wlq9                          100m (0%)     0 (0%)      0 (0%)           0 (0%)         7m28s
    support                     support-cryptnono-crqrq                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         7m8s
    support                     support-prometheus-node-exporter-6dlxw    0 (0%)        0 (0%)      0 (0%)           0 (0%)         7m8s

    Resource                    Requests    Limits
    --------                    --------    ------
    cpu                         655m (4%)   300m (1%)
    memory                      220Mi (0%)  968Mi (0%)

consideRatio commented 1 year ago

In the product and engineering meeting April 4 2023 we agreed that I will try to document these ideas for other other 2i2c engineers during Q2.

[ ] Distill insights and ideas etc about this into docs for 2i2c engineers

In https://github.com/2i2c-org/infrastructure/issues/2430#issuecomment-1497279417 I wrote a text of relevance with regards to memory requests and limits when sharing a node's capacity. Writing that up, I evolve my own opinion a bit thinking now its worth having a preliminary memory limit of 4x the requested memory for each node share choice to help prevent user server pod evictions.

sgibson91 commented 11 months ago

With the move to node sharing, we probably want to update the first line of this documentation page as I think it is no longer accurate: https://docs.2i2c.org/user/topics/data/filesystem/

EDIT: Erik added note about this to #2041

consideRatio commented 9 months ago

While this issue had a purpose to try to capture the kind of changes I think made sense, it was also a large issue with many things that is now hard to track. I've opened a lot of other issues that tracks parts of this instead, so I'm now closing this to help us focus on smaller pieces.

2i2c-org / infrastructure