consideRatio closed this issue 9 months ago
Could we discuss this at our next prod/eng meeting? Your experience tells me this will be a good default for many cases but I am not completely clear on who is making the choice of machine type.
For some hubs, giving this discretion to the users makes sense, but for others (such as educational settings or events) I am not as sure.
Is it possible for hub-admin to have any additional control over machine types that users wouldn't be able to see?
Regarding the cost/value, I think there is lots of potential work to do in this area to translate from what the cloud providers offer to what our end users see. That's a different issue, but one I think we can work on on behalf of our community partners. With so many different machine types, I think we need to be able to provide advice and recommendations.
Thank you @jmunroe for reading and considering this. Damián has scheduled this for discussion on the next prod/eng meet!
> I am not completely clear on who is making the choice of machine type.
Currently, we are making the choice of what machine types the cloud should be able to start on demand, and what server options end users are presented with, often mapped to separate machine types' CPU/RAM capacities. We could get a request for a specific machine type, but we are not actively asking for input about machine types.
I'd like our default machine types, and the options to start servers on them, to be so good that users don't find themselves constrained and needing to request something extra, at least when it is only a matter of CPU and memory, as opposed to a need for attached GPUs.
> Is it possible for hub-admin to have any additional control over machine types that users wouldn't be able to see?
I'll provide a short response, but I don't want us to linger on this topic as I think it's out of scope for this issue.
Yes - it is possible to provide all kinds of logic to make certain users see certain options, as long as we can access state about this. The key issue is that it is tricky to provide this configuration in a standardized way across hubs using different authenticators etc. I've implemented this elsewhere, making the choice of GPU servers show up only for a few people, for example. How I did it is documented in this forum post.
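As a sketch of how that per-user logic can look: kubespawner allows `profile_list` to be a callable that receives the spawner, so options can be filtered there. The allow-list and profile contents below are hypothetical, not the configuration from the forum post.

```python
# Hypothetical example: show a GPU profile only to an allow-listed set of
# users. kubespawner accepts a callable for profile_list that receives the
# Spawner object; all names and numbers here are illustrative.

GPU_USERS = {"alice", "bob"}  # hypothetical allow-list

BASE_PROFILES = [
    {
        "display_name": "Small (shared node)",
        "default": True,
        "kubespawner_override": {"cpu_guarantee": 0.5, "mem_guarantee": "4G"},
    },
]

GPU_PROFILE = {
    "display_name": "GPU server",
    "kubespawner_override": {
        "mem_guarantee": "24G",
        "extra_resource_limits": {"nvidia.com/gpu": "1"},
    },
}

def dynamic_profile_list(spawner):
    """Return the profiles this particular user should see."""
    profiles = list(BASE_PROFILES)
    if spawner.user.name in GPU_USERS:
        profiles.append(GPU_PROFILE)
    return profiles

# In a hub's jupyterhub_config.py:
# c.KubeSpawner.profile_list = dynamic_profile_list
```

The trickier part, as noted above, is getting the user/group state to base the decision on in a way that works across authenticators.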
> With so many different machine types, I think we need to be able to provide advice and recommendations.
I agree we should write about this to some degree. I think we should settle for clarifying why we made our choice. After that, if it's seen as pressing, we could also provide guidance for users to make their own choices, but I don't think we should start out with that. They are better off not needing to learn about this. If they have already learned and have opinions, we can still adjust to them!
I'm not technically knowledgeable enough to assess your technical suggestions, but they seem reasonable to me and I'll defer to the judgment of the @2i2c-org/engineering team 👍
For the decision in general, it sounds good to me as long as we can assume a baseline level of technical competence from users. This feels like a good feature for research-focused hubs, but not for educational hubs where users have no concept of a "node", "how much CPU they need", "how much RAM they need", etc. For those users, I think they just need to click "launch" and they get a user session without any choices at all. Is it possible to support both this more complex workflow @consideRatio and a simplified one for communities that have less-technical users?
Thank you so much for putting so much time and effort into this @consideRatio ❤️
Hope you don't mind that I will ask some questions to make sure I fully understand it
> a KubeSpawner profile_list to start pods on each type where there is a 1:1 relation between nodes:pods. For each node, there is only one user pod.

@consideRatio this 1:1 mapping is actually enforced by setting guarantees (specifically `mem_guarantee`; not sure if there's a CPU guarantee also?) very close to the upper bound of the node's capacity, in `kubespawner_override`, right?
If the above is true, then technically, to achieve a 1:n relationship, we could relax the guarantees to fit more pods, right? But this wouldn't necessarily be useful in practice, because the current node sizes are small, or rather normal sized?
And this is the motivation behind changing the available machine types to ones that are bigger and better suited to being shared in a high-memory-usage scenario.
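That reading matches the arithmetic: the scheduler packs pods by their requests, so a guarantee near the node's allocatable memory forces 1:1, while dividing it by n allows n pods. A tiny illustrative sketch (the function name is made up, and it ignores the per-node system-pod overhead discussed later in this thread):

```python
def pods_per_node(allocatable_mem_gi: float, mem_guarantee_gi: float) -> int:
    """How many user pods fit on one node, packing purely by memory requests."""
    return int(allocatable_mem_gi // mem_guarantee_gi)

# A guarantee close to a node's ~27.7 Gi allocatable memory enforces 1:1:
assert pods_per_node(27.7, 26.0) == 1
# Relaxing the guarantee to a quarter of allocatable allows a 1:4 relation:
assert pods_per_node(27.7, 27.7 / 4) == 4
```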
New default machine types and profile list options - sharing nodes is great!
This proposal is mainly for allowing/leveraging sharing nodes when using profile lists, right? When there's no options for the users to choose from, pods do share nodes, right?
> We default to providing a profile list that includes three choices representing the three machine types, but for each, we allow the user to specify the share they want. Do they want a dedicated server, a 4th of a server, or a 16th of a server?
I fear this might be confusing for some, and maybe it would be better to replace it with a combination:
Also, because you can get 4GB by requesting 4th of a 16GB server, but also by requesting half of an 8GB server, then maybe by not strictly linking a pod to a specific machine type, we can achieve better packing of pods on nodes.
Thanks @GeorgianaElena for thinking about this!!
You got it! We would enforce how many pods should be put on the same node via CPU/memory requests. I tried to clarify the requests/limits I propose to accomplish this in point 5.
Since memory is harder to share than CPU, I also proposed machine types with high memory in relation to the CPU.
> New default machine types and profile list options - sharing nodes is great!
>
> This proposal is mainly for allowing/leveraging sharing nodes when using profile lists, right? When there's no options for the users to choose from, pods do share nodes, right?
Technically, pods are always scheduled to share nodes if possible, but they end up isolated on dedicated nodes if they request so much CPU or memory that they won't fit next to each other. So, no matter whether a profile_list is used and only one machine type is assumed, node sharing is dictated by whether one or more pods can fit on the same node, given the node's capacity and the pods' requested CPU/memory.
Practically, I'd argue that hubs presented with only one machine option should also default to a degree of node sharing on a highmem node. If only one option is provided, it should be tailored to the number of users: the more users, the larger the nodes and the larger the degree of sharing, I think.
> I fear this might be confusing for some? and maybe it would be better to replace it with a combination:
> - the actual resources they would get (i.e. what resources a 4th of a server means)?
If I understand you correctly, I agree! We should avoid presenting options to users as "one fourth" or "25%", and instead present what they will at least get in CPU and memory.
We should still make sure they also know what machine type is used, because getting at least 4 of 4 available CPUs is different from getting at least 4 of 16 available CPUs.
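As an illustration of presenting concrete resources rather than fractions, a profile entry could be generated from the machine type and the share. This is a hypothetical helper: the node sizes are nominal n2-highmem figures, node selection uses the standard `node.kubernetes.io/instance-type` label, and real guarantees would need the allocatable figures discussed later in this thread.

```python
# Hypothetical helper: build a kubespawner profile_list entry whose display
# name states the guaranteed CPU/RAM and the machine type, instead of
# "a 4th of a server". Node sizes below are nominal n2-highmem figures.

NODES = {
    "n2-highmem-4": {"cpu": 4, "mem_gi": 32},
    "n2-highmem-16": {"cpu": 16, "mem_gi": 128},
    "n2-highmem-64": {"cpu": 64, "mem_gi": 512},
}

def profile_entry(machine_type: str, share: int) -> dict:
    """One profile option granting 1/share of the given machine type."""
    node = NODES[machine_type]
    cpu = node["cpu"] / share
    mem = node["mem_gi"] / share
    return {
        "display_name": (
            f"{cpu:g} CPU, {mem:g} GB RAM ({machine_type}, 1/{share} of a node)"
        ),
        "kubespawner_override": {
            "cpu_guarantee": cpu,
            "mem_guarantee": f"{mem:g}G",
            "node_selector": {"node.kubernetes.io/instance-type": machine_type},
        },
    }
```

With this, `profile_entry("n2-highmem-16", 4)` yields a "4 CPU, 32 GB RAM" option that still names the underlying machine type.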
BIG +1 here, for everything that has a profileList. I think educational hubs should not have profileLists by default, as the users don't know what these are. But everything that does have one - plus one on implementing this.
For history, we stole the current default from https://github.com/pangeo-data/pangeo-cloud-federation, which was mostly just 'there'. I think the proposed setup here is a clear, unconditional positive, and we should accompany it with documentation so users can understand why things are the way they are.
I want to pick out one as a specific item here that would be uncontroversial - I think that's switching to AMD nodes by default, especially on AWS.
Another would be to offer a small 'shared' option as default for all our research hubs, with a small pod that lands on a smallish node
For reference, I've written about this topic in our product and engineering meeting notes for the meeting 2023-02-14.
I wanted to get started writing down some preliminary notes on CPU/memory requests/limits for user pods that could make sense when using node sharing. I think it's a very complicated topic where we want to find a balance between many things to optimize for. The text below is a brief start on writing down ideas, but I don't want to prioritize writing more about this at the moment since there are many topics I'd like to work on.
I suggest we provide a profile list with options to choose from pre-defined cpu and memory requests/limits for efficient node sharing, like piloted in openscapes and linked-earth currently, but what are good cpu/memory requests/limits to set?
When a pod is scheduled, requests are very important, but limits aren't: if the combined requests from pods fit on a node, the pod can be scheduled there. We schedule our user pods on user-dedicated nodes, which nevertheless also need to run some additional pods per node, such as `kube-proxy`.
Pods like `kube-proxy` add an overhead of CPU and memory requests to each node, independent of the node's capacity. So for a 2-core node and a 16-core node, we may get 1.8 and 15.8 CPU available to request; the smaller node loses a larger fraction of its CPU capacity to this overhead than the larger node.
`distributed` doesn't choose to make use of less than what's available just because we provide a low CPU request. I've investigated GCP `n2-highmem-[4|16|64]` inside GKE and AWS `r5.[|4|16]xlarge` inside EKS, and what remains allocatable in each case.
My conclusion is that we must provide memory requests based on each individual machine type, because there seems to be no trustworthy formula to calculate them.
```yaml
gcp: # as explored on GKE 1.25
  # The e2-highmem options are 99.99% like the n2-highmem options
  n2-highmem-4: # 29085928Ki, 27.738Gi, 29.783G
    allocatable:
      cpu: 3920m
      ephemeral-storage: "47060071478"
      memory: 29085928Ki
    capacity:
      cpu: "4"
      ephemeral-storage: 98831908Ki
      memory: 32880872Ki
  n2-highmem-16: # 122210676Ki, 116.549Gi, 125.143G
    allocatable:
      cpu: 15890m
      ephemeral-storage: "47060071478"
      memory: 122210676Ki
    capacity:
      cpu: "16"
      ephemeral-storage: 98831908Ki
      memory: 131919220Ki
  n2-highmem-64: # 510603916Ki, 486.949Gi, 522.858G
    allocatable:
      cpu: 63770m
      ephemeral-storage: "47060071478"
      memory: 510603916Ki
    capacity:
      cpu: "64"
      ephemeral-storage: 98831908Ki
      memory: 528365196Ki
  pods_overhead: |
    Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
    --------- ---- ------------ ---------- --------------- ------------- ---
    kube-system calico-node-vbvg8 100m (0%) 0 (0%) 0 (0%) 0 (0%) 2m34s
    kube-system fluentbit-gke-rb4n8 100m (0%) 0 (0%) 200Mi (0%) 500Mi (0%) 2m34s
    kube-system gke-metadata-server-d8nnp 100m (0%) 100m (0%) 100Mi (0%) 100Mi (0%) 2m34s
    kube-system gke-metrics-agent-7lnb2 6m (0%) 0 (0%) 100Mi (0%) 100Mi (0%) 2m34s
    kube-system ip-masq-agent-6sk7j 10m (0%) 0 (0%) 16Mi (0%) 0 (0%) 2m34s
    kube-system kube-proxy-gke-leap-cluster-nb-large-d1625cd8-0gc5 100m (0%) 0 (0%) 0 (0%) 0 (0%) 2m33s
    kube-system netd-ntd7r 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2m34s
    kube-system pdcsi-node-x6lvw 10m (0%) 0 (0%) 20Mi (0%) 100Mi (0%) 2m34s
    support support-cryptnono-sz7tp 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2m14s
    support support-prometheus-node-exporter-8t74g 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2m14s

    Resource Requests Limits
    -------- -------- ------
    cpu 426m (0%) 100m (0%)
    memory 436Mi (0%) 800Mi (0%)

aws: # as explored on EKS 1.24
  r5.xlarge: # 31391968Ki, 29.937Gi, 32.145G
    allocatable:
      cpu: 3920m
      ephemeral-storage: "76224326324"
      memory: 31391968Ki
    capacity:
      cpu: "4"
      ephemeral-storage: 83873772Ki
      memory: 32408800Ki
  r5.4xlarge: # 127415760Ki, 121.513Gi, 130.473G
    allocatable:
      cpu: 15890m
      ephemeral-storage: "76224326324"
      memory: 127415760Ki
    capacity:
      cpu: "16"
      ephemeral-storage: 83873772Ki
      memory: 130415056Ki
  r5.16xlarge: # 513938668Ki, 490.130Gi, 526.273G
    allocatable:
      cpu: 63770m
      ephemeral-storage: "76224326324"
      memory: 513938668Ki
    capacity:
      cpu: "64"
      ephemeral-storage: 83873772Ki
      memory: 522603756Ki
  pods_overhead: |
    Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
    --------- ---- ------------ ---------- --------------- ------------- ---
    amazon-cloudwatch fluent-bit-w464b 500m (3%) 0 (0%) 100Mi (0%) 200Mi (0%) 7m29s
    kube-system aws-node-cf989 25m (0%) 0 (0%) 0 (0%) 0 (0%) 7m29s
    kube-system ebs-csi-node-csht5 30m (0%) 300m (1%) 120Mi (0%) 768Mi (0%) 7m28s
    kube-system kube-proxy-2wlq9 100m (0%) 0 (0%) 0 (0%) 0 (0%) 7m28s
    support support-cryptnono-crqrq 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7m8s
    support support-prometheus-node-exporter-6dlxw 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7m8s

    Resource Requests Limits
    -------- -------- ------
    cpu 655m (4%) 300m (1%)
    memory 220Mi (0%) 968Mi (0%)
```
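To act on that conclusion, the measured allocatable values can be turned into per-share memory guarantees directly, subtracting the per-node system-pod requests first. A hypothetical sketch using the GKE figures above (the helper name and the choice of integer division are illustrative):

```python
# Derive per-pod memory guarantees from measured allocatable memory (Ki),
# as reported by `kubectl describe node`, minus the ~436Mi of requests made
# by kube-system/support pods per node. Figures are the GKE values above.

ALLOCATABLE_KI = {
    "n2-highmem-4": 29085928,
    "n2-highmem-16": 122210676,
    "n2-highmem-64": 510603916,
}
SYSTEM_OVERHEAD_KI = 436 * 1024  # requests by kube-system/support pods

def mem_guarantee_ki(machine_type: str, share: int) -> int:
    """Memory guarantee per pod so that `share` user pods fit on one node."""
    usable = ALLOCATABLE_KI[machine_type] - SYSTEM_OVERHEAD_KI
    return usable // share

# Four pods guaranteed this much always fit within the node's allocatable:
assert 4 * mem_guarantee_ki("n2-highmem-4", 4) <= ALLOCATABLE_KI["n2-highmem-4"]
```

Since the overhead is per node rather than per machine size, the fraction of memory lost to it shrinks as the machine type grows, which is another point in favor of sharing larger nodes.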
In the product and engineering meeting April 4 2023 we agreed that I will try to document these ideas for other 2i2c engineers during Q2.
With the move to node sharing, we probably want to update the first line of this documentation page as I think it is no longer accurate: https://docs.2i2c.org/user/topics/data/filesystem/
EDIT: Erik added note about this to #2041
While this issue had the purpose of trying to capture the kind of changes I think made sense, it was also a large issue with many things that are now hard to track. I've opened a lot of other issues that track parts of this instead, so I'm now closing this to help us focus on smaller pieces.
Our clusters ship with a default set of machine types, and we provide a KubeSpawner `profile_list` to start pods on each type, where there is a 1:1 relation between nodes and pods: for each node, there is only one user pod.
I think a `1:n` nodes:pods relationship should be the default, not `1:1`! With `1:n`, I think we will provide a far better experience both for users and for ourselves, since it will reduce the amount of support tickets we get. I think I can motivate this thoroughly, but to avoid writing a long post, let me propose new defaults instead.

Proposed new defaults

Use `n2-highmem-4`/`e2-highmem-4` on GCP and `r5.xlarge`/`r5a.xlarge` on AWS as the smallest node types. The `n1-highmem` machines don't have a 1:8 ratio between CPU and memory and shouldn't be considered, and the `n2` machines also have more performant CPUs. Only when GPUs are needed must we use `n1` rather than `n2`. Here `n` is 1 for a dedicated server, 4 for a 4th of a server, etc.

Please review with :+1:
I'd love for us to be actionable about this as I think it's a super important change, but for me to feel actionable I'd like to see agreement on the direction. @2i2c-org/engineering could you give a :+1: to this post if you are positive about the changes suggested, or leave a comment describing what you think?
If possible, please also opine on whether we should make use of AMD servers by default on GCP and AWS respectively. I'm not sure at all: Intel is more tested, but 30% savings on AWS is a big deal.
References

- AWS: `r5a.xlarge` / `r5a.4xlarge` / `r5a.16xlarge`
- GCP: `e2-highmem-4` / `e2-highmem-16` / `n2-highmem-64`
- Azure: `Standard_E4a_v4` / `Standard_E16a_v4` / `Standard_E64a_v4`
Motivation
In this ticket a user ended up wanting to try using a 64 CPU node, but we only had 16 CPU nodes set up by default. If we had 4 / 16 / 64 CPU nodes by default instead of 2 / 4 / 8 / 16, hub users would have a bit more flexibility.