
Move each hub to its own nodegroup on the openscapes cluster #4482

Closed yuvipanda closed 4 months ago

yuvipanda commented 4 months ago

After the outcome of the spike in https://github.com/2i2c-org/infrastructure/issues/4465, we are going to give each hub its own nodepool that is properly tagged to track cost on a per-hub basis.
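As a rough sketch, a per-hub nodegroup could look like this in eksctl-style config (the nodegroup name, region, instance type, sizes, and tag/label keys here are illustrative assumptions, not the final values):

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: openscapes
  region: us-west-2
nodeGroups:
  # One nodegroup per hub; repeat for each hub on the cluster.
  - name: nb-staging
    instanceType: r5.xlarge
    minSize: 0
    maxSize: 100
    labels:
      2i2c/hub-name: staging     # kubernetes label: usable as a pod nodeSelector
    tags:
      "2i2c:hub-name": staging   # cloud tag: what AWS cost reports see
```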

Definition of done

Trade-offs

Since our health check triggers a user spawn, deploying to all of the hubs will now spawn 3 separate nodes (one per hub) instead of 1. This is fine - the autoscaler reclaims them after ~10 minutes, and even with the largest nodes that doesn't cost enough to be a problem.
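(For reference, that ~10 minute window matches cluster-autoscaler's default scale-down timers; a sketch assuming it is deployed via its helm chart's `extraArgs`, with the documented defaults shown:)

```yaml
extraArgs:
  scale-down-unneeded-time: 10m     # node must be unneeded this long before removal
  scale-down-delay-after-add: 10m   # cool-down after a scale-up before scale-down resumes
```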

Out of scope

dask-gateway is out of scope here, and handled by https://github.com/2i2c-org/infrastructure/issues/4485

sgibson91 commented 4 months ago

I think there's a language problem here (and in #4486) of tags vs. labels, both of which exist. As I understand it, tags operate at the cloud vendor level, but labels can be used as selectors at the kubernetes level. If we want pods to be spun up in specific node pools, we definitely want to be using labels. But I don't know if the cost-tracking system we are going to use will be looking at cloud tags or kubernetes labels.

In the end, neither is costly to apply, so I will probably just do both.
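To make the distinction concrete: only the kubernetes label is visible to the scheduler, so steering a pod onto a hub's nodegroup would look like this (a sketch, assuming the `2i2c/hub-name` label key):

```yaml
spec:
  nodeSelector:
    2i2c/hub-name: staging   # matches the node's kubernetes label; cloud tags play no part here
```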

sgibson91 commented 4 months ago

I'm attempting this, but I have no idea where to put the node_selector value `2i2c/hub-name: <hub-name>`. I've had to copy the whole profile list of image options out of the common values file, because kubespawner_override.node_selector is a true override and doesn't merge with singleuser.nodeSelector. Also, helm overwriting lists means I can't merge config that way either. #4499 represents what I've tried for staging, but it doesn't work.
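A hypothetical reconstruction of the shape of that attempt (the real config is in #4499; the profile and image names here are made up): putting node_selector inside each kubespawner_override means the shared profileList can no longer stay in the common values file:

```yaml
singleuser:
  profileList:
    # copied wholesale out of the common values file just to add node_selector
    - display_name: Python
      kubespawner_override:
        image: example/python-image:latest   # hypothetical image
        node_selector:
          2i2c/hub-name: staging
```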

yuvipanda commented 4 months ago

> kubespawner_override.node_selector is a true override and doesn't merge with singleuser.nodeSelector

If they are dictionaries (rather than lists), they should merge (since https://github.com/jupyterhub/kubespawner/pull/650). So your instinct to put it in singleuser.nodeSelector is correct. You can also try hub.config.KubeSpawner.node_selector, although it should behave the same as singleuser.nodeSelector.
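In values-file terms, that suggestion is just (a sketch, assuming the `2i2c/hub-name` label key from above):

```yaml
singleuser:
  nodeSelector:
    2i2c/hub-name: staging
# Per-profile kubespawner_override.node_selector dicts are merged into this
# (since jupyterhub/kubespawner#650) rather than replacing it.
```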

> it doesn't work.

Can you provide more detail?

sgibson91 commented 4 months ago

[screenshot attached]

This is using my first instinct to add singleuser.nodeSelector. We're basically not triggering the new node pool(s) at all. Currently deployed config is in #4499
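Some hypothetical checks that might narrow this down (assuming a standard z2jh deployment where user pods carry the `component=singleuser-server` label, and a `staging` namespace):

```sh
# Did the selector actually reach the spawned pod?
kubectl get pod -n staging -l component=singleuser-server \
  -o jsonpath='{.items[*].spec.nodeSelector}'

# Does any node advertise the matching label?
kubectl get nodes -L 2i2c/hub-name
```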