RhodiumGroup / helm-chart

Helm chart for Rhodium's JupyterHub deployment
https://rhodiumgroup.github.io/helm-chart/

Using dask-gateway #14

Closed bolliger32 closed 3 years ago

bolliger32 commented 3 years ago

Update the config to use the daskhub chart (the pangeo chart we have been using is an old version, and that entire chart is now deprecated in favor of daskhub).

I roughly tried to follow how the pangeo clusters are currently set up: https://github.com/pangeo-data/pangeo-cloud-federation

The main benefits of this switch are:

  1. dask-gateway seems to be where things are moving. dask-kubernetes is not deprecated, but it seems like most people are using gateway now.
  2. We have been following the pangeo cloud model in general, but have been stuck at the last version of their helm chart that used dask-kubernetes. Updating allows us to stay more current.
  3. dask-gateway allows for more robust remote schedulers. You can do this in dask-kubernetes now, and we had experimented with it a bit, but in dask-gateway every scheduler runs in its own pod, and overall it seems like a more straightforward option.

The downside: it's a lot of moving pieces

This is currently deployed on adrastea. I have updated rhg_compute_tools to handle dask-gateway (https://github.com/RhodiumGroup/rhg_compute_tools/pull/87) and have updated the coastal image to work with dask-gateway. Next I'm going to work on creating a branch from master that works on gateway too.
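For reference, the client-side workflow through the gateway looks roughly like the sketch below. It only uses the public dask_gateway API; I'm assuming the chart injects the gateway address and auth into the single-user environment's dask config (as daskhub normally does), so nothing here is specific to our deployment:

```python
from dask_gateway import Gateway

# Connect to the gateway. Assumes the gateway address/auth are picked up
# from the dask config injected into the single-user environment.
gateway = Gateway()

# Each new cluster gets its own scheduler pod on the gateway side; the
# client only ever talks to the gateway's proxy.
cluster = gateway.new_cluster()
cluster.scale(2)  # or cluster.adapt(minimum=0, maximum=10)

client = cluster.get_client()
print(client.dashboard_link)
```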

I did not make any changes to impactlab_config.yml (only jupyter_config.yml). I'm not totally sure how the impactlab config is used so I didn't want to touch that yet.

Here's a list of things that are being done in the pangeo repo that I did not implement but that we might want to think about at some point. Just jotting them down here as a flag:

  1. Using hubploy. This way, we could wrap deployment to the staging and prod clusters into the CI. It would also mean keeping the API tokens in GitHub secrets, which could be more robust in the long run than the Dropbox Paper doc. In general, I think it would just make the deployment steps easier.
  2. Using prometheus - honestly I don't really know much about what this is... I think it's some sort of Kubernetes monitoring system? Not sure if it's useful to us at all - just flagging that it's something they are now using.
  3. Using NFS for user storage. This is something they started a while ago and might be worth looking into. I think it would allow more adaptive user storage sizing, so we aren't carving out a fixed amount of storage per user regardless of what they are using.
brews commented 3 years ago

Have you run anything dask-y through the gateway, yet? :-)

bolliger32 commented 3 years ago

So far I have run `client.gather(client.map(lambda x: x + 1, [2, 3]))` :) gonna start testing a bit more now
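A slightly fuller end-to-end check might look something like the sketch below - purely illustrative (the worker count and array size are arbitrary, and it assumes the image's default gateway configuration):

```python
import dask.array as da
from dask_gateway import Gateway

gateway = Gateway()
cluster = gateway.new_cluster()
cluster.scale(2)
client = cluster.get_client()

# A small distributed computation to confirm the workers are doing real work.
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())

cluster.shutdown()
```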

delgadom commented 3 years ago

I don't understand why everyone hates on our ancestral-wisdom-based cluster upgrade system. it's super fun and only breaks like 70% of the time!

brews commented 3 years ago

A few more comments on @bolliger32's questions from this PR description. I agree that they're not things we should deal with in this PR, but they're definitely worth bringing up (thank you @bolliger32!).

delgadom commented 3 years ago

@brews the NFS store is just for the user directories - replacing the SSDs. Right now each user is constrained to the size of the SSD we allocate for them, which is 10GB. An NFS store allows flexibility there - we would obviously have to watch out for users storing 100 TB on it, but that's not the intention. It also (potentially?) helps us with recovering/migrating user data from one cluster to another. Does that allay your fears, or am I not understanding your concern?

brews commented 3 years ago

> @brews the NFS store is just for the user directories - replacing the SSDs. Right now each user is constrained to the size of the SSD we allocate for them, which is 10GB. An NFS store allows flexibility there - we would obviously have to watch out for users storing 100 TB on it, but that's not the intention. It also (potentially?) helps us with recovering/migrating user data from one cluster to another. Does that allay your fears, or am I not understanding your concern?

Ahhhh, I gotcha. This makes me feel better. I have some detail questions, but those are for another time. I'm game to hammer this out.

bolliger32 commented 3 years ago

> A few more comments on @bolliger32's questions from this PR description. I agree that they're not things we should deal with in this PR, but they're definitely worth bringing up (thank you @bolliger32!).

>   • Using hubploy: Ehhh. Personally I'm not crazy about a custom app just to manage CI/CD for another custom app. A big part of the problem is that we're combining infra provisioning with deploying/configuring an app that has a large number of complex components. So, it's not just a jupyterhub deployment. I've got something in mind to clean up our hubs, but I need to find time because it likely involves fixes to other things. I know I keep saying that, and I know talk is cheap. I do agree there is a better way to do this, and I think there are more general tools available to handle the problem.

Awesome! Yeah, especially after you explained that we could have a tool that does deployment across multiple clusters (not all of which are jupyterhubs), it makes sense to try to use something like that rather than hubploy. Regardless, I do think it would be nice to get deployment integrated into the CI (but I also don't think we're in dire need of that immediately or anything). Looking forward to getting a chance to check out your Pulumi stuff!

>   • prometheus - This is for monitoring k8s clusters. We get (loosely) comparable functionality because we're using Google-managed kubernetes clusters. I'd argue we shouldn't worry about this until we have a really good reason to worry about it. I don't think our users really use the monitoring tools we already have (they're getting better, though).

Sweet - based on that explanation and learning a bit more about how they are using it, I'd totally agree.

>   • NFS - Maybe. Be careful about expenses and security with this - I generally really like NFS for apps and automated work, but I'm under the impression that using something like Google Filestore can be $$$ for the kind of data we handle. There might be a way around this? @bolliger32, you mention adaptive volumes in the description above, but are there any other benefits you're interested in?

Yeah - what @delgadom said :)