Closed: bolliger32 closed this 3 years ago
This is some great work, @bolliger32! And it warms my heart to see PRs with unit tests in them.
You mention "The get_*_cluster commands should be the exact same for almost all use cases." Are there any gotchas or use cases that we might need to watch for with this PR?
One potential change on quick look: you have a `TODO: More tests` in HISTORY. I don't know if that's cruft or intentional, or maybe this is still a work in progress and I need to back off.
Regardless, thanks for putting this together!
Haha thanks!
As for the `get_*_cluster` commands, I think the main difference I see is that not all of the kwargs are available with the current setup of dask-gateway. They all could be; the main difference is that the cluster config interface is customized within the helm chart. So basically, we can expose any options we want at deployment time. In theory, if we hadn't already gotten people using the `rhgk.get_cluster` toolkit, I'd say we should just use dask-gateway directly, because you can do basically the same thing by properly configuring which cluster options you expose to users.
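To illustrate the "expose options at deployment time" point, here is a rough sketch of dask-gateway's server-side cluster options (a config fragment only; the resource numbers are made up, and this is not our actual helm chart config):

```python
# Sketch of dask-gateway server config (e.g. supplied through the helm
# chart's extraConfig): expose only a named worker-size choice to users.
from dask_gateway_server.options import Options, Select

def options_handler(options):
    # Hypothetical resource mapping; real values would live in the helm chart.
    sizes = {"micro": 1, "standard": 2, "big": 4, "giant": 8}
    cores = sizes[options.worker_size]
    return {"worker_cores": cores, "worker_memory": f"{6 * cores} G"}

c.Backend.cluster_options = Options(
    Select(
        "worker_size",
        ["micro", "standard", "big", "giant"],
        default="standard",
        label="Worker size",
    ),
    handler=options_handler,
)
```

Anything not exposed through an `Options` field like this simply isn't settable by users at cluster-creation time, which is why some of the old kwargs are unavailable.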
A few specific differences:

- Some kwargs are not exposed, like `scaling_factor` or cpu/memory sizes. I chose to just allow four choices of worker size (what we previously called micro, standard, big, giant), but we could allow the others if we wanted to. It yells at you if you are on a notebook image that has dask-gateway installed but try to use one of these kwargs.
- There is no longer a `worker_template.yml` file, so the kwarg that points you to a different template is no longer an option.
- Running `dask.config.update(some_config_dict)` before calling one of the `get_cluster` commands. It would not be hard to include this in the gateway `get_cluster` function, as this is actually upstream of instantiating either a `KubeCluster` or a `GatewayCluster`.

One other concern with dask-gateway of note: you do still end up passing credentials over the cluster, which I would imagine has some potential security issues. I don't really know of a better way to do this, and we were already doing this via the `GCSFUSE_TOKENS` env var. One concern with our current dask-gateway deployment is that if someone displays the HTML repr of a cluster options object in the notebook, it has a text field that contains their credentials. I don't think this information ever makes it into the JSON of the notebook, so it wouldn't get saved in a repo accidentally, but it does seem a little strange. I don't see a great way to pass credentials that would avoid this, and users would never interact with one of those options objects unless they are bypassing the rhg_compute_tools framework to use dask-gateway directly. So I don't think it's a big issue, but it's something I wanted to flag in case you thought it was a bigger one.
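The "yells at you" behavior for unsupported kwargs could look something like this. A minimal sketch under assumed names (`validate_gateway_kwargs`, the set of disallowed kwargs), not the actual rhg_compute_tools code:

```python
# Illustrative only: reject kwargs that the dask-gateway backend does not
# expose, and collapse worker sizing to the four named choices.
_GATEWAY_SIZES = {"micro", "standard", "big", "giant"}
_UNSUPPORTED_ON_GATEWAY = {"scaling_factor", "cpu", "memory_gb", "template_path"}

def validate_gateway_kwargs(worker_size="standard", **kwargs):
    """Raise if kwargs unavailable on a dask-gateway image are passed."""
    bad = _UNSUPPORTED_ON_GATEWAY & set(kwargs)
    if bad:
        raise ValueError(
            f"kwargs not supported with the dask-gateway backend: {sorted(bad)}"
        )
    if worker_size not in _GATEWAY_SIZES:
        raise ValueError(f"worker_size must be one of {sorted(_GATEWAY_SIZES)}")
    return worker_size
```

A caller passing `scaling_factor=2` on a gateway image would get an immediate `ValueError` instead of a silently ignored option.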
As for the TODO... I was originally hoping to spin up an actual kubernetes cluster inside the CI (using k3s) and then deploy dask-gateway on the kubernetes cluster (rather than using a local backend for dask-gateway). This would allow for more thorough testing that options were getting passed correctly; some options don't actually have an effect in the local cluster (like changing the assigned worker memory size, for some reason, which confuses me). But that seems like a fair amount of work and potentially something to do in a future PR. In fact, I'll file an issue for that now :) (#88)
@brews and @delgadom now that we're using this branch of rhg_compute_tools on all servers, I think we should merge it in. I know @brews gave it a look a couple months ago and I don't think there are any outstanding issues, but I wanted to check in before I hit merge.
And then I think it's worth a v0.3.0 tag when this is merged in.
@brews would love your post-holiday take on the GH actions I've set up to migrate from travis, as I'm still figuring these out. I've got two workflows. One for every push and pull request, which builds and runs tests. And then one for every release, which just builds and pushes to pypi.
I'm definitely learning both GH Actions and python package publishing as I go, so let me know if things don't look right. I think I've tested out the release workflow, though, and it seems to successfully push to PyPI. Right now there's a v1.0.0a2 release up there which was uploaded when I created a release.
I did not include any testing in the release workflow b/c I assume that would only be implemented following a merge into master, so all of the tests would already have been run. But we could change this if you think better to include tests there too.
Just two files to look at really - the two under .github/workflows
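For context, the push/pull-request workflow probably looks roughly like this. This is a sketch only; the file name, action versions, Python version, and exact steps are assumptions, not the PR's actual contents:

```yaml
# .github/workflows/test.yml (illustrative sketch)
name: Build and test
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.8"
      - run: pip install -e . flake8 pytest
      - run: flake8 rhg_compute_tools tests docs
      - run: pytest
```

The release workflow would swap the trigger for `on: release: {types: [published]}` and replace the lint/test steps with a package build plus a `twine upload` step using a PyPI token stored in the repo secrets.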
Merged in order to be able to mark current images as stable. The GH actions could still use a look over; we can update those via a follow-up PR if needed.
`flake8 rhg_compute_tools tests docs`
Reworked some of the `kubernetes` module to allow for the use of dask-gateway when it is available (currently only set up to run with a coastal image on adrastea). There will be some related PRs to the helm-chart and docker_images repos.

Tried to keep the user-facing API as similar as possible. The `get_*_cluster` commands should be the exact same for almost all use cases.

Testing is somewhat limited b/c not all of the options you can pass to a cluster object with a kubernetes backend are available for a local backend (which is what is used on travis for the tests). But I think it's a start.
I think there was a black update that changed the autoformatting behavior a bit. I ran black on everything and it changed a few other files as well (sorry about that). I had thought black had already been run on everything, but the version change may have introduced a change in behavior.
EDIT 12/10/20: Added "tag" argument to supersede #80