choldgraf commented 2 years ago

This is a pitch for a multi-sprint project - it uses a slightly different structure so that we can test out how it feels!

Problem statement

We have some basic hub deployment infrastructure set up. However this is missing a few important pieces of functionality:

We cannot deploy hubs in parallel, which means that one hub can bottleneck all others, and that updating one hub means updating all other hubs
Our testing infrastructure can be flaky, which makes it hard to distinguish a "true failure" from a "false failure"
Our configuration lumps all hubs into a single config file, which makes it harder to find and refer to one hub's configuration.

Proposed solution

We should improve our hub CI/CD infrastructure with a few specific focus areas:

Hubs should be deployed in parallel with one another in CI (not the deployer)
We should have automated testing infrastructure that will prevent hub deployment if it fails
Configuration should be done via a GitHub repository, and deployment should be automatically triggered

Here's a rough idea of how this could work:

Each cluster has a folder that contains configuration for each hub on the cluster
Each hub gets its own configuration file, which is a valid and standalone Z2JH config.
- Maybe there is another shared configuration file that each hub can draw from for common settings across hubs.
Adding a new hub's configuration will trigger the deployment of a JupyterHub with that configuration on that cluster.
We deploy staging hubs for each hub. We run some kind of test on the staging hub for common workflows, and if it fails then the production hub for that community does not deploy.
We have a "source of truth" for the hubs that we're running that is easy to glance through, which provides important information needed to maintain them, as well as a rough status of each.

Guidance for implementation

A few things that must exist in the final product:

Canary deployments: Deploying a hub should be stopped if an automated test does not pass.

Things to test for

A few ideas to include in testing infrastructure:

Unauthorized users cannot log in
Authorized users can spin up an interactive server properly
Run a notebook designed for the base environment top to bottom without errors
Run a notebook designed for a Dask Hub environment top to bottom without errors

No gos / out-of-scope

This should not become a deep dive into deploying Kubernetes on different cloud providers, unless we really think there's something crucial to solve there that would enable the deployment of hubs. Instead we should just assume there's a k8s cluster available and focus this on deploying hubs.

Related issues

These are related issues, and some (or all) may be in-scope to close as part of this project

[x] https://github.com/2i2c-org/infrastructure/issues/586

Updates and ongoing work

2022-01-19

We had a team meeting today to agree on a few next steps for @sgibson91 and @consideRatio to tackle some of this. Here's a rough idea of what we'd like to get done:

[x] @sgibson91 and @consideRatio (+ others if interested) schedule a time to speak and come up with a proposal for the configuration structure that we want.
- Ask "what features are we hoping / planning to support with 2i2c?" "What configuration structure will allow us to support those features?"
[x] Share the proposal with others and decide next steps
[x] https://github.com/2i2c-org/infrastructure/issues/857
[x] https://github.com/2i2c-org/infrastructure/issues/937
[x] https://github.com/2i2c-org/infrastructure/issues/279
[x] https://github.com/2i2c-org/infrastructure/issues/818

I believe that @sgibson91 is championing this with @consideRatio providing support and collaboration!

sgibson91 commented 2 years ago

I opened https://github.com/2i2c-org/infrastructure/issues/1009 to track the breakages to our list of running hubs page in the docs as the refactor takes place

sgibson91 commented 2 years ago

Refactoring proposal

At the minute, the deployer has a lot of hard-coded paths and we are particularly reliant on the config/clusters directory existing. To increase flexibility of how we may choose to structure our repo in the future, and make it easier to lift the deployer out into a standalone package, I am going to suggest some changes in convention to our file structure.

The cluster.yaml file is the only file the deployer should calculate the filepath for. Other files required to deploy a given hub should be explicitly defined relative to the position of the cluster.yaml file. We have already implemented this in a simple fashion. All files under helm_chart_values_files are relative to the cluster.yaml file they are listed in because they reside in the same folder.
We should move config around as needed to fit convention 1 in the most logical manner without traversing a lot of directories up/down where possible. We want to strictly avoid (but not enforce) overly-complicated relative filepaths, such as:
```
helm_chart_values_files:
 - this/is/a/directory/file.values.yaml
 - ../../../../../who/knows/where/this/is/file.values.yaml
```
While the first option could be considered reasonable since the filepath is explicit, the second one can become confusing due to the repetition of the .. operator. If we find ourselves writing complicated relative filepaths like this, we should reassess and move that config to somewhere more logical.
In the spirit of convention 2, we should no longer house our secrets under a specific folder and mimic directory structure under it. Instead, we should move all config for a cluster, encrypted or not, into the same place and use filenames to track if it should be encrypted and whether is encrypted or not. This convention would involve:
- Changing sops regex from "anything under the secrets directory to any filename containing secret
- No longer use the --in-place flag for sops. Specifically rename an encrypted file to have the enc- prefix. This would look something like this:
```
# Encrypting a file
sops --encrypt file.secret.yaml > enc-file.secret.yaml
# Decrypting a file
sops --decrypt enc-file.secret.yaml > file.secret.yaml
```
  We can then do two things with the deployer script: 1) identify a file that needs to be handled by sops depending on the enc- prefix, and 2) check a enc--prefixed file contains the sops keys, if not, raise an error that the file may not have been properly encrypted and the secret should be regenerated in case it has been checked into version control. This mimics what sops itself does: throws an error if it cannot find the sops key when decrypting a file. Some pseudo-code looks like:
```
if FILENAME.startswith("enc-"):
# Read in file WITHOUT decrypting first
encrypted_contents = read_yaml(FILENAME)
if "sops" not in encrypted_contents.keys():
   raise Error(
       "Expecting to find the 'sops' key in this encrypted file but it wasn't found! Please regenerate this secret if it has been checked into version control!"
   )
else:
   # Decrypt the file and read the decrytped contents
   return decrypted_contents
else:
# Do nothing - file should not be encrypted
```

Applying these conventions, an example file tree would look like:

config
|-- clusters
|   |-- 2i2c
|   |   |-- cluster.yaml  <-- cluster-wide config file
|   |   |-- enc-deployer-credentials.secret.yaml  <-- encrypted secret file
|   |   |-- staging.values.yaml  <-- hub-specific helm chart values for a hub called "staging"

consideRatio commented 2 years ago

@sgibson91 I love it, all ideas makes sense to me!

I'm adding some motivation for a convention to have secret in the filename of files with sensitive content that we want to encrypt, and motivation for also having an enc- prefix to indicate a file is encrypted.

We can detect these files in general via their filename pattern, and that can be helpful in various situations when we want to give them special treatment without needing to inspect their content etc.

We can for example use .gitignore to ensure we don't commit non-encrypted secrets. This would be done like this, I think.

# To clean up ignored files you can use: git clean -Xfd
# Delete untracked files (-X), with required confirmation (-f), recursively (-d).
#
# This section handles sensitive content that is or isn't encrypted in our repo.
#
# Ignore files (but not directories) with secret in their name
*secret*
!*secret*/
# Don't ignore enc- prefixed files
!enc-*
# Don't ignore Helm chart templates with secret in their name as
# they are not supposed to contain sensitive content but can often
# have secret in their name.
!**/templates/**/*secret*.yaml

GeorgianaElena commented 2 years ago

+1 from me! I like the new suggested struct! Thanks for thinking these things through :blossom: :rocket:

choldgraf commented 2 years ago

This looks good to me as well, and it seems reasonable to have all configuration + secrets for a cluster "under one roof".

One question is how this will relate to the multi-cluster spawner work that @yuvipanda has been prototyping. I guess in this case, you would have a parent cluster that has the hub itself, and this cluster would contain configuration files for the spawner clusters where people could create interactive sessions. Does that sound right?

consideRatio commented 2 years ago

I guess in this case, you would have a parent cluster that has the hub itself, and this cluster would contain configuration files for the spawner clusters where people could create interactive sessions. Does that sound right?

I think the technical changes made as part of this issue is not going to lock into a practice that restricts us later, but rather make them more flexible. If we need new conventions in the future, this is still a nudge in the right direction.

At this point I struggle to carve out conclusions influencing this work by the thought experiment of imagining us maintaining a multicluster spawner deployment.

sgibson91 commented 2 years ago

IIRC the hub for the multi-cluster spawner doesn't even need to be on a k8s clusters, it can be a plain old VM because the hub part is the only bit it needs to run there.

But the point is, these changes we make now are being made in such a way that we can be flexible about config location in the future without needing to rewrite the deployer too much and also without making decisions about potential future config right now.

choldgraf commented 2 years ago

Sounds good to me 👍

damianavila commented 2 years ago

+1 on the last @sgibson91 proposal

sgibson91 commented 2 years ago

Merging #1027 caused a couple of bugs. I'm trying to think about fixing number 2 listed here: https://github.com/2i2c-org/infrastructure/pull/1027#issuecomment-1054196676 Copying the content here so we can continue discussion more linearly.

For pangeo-hubs, uwhackweeks, and utoronto, there is an issue reading in the secret helm chart values files. Error:

1 / 2: Deploying hub staging...
Getting updates for unmanaged Helm repositories...
...Successfully got an update from the "https://jupyterhub.github.io/helm-chart/" chart repository
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "kvaps" chart repository
...Successfully got an update from the "jetstack" chart repository
Update Complete. ⎈Happy Helming!⎈
Saving 1 charts
Downloading jupyterhub from repo https://jupyterhub.github.io/helm-chart/
Deleting outdated charts
Running helm upgrade --install --create-namespace --wait --namespace=staging staging /Users/sgibson/source/github/2i2c-org/pilot-hubs/helm-charts/basehub --values=/var/folders/hb/3kw2n13d17l7rnljn9jgplr00000gn/T/tmp14nt1lt2 --values=/Users/sgibson/source/github/2i2c-org/pilot-hubs/config/clusters/utoronto/staging.values.yaml --values=/var/folders/hb/3kw2n13d17l7rnljn9jgplr00000gn/T/tmpwru2e5dp
**Error: open /var/folders/hb/3kw2n13d17l7rnljn9jgplr00000gn/T/tmpwru2e5dp: no such file or directory**
Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/envs/pilot-hubs/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/Caskroom/miniconda/base/envs/pilot-hubs/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/sgibson/source/github/2i2c-org/pilot-hubs/deployer/__main__.py", line 324, in <module>
    main()
  File "/Users/sgibson/source/github/2i2c-org/pilot-hubs/deployer/__main__.py", line 307, in main
    deploy(
  File "/Users/sgibson/source/github/2i2c-org/pilot-hubs/deployer/__main__.py", line 215, in deploy
    hub.deploy(k, SECRET_KEY, skip_hub_health_test)
  File "/Users/sgibson/source/github/2i2c-org/pilot-hubs/deployer/hub.py", line 611, in deploy
    subprocess.check_call(cmd)
  File "/usr/local/Caskroom/miniconda/base/envs/pilot-hubs/lib/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['helm', 'upgrade', '--install', '--create-namespace', '--wait', '--namespace=staging', 'staging', PosixPath('/Users/sgibson/source/github/2i2c-org/pilot-hubs/helm-charts/basehub'), '--values=/var/folders/hb/3kw2n13d17l7rnljn9jgplr00000gn/T/tmp14nt1lt2', '--values=/Users/sgibson/source/github/2i2c-org/pilot-hubs/config/clusters/utoronto/staging.values.yaml', '--values=/var/folders/hb/3kw2n13d17l7rnljn9jgplr00000gn/T/tmpwru2e5dp']' returned non-zero exit status 1.

The problem is that verify_and_decrypt_file is a context manager, so as soon as the with statement below closes, the temporary file is destroyed and can no longer be read when it comes to executing the helm command

https://github.com/2i2c-org/infrastructure/blob/d75d66d09319c2c963e742afab5e9e72026f2f90/deployer/hub.py#L561-L573

sgibson91 commented 2 years ago

Before we had code that looked like this:

https://github.com/2i2c-org/infrastructure/blob/3d9b57dfdb2ab4d5a53ddb7d555ddf8f7afbb9ed/deployer/hub.py#L564-L572

This ensured that the temporary files still existed when it came to executing the helm upgrade command, but hard-coded how many secret files we were expecting.

I'm not sure how to address this 😕

consideRatio commented 2 years ago

@sgibson91 I think you may need an ExistStack. You can't use with on a unknown amount of contextmanagers, but you can wrap the unknown amount of contextmanagers.

Here is an example that you could use, I think, I haven't tested this.

@contextmanager
def _get_file_paths_and_decrypt_them_first_if_needed(files):
    """
    This is a context manager that when entered provides a list of
    file paths, where temporary files has been created if needed for
    files that was encrypted and first needed to be decrypted.
    """
    with ExitStack() as stack:
        yield [
            stack.enter_context(verify_and_decrypt_file(f))
            if f.startswith("enc-") or ("secret" in f) else
            f
            for values_file in self.spec["helm_chart_values_files"]
        ]

I think this code is a bit ugly, and that perhaps the verify_and_decrypt_file function can be remade to allow a non-encrypted file to be passed as well. I figure for example you could have the functions get_decrypted_file and get_decrypted_files.

@contextmanager
def get_decrypted_files(files):
    """
    This is a context manager that when entered provides a list of
    file paths, where temporary files has been created if needed for
    files that was encrypted and first needed to be decrypted.
    """
    with ExitStack() as stack:
        yield [
            stack.enter_context(get_decrypted_file(f))
            for f in self.spec["helm_chart_values_files"]
        ]

857 is close to closeable. It's definitely closeable if we don't consider the support chart refactor part of that issue

sgibson91 commented 2 years ago

This is the more granular to-do list I am currently working from https://github.com/2i2c-org/infrastructure/issues/879#issuecomment-1024144966

choldgraf commented 2 years ago

That's helpful, thanks! So the question we need to answer then is how to prioritize these three things:

Against other development work that we need to get started soon (for example, setting up BinderHub infrastructure, or improving the story around the configurator+user environment tracking). I would love to hear people's thoughts on how we should prioritize. We can discuss in the meeting on Thursday as well.

sgibson91 commented 2 years ago

We definitely need #818 - that was the whole point of undergoing this refactoring work!

choldgraf commented 2 years ago

So it sounds like you and @consideRatio suggest that we should prioritize https://github.com/2i2c-org/infrastructure/issues/818 over the other projects we could be working on. Is that correct? And if so, can we time-box ourselves so that we have a checkpoint for deciding what to do next? What would be a reasonable estimate to shoot for?

I just want to make sure that this effort is clearly-scoped enough that we are making the right decisions about our own bandwidth and effort, given that there are many other things to do as well. Over time, other things are going to become increasingly important for us to tackle (running a BinderHub for Pangeo is a good example of this)

consideRatio commented 2 years ago

If work on these three issues is done in sequence, it should be #818 -> #937 -> #279 I think, but it can also be #818 + #937 in parallel followed by #279 finally.

818 is absolutely most work.

choldgraf commented 2 years ago

If we think it's realistic, then I think it'd be great if we could do

In parallel:

and finally

https://github.com/2i2c-org/infrastructure/issues/279

Is that realistic? E.g. could you and @sgibson91 tag-team these? Or would we need another team member to help out with them?

(by the way, apologies if I'm coming across as pedantic here! I am just trying to make sure we're weighing this project against our other options to make sure we're having the most impact!)

sgibson91 commented 2 years ago

I would love if we could do the tag-team variation on this. But also, I think completing #818 then pausing and coming back to #937 and #279 is plausible. But I don't really wanna give them up completely, because it's really foundational work that will increase the amount of impact we'll have beyond any single BinderHub for a single community.

consideRatio commented 2 years ago

I'd like to postpone work #937 until i feel I've managed other work that feels pressing unrelated to 2i2c infra: z2jh/kubespawner/dask-gateway releases and associated changes for those + some jmte maintenance work.

Happy to do #937 at some point in time, but would prefer not going for it at this point in time.

But I don't really wanna give them up completely, because it's really foundational work that will increase the amount of impact we'll have beyond any single BinderHub for a single community.

I'm not sure if I understand you now. But z2jh, binderhub, and dask-gateway has bundled value file schema's already. For #937 I'm specifically thinking that the charts defined by 2i2c should get schemas for the values as well.

sgibson91 commented 2 years ago

I'm not sure if I understand you now.

Sorry. I mean that all the work encapsulated in this project issue is foundational to the impact 2i2c can have because it's related to the reliability, robustness and sustainability of our infrastructure at large. So we can be more successful over a longer time period by investing in this kind of work, compared to the impact deploying a single BinderHub for a single community would have. But I know that's a hard sell to the people handing over their money.

choldgraf commented 2 years ago

I think the main benefit of this work is both ourselves (because we will spend less time fighting unexpected fires, or wrapping our heads around confusing config structure) and the communities we serve (because of the reasons @sgibson91 describes). So the prioritization question is basically "how much of an impact will the remaining work have on our own efficiency and robustness".

I think our question should be "where can we spend 10 hours now to save ourselves 100 hours of work later on?"

On the BinderHub side, it's worth noting that we have other BinderHub-style opportunities coming up as well. GESIS would like to fund some BinderHub-style work (https://github.com/2i2c-org/leads/issues/56) and CalTech would also like to collaborate on a BinderHub grant (https://github.com/2i2c-org/leads/issues/62). So this is not a one-off, it is towards building a foundation for running BinderHubs for others as well. (and nonetheless, as @sgibson91 we are contractually obligated to run this BinderHub, so we need to set it up eventually)

If @consideRatio doesn't have the bandwidth to help with https://github.com/2i2c-org/infrastructure/issues/937 right now, then I think we have two options:

Re-direct capacity from @yuvipanda / @damianavila / @GeorgianaElena to focus on https://github.com/2i2c-org/infrastructure/issues/937
Agree to post-pone https://github.com/2i2c-org/infrastructure/issues/937 and focus this issue on https://github.com/2i2c-org/infrastructure/issues/818

Does that sound right? It also sounds like folks are leaning towards option 2 above, as long as we track the validation stuff well-enough that we know we'll get to it later.

I think the reason not to do this is if we think that not validating our charts will potentially generate a big problem for us in the future. But I don't have a great intuition for this right now

sgibson91 commented 2 years ago

Does that sound right? It also sounds like folks are leaning towards option 2 above, as long as we track the validation stuff well-enough that we know we'll get to it later.

Yes, I agree this work is important enough to come back to after we address some of the more pressing development issues, and not close it completely.

I think the reason not to do this is if we think that not validating our charts will potentially generate a big problem for us in the future. But I don't have a great intuition for this right now

Not having this kind of validation makes debugging failed helm upgrades more difficult and involved than it needs to be - that could cover a range of problems from minor to big.

consideRatio commented 2 years ago

Not having this kind of validation makes debugging failed helm upgrades more difficult and involved than it needs to be - that could cover a range of problems from minor to big.

Sometimes just knowing we have validation to catch an issue, makes you not consider that could be an issue - which would save time even though there isn't an issue related to how values were provided.

I love it when my trust in something grows to the extent that I don't need to worry about it. I think perhaps the main benefit of this would be that we can relax a bit with regards to review of changes even though we would also catch mistakes quicker.

consideRatio commented 2 years ago

I created #1045 to address both #937 and #279.

consideRatio commented 2 years ago

2i2c team meeting notes:

Sarah plans improvements to start to finish in two weeks

sgibson91 commented 2 years ago

Hey @2i2c-org/tech-team! The next step of this issue is to provide logic in the deployer that can decide which clusters/hubs need a helm upgrade based off changed filepaths (from a PR, say). I've been experimenting with this idea and have put together a test repository and given you all collaborator access: https://github.com/sgibson91/helm-upgrade-decision-logic

Please feel free to spend some time opening and self-merging PRs to that repo and checking the workflow logs to see if the jobs you were expecting to trigger were indeed triggered! I'll try to write some tests for it as well, and when we're happy it works, I'll migrate the code into the deployer and begin refactoring our deploy-hubs workflow.

sgibson91 commented 2 years ago

I migrated the Python code from the above repo into the following PR https://github.com/2i2c-org/infrastructure/pull/1093

damianavila commented 2 years ago

We agreed, at the planning meeting, that this one is now complete! Congrats @sgibson91 and @consideRatio who were the ones pushing this one forward!!

choldgraf commented 2 years ago

wohooo! what an impressive leap forward for our infrastructure! @sgibson91 I wonder if you'd be interested in working with me on a short blog post that explains some of these changes? there were some folks on twitter who were intrigued about our GitHub actions :-)

sgibson91 commented 2 years ago

@choldgraf yep!

sgibson91 commented 2 years ago

I opened https://github.com/2i2c-org/team-compass/issues/411 to track the blog post

2i2c-org / infrastructure

CI/CD infrastructure for hubs that is automated, tested, and parallelized #879

Problem statement

Proposed solution

Guidance for implementation

Updates and ongoing work

2022-01-19

Refactoring proposal

Related reading

857 is close to closeable. It's definitely closeable if we don't consider the support chart refactor part of that issue

818 is absolutely most work.