2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

CI/CD infrastructure for hubs that is automated, tested, and parallelized #879

Closed · choldgraf closed this issue 2 years ago

choldgraf commented 2 years ago

This is a pitch for a multi-sprint project - it uses a slightly different structure so that we can test out how it feels!

Problem statement

We have some basic hub deployment infrastructure set up. However this is missing a few important pieces of functionality:

Proposed solution

We should improve our hub CI/CD infrastructure with a few specific focus areas:

Here's a rough idea of how this could work:

Guidance for implementation

A few things that must exist in the final product:

Things to test for

A few ideas to include in testing infrastructure:

No gos / out-of-scope

Related issues

These are related issues, and some (or all) may be in-scope to close as part of this project

Updates and ongoing work

2022-01-19

We had a team meeting today to agree on a few next steps for @sgibson91 and @consideRatio to tackle some of this. Here's a rough idea of what we'd like to get done:

I believe that @sgibson91 is championing this with @consideRatio providing support and collaboration!

sgibson91 commented 2 years ago

We should have automated testing infrastructure that will prevent hub deployment if it fails

I think this needs some clarification. Currently, we run automated tests after we have deployed an upgrade to a hub. So this statement could either mean:

Canary deployments: Deploying a hub should be stopped if an automated test does not pass.

Again, clarification here. This should probably say "deploying to production-level hubs is halted if deployment to staging hubs fails".

Adding a new hub's configuration will trigger the deployment of a JupyterHub with that configuration on that cluster.

This already happens since the CI/CD runs the deployer and the deployer uses the --install flag to create a deployment if it doesn't already exist. But we often don't do it in practice because, if for some reason deployment fails, we can currently only see logs locally, not in CI/CD, as secrets can be leaked.


I think it would also be cool to add test-this-pr to this project to enable testing on staging hubs from a PR. This fleshes out the canary deployments.


I am very keen to champion this project :smile:

consideRatio commented 2 years ago

I'd like to be an assistant on this project!

This work requires changes to the deployer configuration file schema. I suggest two phases of this work.

  1. Phase 1: Refactoring of the deployer script's configuration file schema
    1. Suggest what changes are required (@sgibson91 @consideRatio)
    2. Review, iterate, and agree on changes (@2i2c-org/tech-team)
    3. Implement refactoring changes (@sgibson91)
  2. Phase 2: Update CI system
    • Add helm template validation (#279) (@consideRatio does it, @sgibson91 assists)
    • Optimize deployment ordering (#818) (@sgibson91 does it, @consideRatio assists)
  3. At any time
    • I think #586 can be worked on at any time independently from the other tasks.

choldgraf commented 2 years ago

Related to #908

We had an incident where unauthorized users could access a hub because the hub's configuration did not have allowed_users set (thus, JupyterHub allowed all users).

We should add some kind of validation for certain configuration that we know needs to be set. Maybe this is part of #279 ?
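
For illustration, such a check could look roughly like the sketch below. This is only a sketch: the helper name, the values layout it inspects, and the use of PyYAML are assumptions, not the deployer's actual implementation.

    import yaml

    def check_allowed_users(values_file):
        """Fail before deploying if a hub's authenticator config would let
        any authenticated user in (no allowed_users and no admin_users)."""
        with open(values_file) as f:
            values = yaml.safe_load(f) or {}
        auth = (
            values.get("jupyterhub", {})
            .get("hub", {})
            .get("config", {})
            .get("Authenticator", {})
        )
        if not auth.get("allowed_users") and not auth.get("admin_users"):
            raise ValueError(
                f"{values_file}: no allowed_users/admin_users set - "
                "JupyterHub would allow any authenticated user to log in"
            )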

choldgraf commented 2 years ago

Update

In our latest team meeting it seemed like @sgibson91 and @consideRatio were interested in focusing their efforts on this work for their next major focus area. To begin with they were going to work on https://github.com/2i2c-org/infrastructure/issues/857, and this effort will likely create new ideas for what else needs to be done in the CI/CD.

@consideRatio and @sgibson91 - how do you feel about scoping this effort for two weeks, and then doing a re-assessment and adjusting our scope then? The "related issues" section in the top comment has a few issues that could be tackled here, maybe we pick whatever subset makes sense from those, and try and get that done in the next 2 weeks? We can then decide what should go into subsequent cycles.

sgibson91 commented 2 years ago

I think let's see how far we get in a month and then decide if what's left is worth another cycle?

choldgraf commented 2 years ago

That sounds good to me - but let's define what work we want to accomplish in that month. Of all the improvements that we could make towards CI/CD, what is a reasonable subset to shoot for? We can adjust over time as we learn more, but we'll be more effective if we have a target to aim towards.

Can @sgibson91 and @consideRatio scope out what they think would be the most impactful and realistic given that time frame?

sgibson91 commented 2 years ago

I think definitely:

choldgraf commented 2 years ago

Planning Meeting / Next steps

I've updated the top comment with these ideas so we can map out our next steps. I wonder if @sgibson91 would be willing to take a look at the top comment and decide if this is a reasonable scope for a one-month timebox? We can also adjust the scope up or down depending on how quickly we can implement, but I think it's good to have a target to shoot for at least.

sgibson91 commented 2 years ago

https://github.com/2i2c-org/infrastructure/issues/903 is a small fix that I think would be easily tackled as part of this project too

sgibson91 commented 2 years ago

@2i2c-org/tech-team we have outlined a proposal for the refactor of the file structure here: https://hackmd.io/@29EjZnuAQx-ME7Olgi2JXw/Sk5S7xP6F

damianavila commented 2 years ago

I like the proposal!! Left a tiny comment but, off the top of my head, I do not see a blocker, so I would be +1 on it.

choldgraf commented 2 years ago

I went through and left a few comments myself as well - in general I think it looks like a pretty clean and well-structured setup. A nice improvement!

sgibson91 commented 2 years ago

Thanks for the comments so far! If there are no strong pushbacks in the next day or so, I'll begin implementing the batches of work outlined in the document.

sgibson91 commented 2 years ago

I am going to start working on the following todos from the hackmd in order:

sgibson91 commented 2 years ago

If anyone has opinions on https://github.com/2i2c-org/infrastructure/issues/965 while I'm poking around in the deployer, I'd love to hear

sgibson91 commented 2 years ago

Update from a quick Slack chat with Erik: values files should be defined relative to the location of the cluster.yaml file, not the location the deployer is invoked from.
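
In code, that behaviour could look roughly like this (a minimal sketch; the function and variable names are illustrative, not the deployer's actual API):

    from pathlib import Path

    def resolve_values_files(cluster_yaml_path, values_files):
        """Resolve each values file against the directory containing
        cluster.yaml, not the current working directory of the caller."""
        cluster_dir = Path(cluster_yaml_path).parent
        return [(cluster_dir / f).resolve() for f in values_files]

    # e.g. values files listed in config/clusters/pangeo-hubs.cluster.yaml
    # resolve to paths inside config/clusters/, wherever the deployer runs.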

sgibson91 commented 2 years ago

Drafting in here: https://github.com/2i2c-org/infrastructure/pull/985

sgibson91 commented 2 years ago

How to deal with these sections is going to cause a headache. We don't know ahead of time which files contain the placeholder, so I can't see a way around this other than reading in every file in the list until you find the placeholder.

https://github.com/2i2c-org/infrastructure/blob/701e4a78c46e55c90df453a56b2c25461ac20110/deployer/utils.py#L49-L106
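
To make the constraint concrete, a brute-force scan would look something like the sketch below; the placeholder string and function name are hypothetical, not what deployer/utils.py actually does.

    def find_file_with_placeholder(values_files, placeholder="<<staff_list>>"):
        """Return the first values file that mentions the placeholder, or None.
        There is no way to know up front which file (if any) contains it, so
        every file has to be read in turn."""
        for path in values_files:
            with open(path) as f:
                if placeholder in f.read():
                    return path
        return None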

consideRatio commented 2 years ago

I'm not sure what "staff placeholders" means; there is no docstring for this function :cry:. I think, though, that it's templating configuration for various values to add/remove 2i2c staff to various lists? Maybe that should be done in logic as part of the JupyterHub startup routine, as a hub.extraConfig entry (like a jupyterhub_config.py snippet), instead?

sgibson91 commented 2 years ago

It is replacing this:

https://github.com/2i2c-org/infrastructure/blob/701e4a78c46e55c90df453a56b2c25461ac20110/config/clusters/pangeo-hubs.cluster.yaml#L78

with the list maintained here:

https://github.com/2i2c-org/infrastructure/blob/master/config/clusters/staff.yaml

for all hubs

consideRatio commented 2 years ago

Okay hmmm. I suggest we remove this logic from the deployer in favor of embedding such logic as a script snippet that runs as part of jupyterhub_config.py via hub.extraConfig.

  1. Use custom.2i2c.staffUsers as a list for all 2i2c staff accounts
  2. Use custom.2i2c.addStaffUsersToAdminUsers or similar boolean flag
  3. Use hub.extraConfig.2i2c_addStaffUsersToAdminUsers in basehub, with python code that appends to the c.Authenticator.admin_users list dynamically.

@2i2c-org/tech-team it would be great to have your input on a decision like this!

GeorgianaElena commented 2 years ago

@2i2c-org/teams/tech-team it would be great to have your input on a decision like this!

Dealing with the staff situation gave me lots of headaches too :( As someone who already tried various ways of implementing this and failed a few times for various reasons (https://github.com/2i2c-org/infrastructure/pull/311), I dare to suggest we just go with the solution that makes life easier! The placeholder implementation was that kind of solution too. So, whatever makes more sense with the new refactored structure.

The script suggestion sounds good :+1: (though I remember trying and failing to implement something similar because I didn't know how to determine which authentication method (google/github) was chosen from within hub.extraConfig, and the username_pattern also seemed hard to determine :disappointed:)

consideRatio commented 2 years ago

Any additional information for the script in hub.extraConfig can be passed via the custom section, preferably for example custom.2i2c.<something>, as that would make clear that it is 2i2c that has given this custom config its meaning, helping anyone down the line better comprehend what's related to what.

Here is a reduced example of what I'm thinking.

custom:
  2i2c:
    addStaffUsersToAdminUsers: true
    staffUsers:
      - GeorgianaElena
      - sgibson91
      - consideratio
    # in practice, we need to have a list of different types of accounts and
    # some logic to know what to use, perhaps by inspecting the kind of
    # c.JupyterHub.authenticator_class that has been configured, or via
    # explicit declaration in `custom.2i2c.staffUserListToUse` etc
hub:
  extraConfig:
    2i2c_addStaffUsersToAdminUsers: |
      # ...
      staff_users = get_config("custom.2i2c.staffUsers", [])
      add_staff_users_to_admin_users = get_config("custom.2i2c.addStaffUsersToAdminUsers", False)
      if add_staff_users_to_admin_users:
          c.JupyterHub.admin_users.extend(staff_users)

sgibson91 commented 2 years ago

@consideRatio Are you suggesting we add the staff list in basehub too? The benefit of staff.yaml is that the list only had to be maintained in one place for all staff members to get access to all hubs, and I wanna make sure we preserve that.

consideRatio commented 2 years ago

Are you suggesting we add the staff list in basehub too?

Yes, I suggest both the script (hub.extraConfig...) and the staff list (custom.2i2c...) remain centralized, for example via basehub default values or via a dedicated helm chart values file that the various charts reference as one --values <file> passed as part of helm upgrade etc.

GeorgianaElena commented 2 years ago

Any additional information for the script in hub.extraConfig can be passed via the custom section, preferably for example custom.2i2c.

That's a good idea! However, I still don't see how we can determine the auth type from within the script without duplicating/moving this info into the custom section (the staff list differs based on the auth type). https://github.com/2i2c-org/infrastructure/blob/701e4a78c46e55c90df453a56b2c25461ac20110/config/clusters/2i2c.cluster.yaml#L24

It might be that I'm missing something though

consideRatio commented 2 years ago

You are probably right about it being hard to decide what kind of user identity should be used. The authenticator may be against auth0 no matter what, but different usernames etc. should still be used.

Since that is configuration which, as I understand it, controls how the deployer script automatically sets things up with auth0, I suggest not entangling it with the choice of user lists, which may not relate to auth0 anyway.

Here is a config structure proposal, and perhaps a functional example, to add to the basehub default values, to be combined with each helm chart configuring its own dedicated values that set custom.2i2c.add_staff_user_ids_to_admin_users and custom.2i2c.add_staff_user_ids_of_type.

While they could rely on a default value in the basehub helm chart, I think it's better that they don't, to increase comprehensibility.

custom:
  2i2c:
    # Should staff user ids be injected into the admin_users
    # configuration of the JupyterHub's authenticator by our
    # custom jupyterhub_config.py snippet as declared in hub.extraConfig?
    add_staff_user_ids_to_admin_users: false
    add_staff_user_ids_of_type: ""
    staff_github_ids:
      - choldgraf
      - consideRatio
      - damianavila
      - GeorgianaElena
      - sgibson91
      - yuvipanda
    staff_google_ids:
      - choldgraf@2i2c.org
      - erik@2i2c.org
      - damianavila@2i2c.org
      - georgianaelena@2i2c.org
      - sgibson@2i2c.org
      - yuvipanda@2i2c.org
hub:
  extraConfig:
    2i2c_add_staff_user_ids_to_admin_users: |
      add_staff_user_ids_to_admin_users = get_config("custom.2i2c.add_staff_user_ids_to_admin_users", False)
      if add_staff_user_ids_to_admin_users:
          user_id_type = get_config(f"custom.2i2c.add_staff_user_ids_of_type")
          staff_user_ids = get_config(f"custom.2i2c.staff_{user_id_type}_ids", [])
          c.JupyterHub.admin_users.extend(staff_user_ids)

choldgraf commented 2 years ago

I dare to suggest we just go with the solution that makes life easier! The placeholder implementation was that kind of solution too. So, whatever makes more sense with the new refactored structure.

I think that this is the right approach here. I can't speak to the specific proposal because I don't understand the deployer setup well enough, but I recall that in previous conversations, we were trying to balance "what works best for 2i2c" as well as "what makes it easy for others to replicate our setup from a R2R perspective".

At this point, I think it's more important that we make our lives easier and simpler. If it means there is some 2i2c special-casing in our deploying infrastructure, I think that's fine. I think it will be better to enable "Right to Replicate" by making sure the Z2JH / TLJH docs are good, and providing others some guidance in following them, rather than by choosing our own deployment setup so that it is copy/pasteable.

Not sure if this is a helpful comment or not, but thought I'd share in case this was a decision factor.

choldgraf commented 2 years ago

Hey all - since @sgibson91 is off next week, can we set a target for what we'd like to complete on this before the end of this week? Or are there things that @consideRatio would like to work on next week while Sarah is gone?

It seems like these are the main two PRs we've got active right now:

Is it realistic to get both of those in by EOD Friday?

consideRatio commented 2 years ago

Or are there things that @consideRatio would like to work on next week while Sarah is gone?

I already have a full plate of work to focus on for at least two weeks. I'm not sure about the amount of work remaining to get those PRs merged, but I'd be happy to expedite review effort.

sgibson91 commented 2 years ago

I think I should be able to have #988 ready for review/merge today and then can push as far as I can on #985. Theoretically, if there are no more blockers, it should be doable by the end of the week.

sgibson91 commented 2 years ago

https://github.com/2i2c-org/infrastructure/pull/988 is ready for review and I'd like to merge it before continuing with #985 as there will be merge conflicts

sgibson91 commented 2 years ago

Ok, next blocker 😅

This function expects to be able to read the helm chart values in order to see if the docs_service is enabled and generate a token for it.

https://github.com/2i2c-org/infrastructure/blob/fcd83405e42847b7b322d7d91b94e84a21c0d2b1/deployer/hub.py#L504-L511

The generated_config never gets explicitly written to the helm chart values file. It is written to a temporary file and passed as another values file instead.

https://github.com/2i2c-org/infrastructure/blob/fcd83405e42847b7b322d7d91b94e84a21c0d2b1/deployer/hub.py#L564-L595

How do we establish which of the files passed via helm_chart_values_files contains the appropriate config for the docs_service? I'm thinking of the case where we have staging and prod, and the prod values file only provides minor changes on top of the staging values file. Is it ok to just assume staging has it in this case?

consideRatio commented 2 years ago

How do we establish which of the files passed via helm_chart_values_files contains the appropriate config for the docs_service?

I think we must avoid a pattern where we configure the deployer script to look for a specific file. This is too magical and complicated.

I'm thinking of the following options:

  1. We break the DRY principle, so that the deployer script is configured via the cluster.yaml config to add a temporary helm chart values file with this secret.
  2. We figure out a way to coalesce all the values files just like helm itself does, and then we read from the resulting version. Preferably we could ask helm to coalesce the values so it's always done the helm way, but otherwise it could also be done manually. In other words, we would use a function like "read all helm values files -> coalesce them -> return a python object representing that", and then we read from that (a rough sketch of this follows below).

Can we make helm do the coalescing of values? Hmm... I know that if you use --debug when doing helm upgrade etc, you get to see all values. Can this be used with helm template as well? Can we extract it so we get to see only the coalesced values and not the rendered templates? Not sure.
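
If helm can't hand us the coalesced values directly, option 2 could be approximated in the deployer itself. A rough sketch, assuming PyYAML and a simplified merge (helm's own precedence rules may differ in edge cases such as null handling):

    import yaml

    def coalesce_values(values_files):
        """Deep-merge values files in order, later files taking precedence,
        roughly mimicking what helm does with multiple --values flags."""
        def merge(base, overrides):
            for key, value in overrides.items():
                if isinstance(value, dict) and isinstance(base.get(key), dict):
                    merge(base[key], value)
                else:
                    base[key] = value
            return base

        merged = {}
        for path in values_files:
            with open(path) as f:
                merge(merged, yaml.safe_load(f) or {})
        return merged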

consideRatio commented 2 years ago

I think what I'd like most would be if we broke the DRY principle. That would help demystify the deployer script, which is too magical to me overall - doing things I don't keep track of it doing.

We have planned for something like...

hubs:
  - name: staging
    domain: staging.us-central1-b.gcp.pangeo.io
    helm_chart: daskhub
    auth0:
      enabled: false
    helm_chart_values_files:
      - staging.values.yaml

What I'm thinking would help a lot with the comprehensibility of what the deployer script does, and with this situation, would be something like...

hubs:
  - name: staging
    domain: staging.us-central1-b.gcp.pangeo.io
    helm_chart: daskhub
    auth0:
      enabled: false
    helm_chart_values_generated:
      docs_service_api_token: true
    helm_chart_values_files:
      - staging.values.yaml
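
On the deployer side, honouring such a flag could be as small as the sketch below. Everything here is hypothetical: the key names mirror the cluster.yaml snippet above, and the service name and token helper are illustrative only.

    import secrets

    def generated_values_for_hub(hub_spec):
        """Build extra helm values only for config that the cluster.yaml
        explicitly asks the deployer to generate, so nothing happens implicitly."""
        generated = {}
        wanted = hub_spec.get("helm_chart_values_generated", {})
        if wanted.get("docs_service_api_token"):
            generated.setdefault("jupyterhub", {}).setdefault("hub", {}).setdefault(
                "services", {}
            )["docs"] = {"apiToken": secrets.token_hex(32)}  # service name is illustrative
        return generated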

sgibson91 commented 2 years ago

Is token generation not a thing we could move into a basehub tpl file? Like: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/417e84e81101f5757c5bb735bd7daf249696c45f/jupyterhub/templates/hub/_helpers-passwords.tpl

consideRatio commented 2 years ago

@sgibson91 ooooh yes! Actually, by not specifying an API token, one is automatically generated by Z2JH! That means the basehub helm chart can simply reference it directly - and voila, no need to do anything but let it be accessed! :D

I'll see if I have written an example about this already in z2jh docs.

Yes I had! See https://zero-to-jupyterhub.readthedocs.io/en/latest/resources/reference.html#hub-services!

consideRatio commented 2 years ago

@sgibson91 hmmm... I'm confused though about who is using the token. To have an api_token for a registered JupyterHub service means that that service can be granted permission to speak with JupyterHub, but... I don't see any use of that token by the docs service - the docs service doesn't seem to use any such token...

Does it even need a token? I don't think so.

consideRatio commented 2 years ago

It could be that this is a legacy of a bug that was fixed, about needing an api_token specified for a service even though it really shouldn't be needed. So, an option perhaps is to simply not specify such a token at all any more if it's only created for that purpose.

GeorgianaElena commented 2 years ago

Found the relevant PR @consideRatio https://github.com/2i2c-org/infrastructure/pull/445 ~I don't think we figured out why we ended up needing it :confused:~

We actually did, @consideRatio just linked the relevant PR in the link above! Thanks Erik :tada:

consideRatio commented 2 years ago

Ah, the resolution was https://github.com/jupyterhub/jupyterhub/pull/3531, available in jupyterhub version 1.4.2, which is available in z2jh 1.1.0+, and we use version: 1.1.3. We can safely just wipe the api token generation part.

Okay nice! So we just stop generating the api token part. Further, we have it written out to say... "url": f'http://docs-service.{self.spec["name"]}', where we specify the service by also being explicit about the namespace - but that may not be needed as long as the proxy pod is in the same namespace as the docs service. So, it can be hardcoded to http://docs-service I think.

And, then, suddenly we have no config related to a secret, and no config related to the namespace etc. Voila - we can hardcode a default value in basehub just like is done for configurator:

https://github.com/2i2c-org/infrastructure/blob/4baa7a7a2bcc5caf48558269ac2dbf3090301d89/helm-charts/basehub/values.yaml#L272-L279

We should not specify a command for this docs_service though; we just want the proxy pod to be configured to proxy traffic to the docs service.

sgibson91 commented 2 years ago

This is amazing work - thank you @consideRatio and @GeorgianaElena!!!

Do you think the tpl files would be a good thing to implement for the hub-health-service token and the dask-gateway api token? If so, we can get rid of the apply_hub_helm_chart_fixes function completely, which I think will greatly reduce complexity!!

sgibson91 commented 2 years ago

I opened https://github.com/2i2c-org/infrastructure/pull/997 to hardcode the docs service

consideRatio commented 2 years ago

@sgibson91 by inspection and looking at ...

        # Generate a token for the hub health service
        hub_health_token = hmac.new(
            secret_key, b"health-" + self.spec["name"].encode(), hashlib.sha256
        ).hexdigest()
        # Describe the hub health service
        generated_config.setdefault("jupyterhub", {}).setdefault("hub", {}).setdefault(
            "services", {}
        )["hub-health"] = {"apiToken": hub_health_token, "admin": True}

I'm guessing that the hub-health token is the deployer script's way of granting itself permission to communicate with the hub health endpoint, or something? Or what does it do? I'm not sure. I don't think you would actually need an api token to ask the hub "are you alive". Is the purpose to ask the hub if it's alive?

sgibson91 commented 2 years ago

Is the purpose to ask the hub if it's alive?

We do ask if it's alive, but we also execute the notebooks in the deployer/tests directory, so it needs to start a server too

https://github.com/2i2c-org/infrastructure/blob/fdfbce56ca42215636faf2a2343caf18a51cfa62/deployer/tests/test_hub_health.py#L62-L76
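
For context, that token is what lets the test client authenticate against the JupyterHub REST API. A much-simplified sketch (not the actual test code) of the "is it alive" part might look like:

    import requests

    def hub_is_alive(hub_url, api_token):
        """Ping the JupyterHub REST API; a 200 from /hub/api means the hub is up.
        The admin-scoped token is what additionally lets the tests start user
        servers and run the notebooks against them."""
        resp = requests.get(
            f"{hub_url}/hub/api",
            headers={"Authorization": f"token {api_token}"},
        )
        return resp.status_code == 200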

consideRatio commented 2 years ago

Okay, so it's not really a hub health token; it's a JupyterHub token for the deployer script, which will spawn servers and such to verify functionality.

Hmmm... Well, you can generate it and set it explicitly like this, or you could define such a service and let the token be autogenerated by the z2jh helm chart, and then get it via kubectl get secret (with the details needed to fetch the specific field) etc.

sgibson91 commented 2 years ago

Hmmm, but either way we will need to know namespace specifics, so it's probably not nicely separable from the deployer. In which case I'll probably leave it as it is for now, and we can come back to it later if we feel it's necessary.

consideRatio commented 2 years ago

@sgibson91 makes sense to me!

Regarding the dask-gateway token: we can make the helm chart expose it, but we need to make sure the dask-gateway server can read the jupyterhub api token it's given - for example by mounting a k8s secret as a file or environment variable.

I want to look into how things currently work in the dask-gateway helm chart and the dask-gateway-server software it runs. I'll update this comment to include a link to an issue in dask/dask-gateway that I'll create soon.

EDIT: this is the feature request I made: https://github.com/dask/dask-gateway/issues/473

sgibson91 commented 2 years ago

Right, I have gotten https://github.com/2i2c-org/infrastructure/pull/985 into a state that I believe is mergeable. There are outstanding actions that I would've liked to tackle in this PR, mainly to make sure docs don't get out of sync, but it's likely not worth it given that I'll be away for a week. I'll copy those todos here as well for tracking:

sgibson91 commented 2 years ago

Also, holy 💩 we run a lot of hubs!