We should have automated testing infrastructure that will prevent hub deployment if it fails
I think this needs some clarification. Currently, we run automated tests after we have deployed an upgrade to a hub. So this statement could either mean:
Canary deployments: Deploying a hub should be stopped if an automated test does not pass.
Again, clarification here. This should probably say "deploying to production-level hubs is halted if deployment to staging hubs fails".
Adding a new hub's configuration will trigger the deployment of a JupyterHub with that configuration on that cluster.
This already happens, since the CI/CD runs the deployer and the deployer uses the `--install` flag to create a deployment if it doesn't already exist. But we often don't do it this way in practice because, if deployment fails for some reason, we can currently only see the logs locally, not in CI/CD, since surfacing them there could leak secrets.
I think it would also be cool to add `test-this-pr` to this project to enable testing on staging hubs from a PR. This would flesh out the canary deployments.
I am very keen to champion this project :smile:
I'd like to be an assistant on this project!
This work requires changes to the deployer configuration file schema. I suggest two phases of this work.
Related to #908
We had an incident where unauthorized users could access a hub because the hub's configuration did not have `allowed_users` set (thus, JupyterHub allowed all users).
We should add some kind of validation for configuration that we know needs to be set. Maybe this is part of #279?
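To make this concrete, here is a minimal sketch of the kind of pre-deployment check that could catch this class of mistake. Everything in it is an assumption for illustration: the `config/clusters` layout, the values-file glob, and the `jupyterhub.hub.config.Authenticator` key path are not taken from the actual deployer.

```python
# Hypothetical pre-deployment guard: refuse to deploy when a hub's values
# never restrict who may log in. Paths and key names are illustrative only.
from pathlib import Path

import yaml


def dig(mapping: dict, *keys: str) -> dict:
    """Walk nested dicts, returning {} whenever a key is missing."""
    for key in keys:
        mapping = mapping.get(key) or {}
    return mapping


def allows_any_user(values_file: Path) -> bool:
    """True if this values file configures no allow-list and no admin list."""
    config = yaml.safe_load(values_file.read_text()) or {}
    auth = dig(config, "jupyterhub", "hub", "config", "Authenticator")
    return not (auth.get("allowed_users") or auth.get("admin_users"))


for values_file in Path("config/clusters").glob("**/*.values.yaml"):
    if allows_any_user(values_file):
        raise SystemExit(f"{values_file} does not restrict users, refusing to deploy")
```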
In our latest team meeting it seemed like @sgibson91 and @consideRatio were interested in focusing their efforts on this work for their next major focus area. To begin with they were going to work on https://github.com/2i2c-org/infrastructure/issues/857, and this effort will likely create new ideas for what else needs to be done in the CI/CD.
@consideRatio and @sgibson91 - how do you feel about scoping this effort for two weeks, and then doing a re-assessment and adjusting our scope then? The "related issues" section in the top comment has a few issues that could be tackled here, maybe we pick whatever subset makes sense from those, and try and get that done in the next 2 weeks? We can then decide what should go into subsequent cycles.
I think let's see how far we get in a month and then decide if what's left is worth another cycle?
That sounds good to me - but let's define what work we want to accomplish in that month. Of all the improvements that we could make towards CI/CD, what is a reasonable subset to shoot for? We can adjust over time as we learn more, but we'll be more effective if we have a target to aim towards.
Can @sgibson91 and @consideRatio scope out what they think would be the most impactful and realistic given that time frame?
I think definitely:
I've updated the top comment with these ideas so we can map out our next steps. I wonder if @sgibson91 would be willing to take a look in the top comment and decide if this is a reasonable scope for a one month timebox? We can also adjust the scope up or down depending on how quickly we can implement, but I think it's good to have a target to shoot for at least
https://github.com/2i2c-org/infrastructure/issues/903 is a small fix that I think would be easily tackled as part of this project too
@2i2c-org/tech-team we have outlined a proposal for the refactor of the file structure here: https://hackmd.io/@29EjZnuAQx-ME7Olgi2JXw/Sk5S7xP6F
I like the proposal!! Left a tiny comment but, off the top of my head, I do not see a blocker so I would be +1 on it.
I went through and left a few comments myself as well - in general I think it looks like a pretty clean and well-structured setup. A nice improvement!
Thanks for the comments so far! If there are no strong pushbacks in the next day or so, I'll begin implementing the batches of work outlined in the document.
I am going to start working on the following todos from the hackmd in order:
- Rename the `hub-templates` folder to `helm-charts`: https://github.com/2i2c-org/infrastructure/pull/953
- Rename the `hubs.*.template` key to be `hubs.*.helm_chart` in `*.cluster.yaml`: https://github.com/2i2c-org/infrastructure/pull/955
- `helm-charts` folder: https://github.com/2i2c-org/infrastructure/pull/959
- Rename `/config/hubs` to `/config/clusters`: https://github.com/2i2c-org/infrastructure/pull/963
- Rename the `hubs.*.config` key to be `hubs.*.helm_chart_values_files` and separate the helm chart values into individual files
- `/config/clusters` that contains all the helm chart values files for each deployment
- `*.cluster.yaml` file to be a `cluster.yaml` file under each cluster folder in `/config/cluster`
- `helm template`: https://github.com/2i2c-org/infrastructure/pull/1045
- Rename the `support.config` key to `support.helm_chart_values_files` and separate out the helm chart values from the cluster configuration: https://github.com/2i2c-org/infrastructure/pull/1047

If anyone has opinions on https://github.com/2i2c-org/infrastructure/issues/965 while I'm poking around in the deployer, I'd love to hear.
Update from a quick Slack chat with Erik: values files should be defined relative to the location of the `cluster.yaml` file, not the location from which the deployer is invoked.
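A rough sketch of what that resolution could look like in the deployer follows; the file path and key names here just show the pattern and are not the real implementation.

```python
# Resolve each hub's values files relative to the cluster.yaml that lists
# them, instead of relative to the deployer's current working directory.
from pathlib import Path

import yaml

cluster_file = Path("config/clusters/2i2c/cluster.yaml")  # illustrative path
cluster_config = yaml.safe_load(cluster_file.read_text())

for hub in cluster_config.get("hubs", []):
    values_paths = [
        (cluster_file.parent / values_file).resolve()
        for values_file in hub.get("helm_chart_values_files", [])
    ]
    # values_paths can now be handed to `helm upgrade` as repeated --values flags
```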
Drafting in here: https://github.com/2i2c-org/infrastructure/pull/985
How to deal with these sections is going to cause a headache. We don't know ahead of time which files contain the placeholder, so I can't see a way around this other than reading in every file in the list until you find the placeholder.
I'm not sure what "staff placeholders" means; there is no docstring for this function :cry:. I think, though, it's templating configuration for various values to add/remove 2i2c staff to various lists? Maybe that should be done in logic as part of the jupyterhub startup routine as a `hub.extraConfig` entry (like a `jupyterhub_config.py` snippet) instead?
It is replacing this:
with the list maintained here, for all hubs: https://github.com/2i2c-org/infrastructure/blob/master/config/clusters/staff.yaml
Okay, hmmm. I suggest we remove this logic from the deployer in favor of embedding such logic as a script snippet to run as part of the `jupyterhub_config.py` via `hub.extraConfig`.

- `custom.2i2c.staffUsers` as a list for all 2i2c staff accounts
- `custom.2i2c.addStaffUsersToAdminUsers` or similar boolean flag
- `hub.extraConfig.2i2c_addStaffUsersToAdminUsers` in basehub, with python code that appends to the `c.Authenticator.admin_users` list dynamically.

@2i2c-org/tech-team it would be great to have your input on a decision like this!
@2i2c-org/teams/tech-team it would be great to have your input on a decision like this!
Dealing with the staff situation gave me lots of headaches too :( As someone who already tried various ways of implementing this and failed a few times for various reasons (https://github.com/2i2c-org/infrastructure/pull/311), I will dare to suggest just going with the solution that makes life easier! The placeholder implementation was that kind of solution too. So, whatever makes more sense with the new refactored structure.
The script suggestion sounds good :+1: (though I remember trying and failing to implement something similar because I didn't know how to determine which authentication method was chosen (google/github) from within `hub.extraConfig`, and the `username_pattern` also seemed hard to determine :disappointed:)
Any additional information for the script in `hub.extraConfig` can be passed via the `custom` section, preferably as for example `custom.2i2c.<something>`, as that makes clear that it is 2i2c that has given this custom config its meaning, helping anyone down the line understand what relates to what.
Here is a reduced example of what I'm thinking.
```yaml
custom:
  2i2c:
    addStaffUsersToAdminUsers: true
    staffUsers:
      - GeorgianaElena
      - sgibson91
      - consideratio
    # in practice, we need to have a list of different types of accounts and
    # some logic to know what to use, perhaps by inspecting the kind of
    # c.JupyterHub.authenticator_class that has been configured, or via
    # explicit declaration in `custom.2i2c.staffUserListToUse` etc
hub:
  extraConfig:
    2i2c_addStaffUsersToAdminUsers: |
      # ...
      staff_users = get_config("custom.2i2c.staffUsers", [])
      add_staff_users_to_admin_users = get_config("custom.2i2c.addStaffUsersToAdminUsers", False)
      if add_staff_users_to_admin_users:
          c.JupyterHub.admin_users.extend(staff_users)
```
@consideRatio Are you suggesting we add the staff list in basehub too? The benefit of `staff.yaml` is that the list only has to be maintained in one place for all staff members to get access to all hubs, and I wanna make sure we preserve that.
Are you suggesting we add the staff list in basehub too?
Yes, I suggest both the script (`hub.extraConfig...`) and the staff list (`custom.2i2c...`) are still centralized, for example via basehub default values or via a dedicated helm chart values file that the various charts reference as one `--values <file>` passed as part of `helm upgrade` etc.
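As a rough sketch of the "one `--values <file>`" idea from the deployer's side, something like the following could layer a shared staff values file under each hub's own values; all file names here are invented for illustration.

```python
# Illustrative only: a helm upgrade call that layers a centralized staff
# values file between the chart defaults and the hub-specific values.
import subprocess

cmd = [
    "helm", "upgrade", "--install", "staging", "helm-charts/basehub",
    "--namespace", "staging",
    "--values", "config/clusters/2i2c/staff.values.yaml",    # shared 2i2c staff list
    "--values", "config/clusters/2i2c/staging.values.yaml",  # hub-specific overrides
]
subprocess.run(cmd, check=True)
```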
Any additional information for the script in hub.extraConfig can be passed via the custom section, preferably for example custom.2i2c.
That's a good idea! However, I still don't see how we can determine the auth type from within the script without duplicating/moving this info into the custom section (the staff list differs based on the auth type). https://github.com/2i2c-org/infrastructure/blob/701e4a78c46e55c90df453a56b2c25461ac20110/config/clusters/2i2c.cluster.yaml#L24
It might be that I'm missing something, though.
You are probably right about it being hard to decide what kind of user identity should be used. The authenticator may be against auth0 no matter what, but different usernames etc. should still be used.
Since that is configuration I understand to be about how the deployer script automatically sets things up with auth0, I suggest not entangling it with the choice of user lists, which may not relate to auth0.
Here is a config structure proposal, and perhaps a functional example, to add to the basehub default values, to be combined with each helm chart configuring its own dedicated values that include setting `custom.2i2c.add_staff_user_ids_to_admin_users` and `custom.2i2c.add_staff_user_ids_of_type`.
While they could rely on a default value in the basehub helm chart, I think it's better that they don't, to increase comprehensibility.
```yaml
custom:
  2i2c:
    # Should staff user ids be injected to the admin_users
    # configuration of the JupyterHub's authenticator by our
    # custom jupyterhub_config.py snippet as declared in hub.extraConfig?
    add_staff_user_ids_to_admin_users: false
    add_staff_user_ids_of_type: ""
    staff_github_ids:
      - choldgraf
      - consideRatio
      - damianavila
      - GeorgianaElena
      - sgibson91
      - yuvipanda
    staff_google_ids:
      - choldgraf@2i2c.org
      - erik@2i2c.org
      - damianavila@2i2c.org
      - georgianaelena@2i2c.org
      - sgibson@2i2c.org
      - yuvipanda@2i2c.org
hub:
  extraConfig:
    2i2c_add_staff_user_ids_to_admin_users: |
      add_staff_user_ids_to_admin_users = get_config("custom.2i2c.add_staff_user_ids_to_admin_users", False)
      if add_staff_user_ids_to_admin_users:
          user_id_type = get_config("custom.2i2c.add_staff_user_ids_of_type")
          staff_user_ids = get_config(f"custom.2i2c.staff_{user_id_type}_ids", [])
          c.JupyterHub.admin_users.extend(staff_user_ids)
```
I will dare to suggest to just go with the solution that makes life easier! The placeholder implementation was such type of solution too. So, whatever makes more sense with the new refactored structure.
I think that this is the right approach here. I can't speak to the specific proposal because I don't understand the deployer setup well enough, but I recall that in previous conversations, we were trying to balance "what works best for 2i2c" as well as "what makes it easy for others to replicate our setup from a R2R perspective".
At this point, I think it's more important that we make our lives easier and simpler. If it means there is some 2i2c special-casing in our deploying infrastructure, I think that's fine. I think it will be better to enable "Right to Replicate" by making sure the Z2JH / TLJH docs are good, and providing others some guidance in following them, rather than by choosing our own deployment setup so that it is copy/pasteable.
Not sure if this is a helpful comment or not, but thought I'd share in case this was a decision factor.
Hey all - since @sgibson91 is off next week, can we set a target for what we'd like to complete on this before the end of this week? Or are there things that @consideRatio would like to work on next week while Sarah is gone?
It seems like these are the main two PRs we've got active right now:
Is it realistic to get both of those in by EOD Friday?
Or are there things that @consideRatio would like to work on next week while Sarah is gone?
I have a full plate of work things to focus on already for at least two weeks. I'm not sure on the amount of work remaining to get those PRs merged, but I'd be happy to expedite review effort.
I think I should be able to have #988 ready for review/merge today and then can push as far as I can on #985. Theoretically, if there are no more blockers, it should be doable by the end of the week.
https://github.com/2i2c-org/infrastructure/pull/988 is ready for review and I'd like to merge it before continuing with #985 as there will be merge conflicts
Ok, next blocker 😅
This function expects to be able to read the helm chart values in order to see if the `docs_service` is enabled, and to generate a token for it. The `generated_config` never gets explicitly written to the helm chart values file; it is written to a temporary file and passed as another values file instead.

How do we establish which of the files passed via `helm_chart_values_files` contains the appropriate config for the `docs_service`? I'm thinking of the case where we have `staging` and `prod`, and the `prod` values file only provides minor changes on top of the `staging` values file. Is it OK to just assume `staging` has it in this case?
How do we establish which file being passed via helm_chart_values_files contains the appropriate config for the docs_service?
I think we must avoid a pattern where we configure the deployer script to look for a specific file. This is too magical and complicated.
I'm thinking of the following options: we coalesce the values files ourselves, like `helm` itself does, and then we read from the resulting version. Preferably we could ask helm to coalesce the values so it's always done in the helm way of doing it, but otherwise it could also be done manually. In other words, we would use a function like "read all helm values files -> coalesce them -> return a python object representing that", and then we read from that.

Can we make `helm` do the coalescing of values? Hmm... I know that if you use `--debug` when doing `helm upgrade` etc., you get to see all values. Can this be used with `helm template` as well? Can we extract it so we only see the coalesced values and not the rendered templates? Not sure.
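For the manual route, here is a small sketch of "read all helm values files -> coalesce them -> return a python object". Note that this recursive merge approximates, but is not guaranteed to exactly match, helm's own coalescing rules.

```python
# Manual coalescing of several values files: later files win, and nested
# mappings are merged recursively rather than replaced wholesale.
from pathlib import Path

import yaml


def coalesce(base: dict, override: dict) -> dict:
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = coalesce(merged[key], value)
        else:
            merged[key] = value
    return merged


def read_coalesced_values(values_files: list[Path]) -> dict:
    values: dict = {}
    for values_file in values_files:
        values = coalesce(values, yaml.safe_load(values_file.read_text()) or {})
    return values
```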
I think what I'd like most would be if we broke the DRY principle. That would help demystify the deployer script, which is too magic to me overall - doing things I don't keep track of it doing.
We have planned for something like...
```yaml
hubs:
  - name: staging
    domain: staging.us-central1-b.gcp.pangeo.io
    helm_chart: daskhub
    auth0:
      enabled: false
    helm_chart_values_files:
      - staging.values.yaml
```
What I'm thinking would help a lot with comprehensibility of what is done by the deployer script, and this situation, would be if we had like...
```yaml
hubs:
  - name: staging
    domain: staging.us-central1-b.gcp.pangeo.io
    helm_chart: daskhub
    auth0:
      enabled: false
    helm_chart_values_generated:
      docs_service_api_token: true
    helm_chart_values_files:
      - staging.values.yaml
```
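To illustrate how the deployer could act on an explicit `helm_chart_values_generated` section like the one above, here is a sketch; the schema and the `docs` service name are assumptions, and the generated config would still be written to a temporary file and passed as an extra values file, as happens today.

```python
# Sketch: only generate values that the cluster config explicitly asks for.
import secrets

# A hub entry as it might be parsed from cluster.yaml (illustrative)
hub_config = {
    "name": "staging",
    "helm_chart_values_generated": {"docs_service_api_token": True},
    "helm_chart_values_files": ["staging.values.yaml"],
}

generated_config: dict = {}
if hub_config.get("helm_chart_values_generated", {}).get("docs_service_api_token"):
    generated_config.setdefault("jupyterhub", {}).setdefault("hub", {}).setdefault(
        "services", {}
    )["docs"] = {"apiToken": secrets.token_hex(32)}
# generated_config is then dumped to a temporary file and passed to helm as
# one more --values file, alongside the files listed in helm_chart_values_files.
```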
Is token generation not a thing we could move into a basehub tpl file? Like: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/417e84e81101f5757c5bb735bd7daf249696c45f/jupyterhub/templates/hub/_helpers-passwords.tpl
@sgibson91 ooooh yes! Actually, by not specifying an API token, one is automatically generated by Z2JH! That means the basehub helm chart can simply reference it directly - and voila, no need to do anything but let it be accessed! :D
I'll see if I have written an example about this already in z2jh docs.
Yes I had! See https://zero-to-jupyterhub.readthedocs.io/en/latest/resources/reference.html#hub-services!
@sgibson91 hmmm... I'm confused though about who is using the token. Having an api_token for a registered JupyterHub service means that that service can be granted permission to speak with JupyterHub, but... I don't see any use of that token by the docs service - the docs service doesn't seem to use any such token...
Does it even need a token? I don't think so.
It could be that this is a legacy of a bug that was fixed, about needing an api_token specified for a service even though it really shouldn't be needed. So, an option perhaps is to simply not specify such a token at all any more, if it's only created for that purpose.
Found the relevant PR @consideRatio https://github.com/2i2c-org/infrastructure/pull/445 ~I don't think we figured out why we ended up needing it :confused:~
We actually did, @consideRatio just linked the relevant PR in the link above! Thanks Erik :tada:
Ah, the resolution was https://github.com/jupyterhub/jupyterhub/pull/3531, available in jupyterhub version 1.4.2, which is available in z2jh 1.1.0+, and we use version 1.1.3. We can safely just wipe the api token generation part.
Okay nice! So we just stop generating the api token part. Further, we have it written out to say `"url": f'http://docs-service.{self.spec["name"]}'`, where we specify the service by also being explicit about the namespace - but that may not be needed as long as the proxy pod is in the same namespace as the docs service. So, it can be hardcoded to `http://docs-service` I think.
And then, suddenly, we have no config related to a secret and no config related to the namespace etc. Voila - we can hardcode a default value in basehub just like is done for the configurator.
We should not specify a `command` for this docs_service though; we just want the proxy pod to be configured to proxy traffic to the docs service.
This is amazing work - thank you @consideRatio and @GeorgianaElena!!!
Do you think the tpl files would be a good thing to implement for the hub-health-service token and the dask-gateway api token? If so, we can get rid of the `apply_hub_helm_chart_fixes` function completely, which I think will greatly reduce complexity!!
I opened https://github.com/2i2c-org/infrastructure/pull/997 to hardcode the docs service
@sgibson91 by inspection and looking at ...
```python
# Generate a token for the hub health service
hub_health_token = hmac.new(
    secret_key, b"health-" + self.spec["name"].encode(), hashlib.sha256
).hexdigest()
# Describe the hub health service
generated_config.setdefault("jupyterhub", {}).setdefault("hub", {}).setdefault(
    "services", {}
)["hub-health"] = {"apiToken": hub_health_token, "admin": True}
```
I'm guessing that the hub-health token is the deployer script's way of granting itself permission to communicate with the hub health endpoint, or something? Or what does it do? I'm not sure. I don't think you would actually need an api token to ask the hub "are you alive". Is the purpose to ask the hub if it's alive?
Is the purpose to ask the hub if its alive?
We do ask if it's alive, but we also execute the notebooks in the `deployer/tests` directory, so it needs to start a server too.
Okay, so it's not really a hub health token, it's a JupyterHub token for the deployer script, which will spawn servers and such to verify functionality.
Hmmm... Well, you can generate it and set it explicitly like this, or you could define such a service and let the token be autogenerated by the z2jh helm chart, and then get it via `kubectl get secret` (with details to fetch the specific field) etc.
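If we went the autogeneration route, reading the token back out could look something like the following; the secret name `hub` and the `hub.services.<name>.apiToken` key reflect my understanding of how recent z2jh versions store generated service tokens, so treat them as assumptions to verify.

```python
# Assumption-laden sketch: fetch a z2jh-autogenerated service api token from
# the hub's k8s Secret instead of generating it in the deployer.
import base64
import json
import subprocess

namespace = "staging"  # illustrative namespace
secret_json = subprocess.run(
    ["kubectl", "get", "secret", "hub", "--namespace", namespace, "--output", "json"],
    check=True, capture_output=True, text=True,
).stdout
secret = json.loads(secret_json)
token = base64.b64decode(secret["data"]["hub.services.hub-health.apiToken"]).decode()
```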
Hmmm, but either way we will need to know namespace specifics, so it's probably not nicely separable from the deployer. In which case I'll probably leave it as it is for now, and we can come back to it later if we feel it's necessary.
@sgibson91 makes sense to me!
Regarding the dask-gateway token: we can make the helm chart expose it, but we need to make sure the dask-gateway server can read the jupyterhub api token it's given - for example by mounting a k8s secret as a file or environment variable.
I want to look into how things currently work in the dask-gateway helm chart and the dask-gateway-server software it runs. I'll update this comment to include a link to an issue in dask/dask-gateway I'll create soon.
EDIT: this is the feature request I made, https://github.com/dask/dask-gateway/issues/473
Right, I have gotten https://github.com/2i2c-org/infrastructure/pull/985 into a state that I believe is mergeable. There are outstanding actions that I would've liked to tackle in this PR, mainly to make sure the docs don't get out of sync, but it's likely not worth holding it up given that I'll be away for a week. I'll copy those todos here as well for tracking:
Also, holy 💩 we run a lot of hubs!
This is a pitch for a multi-sprint project - it uses a slightly different structure so that we can test out how it feels!
Problem statement
We have some basic hub deployment infrastructure set up. However this is missing a few important pieces of functionality:
Proposed solution
We should improve our hub CI/CD infrastructure with a few specific focus areas:
Here's a rough idea of how this could work:
Guidance for implementation
A few things that must exist in the final product:
Things to test for
A few ideas to include in testing infrastructure:
No-gos / out-of-scope
Related issues
These are related issues, and some (or all) may be in-scope to close as part of this project
Updates and ongoing work
2022-01-19
We had a team meeting today to agree on a few next steps for @sgibson91 and @consideRatio to tackle some of this. Here's a rough idea of what we'd like to get done:
I believe that @sgibson91 is championing this with @consideRatio providing support and collaboration!