2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

Enhance the `deployer` to make us more efficient at deploying hubs #1917

Closed GeorgianaElena closed 1 year ago

GeorgianaElena commented 1 year ago

I. Context

Now that the deployer has been refactored (https://github.com/2i2c-org/infrastructure/pull/1869), it has become easier to build on top of it.

We currently use the deployer to automate various manual tasks related to hub deployment and management, and with each option added to it, the engineering team has become more efficient 👉🏼 https://github.com/2i2c-org/infrastructure/blob/master/deployer/README.md

The deployer is a tool to help us, but it is too over-powered for someone outside our engineering team to use. The manual steps are important because that is how a hub admin would set up the hub without 2i2c involvement, from a Right to Replicate perspective. (Credit goes to Sarah for signal boosting this aspect.)

Personal experience

After deploying two similar clusters and hubs on AWS, I managed to build some muscle memory, and the time it took to deploy the second one was cut in half. But this still meant a day, which in my opinion is a lot.

I believe a significant amount of time was spent on:

  1. context switching between following the docs and doing tasks in the terminal, editor and various UIs (AWS, Namecheap, Grafana, GitHub)
  2. copying config over from other similar hubs, making sure all of the requested features are enabled, and not introducing unneeded customizations

II. Proposal

The deployer is ready for a new round of enhancements, maintenance and documentation to help us become more efficient at deploying new clusters and hubs. From the R2R perspective, we should also remove any config generation from it, while creating and maintaining clear documentation of the deploy workflow.

III. Specific action points mapped to the three categories above

1. ENHANCEMENTS

deployer.generate_cluster

- [x] Create template files for the **terraform** infrastructure files and add support to `deployer.generate_cluster` command to generate them https://github.com/2i2c-org/infrastructure/pull/1903
- [x] https://github.com/2i2c-org/infrastructure/pull/1937
- [ ] https://github.com/2i2c-org/infrastructure/issues/2187
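
As a rough illustration of the template-file approach described above, here is a minimal sketch that renders a `.tfvars` file from a template kept on disk. It is not the deployer's actual implementation: the function name `render_tfvars`, the template contents, and the variable names (`cluster_name`, `project_id`, `zone`) are all hypothetical, and the standard library's `string.Template` stands in for whatever templating engine the deployer really uses.

```python
import tempfile
from pathlib import Path
from string import Template


def render_tfvars(template_path: Path, output_path: Path, values: dict) -> str:
    """Render a tfvars template file (hypothetical helper) and write the result."""
    template = Template(template_path.read_text())
    # substitute() raises KeyError when the template references a placeholder
    # that was not supplied, so an incomplete set of values fails loudly.
    rendered = template.substitute(values)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(rendered)
    return rendered


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical template; real variable names would come from variables.tf.
        template_file = Path(tmp) / "cluster.tfvars.template"
        template_file.write_text(
            'prefix     = "$cluster_name"\n'
            'project_id = "$project_id"\n'
            'zone       = "$zone"\n'
        )
        output_file = Path(tmp) / "projects" / "demo-cluster.tfvars"
        print(
            render_tfvars(
                template_file,
                output_file,
                {
                    "cluster_name": "demo-cluster",
                    "project_id": "demo-project",
                    "zone": "us-central1-b",
                },
            )
        )
```

Because the template lives in its own file and the output is a plain `.tfvars` file, the same result could be produced by copying the template and editing it by hand.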

deployer.generate_hub

- [ ] https://github.com/2i2c-org/infrastructure/issues/1798
- [ ] Add a `deployer.generate_hub` command that uses the templates above and generates the files needed for adding a hub of one of those types to an existing cluster

Automate the manual UI tasks that we're doing right now

- [x] **managing grafana tokens**: https://github.com/2i2c-org/infrastructure/pull/1938
- [ ] **managing namecheap domains** ([for ref](https://www.namecheap.com/support/api/methods/))
- [ ] **managing GitHub OAuth apps** (have to check if this is possible, check out [this](https://www.7pace.com/blog/dynamic-oauth-application-github-enterprise-integrations))

2. MAINTENANCE and R2R

- [ ] https://github.com/2i2c-org/infrastructure/issues/1472
- [ ] https://github.com/2i2c-org/infrastructure/issues/1925
- [ ] https://github.com/2i2c-org/infrastructure/issues/1924
- [ ] https://github.com/2i2c-org/infrastructure/issues/2024
- [ ] https://github.com/2i2c-org/infrastructure/issues/970

3. DOCUMENTATION

sgibson91 commented 1 year ago

FYI, I don't think it's actually possible to programmatically create GitHub OAuth Apps from any sort of command line/REST API call - that is why we didn't broadly advertise the Teams/Orgs auth feature for a long time because it inherently requires manual setup and we wanted to judge how intrusive it was before 'going public'. However, I don't think it's that intrusive and it is such a useful feature (especially with teams-based restrictions on profiles) that I think it's worth eating the manual setup cost.

In general, I'm cautious about getting the deployer to automatically create everything for us. If the steps are "input a command -> use the output of a command", there's not a lot of motivation to learn what the command does. This is why I found the "create cluster" files section of the AWS setup process to be confusing. I trust the jsonnet files that the deployer spits out to be "correct", and if something is wrong when I run them, I don't have a good starting position to begin debugging. This is because the deployer "does magic" that I am not informed about. Only the person who wrote that part of the deployer knows about that magic.

I'd honestly just be happy with a folder of template files (tfvars, helm chart values, etc) that I hand copy and hand edit. Rather than increasing the complexity of the deployer to "do magic".

I also think about this from a Right to Replicate perspective. The deployer is a tool to help us, but it is too over-powered for someone outside our engineering team to use. The manual steps are important because that is how a hub admin would set up the hub without 2i2c involvement. I would prefer us to get our docs up to a state where the manual steps "flow" (which was my intention with the Hub Deployment Guide) and then work that up to some R2R docs that wouldn't need the deployer at all.

damianavila commented 1 year ago

This is a super important discussion that I think will benefit from some sync time. I added this one as a topic for the next week's Prod and Eng meeting on Tue 22nd.

yuvipanda commented 1 year ago

I agree that blindly trusting the output of a program is hard, as when it breaks it is very difficult to see why. I do think however, that @GeorgianaElena's work in https://github.com/2i2c-org/infrastructure/pull/1903 with GCP is actually a lot better than my current work with the jsonnet files - primarily because they are generating tfvar files that are providing terraform variables that are documented in variables.tf and aren't doing a lot of magic, while jsonnet is instead a few more layers of magic (translated into yaml, then used by eksctl to do random things that require a deeper understanding of EKS). I think the problem there is more our use of eksctl, which isn't integrated with terraform, rather than the generator itself.

I have opened https://github.com/2i2c-org/infrastructure/issues/1924 to get rid of the jsonnet. I believe even if we had copy pasted the jsonnet from a template, it wouldn't have made much of a difference in understanding, and we should really get rid of it :) And I agree we should keep an eye on making sure there isn't a lot of magic in the deployer generate commands, and it could always be manually generated too. This will also help a lot with the right to replicate parts, as once a .tfvars or .common.yaml is generated, the fact that it was generated and not copy pasted ceases to matter. I think we should get rid of all the config generation we do in the deployer for this reason.

Either way, I hope @GeorgianaElena's work in https://github.com/2i2c-org/infrastructure/issues/1924 can continue :)

yuvipanda commented 1 year ago

I also opened https://github.com/2i2c-org/infrastructure/issues/1925 to remove all the helm-related magic in our deployer, so that it functions equivalently to a `helm upgrade` command that is passed a list of appropriate YAML files.
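
To make that idea concrete, here is a minimal sketch of what "equivalent to a helm upgrade command passed a list of YAML files" could look like. The function name `build_helm_upgrade_command`, the release and chart names, and the file paths are hypothetical and are not taken from the deployer. Only standard Helm 3 flags are used (`--install`, `--namespace`, `--create-namespace`, `--values`), and the command is built and printed rather than executed.

```python
import shlex


def build_helm_upgrade_command(release, chart, namespace, values_files):
    """Build the argv for a plain `helm upgrade --install` call (hypothetical helper)."""
    cmd = [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace,
        "--create-namespace",
    ]
    # Helm merges values files left to right, with later files taking
    # precedence, so the caller's ordering is preserved.
    for values_file in values_files:
        cmd += ["--values", values_file]
    return cmd


if __name__ == "__main__":
    # All names and paths below are illustrative assumptions.
    command = build_helm_upgrade_command(
        release="staging",
        chart="helm-charts/basehub",
        namespace="staging",
        values_files=[
            "config/clusters/demo/common.values.yaml",
            "config/clusters/demo/staging.values.yaml",
        ],
    )
    # Print the command so it can be inspected or run by hand.
    print(shlex.join(command))
```

Printing the command rather than hiding it keeps the step reproducible by hand, which is in line with the Right to Replicate concern raised earlier in the thread.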

Another option to consider (but perhaps not block #1924 on) is to use https://github.com/cookiecutter/cookiecutter instead of writing our own.

yuvipanda commented 1 year ago

Finally, I think being able to consistently stand up a new hub in under 1h of human work is an awesome goal to shoot for in terms of reducing our own toil, and increasing automation there without falling prey to magic that we don't know how to fix when broken is definitely doable!

sgibson91 commented 1 year ago

Right, I just want us to recognise that the relationship between automation and efficiency is not necessarily a linear one, and they're not even the only two variables in the equation. We can improve the situation in other ways as well as making the deployer do certain steps for us.

sgibson91 commented 1 year ago

I'm happy to see in #1903 the decision was made to have template files that the deployer copies, renames, moves, reads, and writes. I was worried that these templates would become embedded in the Python files as mega-strings and would therefore reduce findability of the templates, especially for those who don't regularly use the deployer. (I don't even think the engineering team regularly inspects the deployer code, so I don't want us embedding knowledge there.)

This implementation detail alone reduces my concerns a lot.

yuvipanda commented 1 year ago

Yay, so glad to hear @sgibson91 :) I agree completely that keeping it as files and outside python is very important. That is also how the current aws jsonnet generator works - https://github.com/2i2c-org/infrastructure/blob/master/eksctl/template.jsonnet is the file being used as the source, with some template generation.

Hopefully with https://github.com/2i2c-org/infrastructure/issues/1925 we'll remove all of the hub config that's embedded in the deployer.

damianavila commented 1 year ago

> I added this one as a topic for the next week's Prod and Eng meeting on Tue 22nd.

I removed it from the meeting agenda because I believe we have an agreement and a path forward I really like!!

damianavila commented 1 year ago

We discussed making this issue a more fine-grained one (probably as part of the new goals for Q1).

GeorgianaElena commented 1 year ago

@2i2c-org/engineering, I updated the top comment with some more concrete action points and a categorization of the tasks. Some of the tasks, especially the ones related to creating templates for some of the files and integrating those into the deployer, might need more discussion in separate issues that are not yet created.

I believe the next tasks are now to:

pnasrat commented 1 year ago

If it is helpful, I'm happy to discuss and share my initial experiences/expectations re deployer as a new engineer. There are definitely things that have come to mind in the half a day I've been using it, and I am sure to have more!

GeorgianaElena commented 1 year ago

That would be extremely helpful @pnasrat! Do you want to open an issue to sketch these ideas and then have a sync chat about it, or the other way around? What would you prefer?

We could also use the Product and engineering meeting to discuss this if you'd like. There's one every Tuesday https://compass.2i2c.org/en/latest/reference/calendar.html, including today and the agenda is here.

GeorgianaElena commented 1 year ago

This was an issue tracking a quarterly goal from some time ago. Some tasks have been done and some are still to do, but since most of them are tracked in their own issues, I will close this issue.