2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

[EPIC] Good Enough Cloud Cost Billing Engineering Automation #3958

Closed yuvipanda closed 4 months ago

yuvipanda commented 7 months ago

The goal of this epic is to implement just enough cloud billing automation that we can be unstressed for another year. It is derived from https://github.com/2i2c-org/infrastructure/issues/3764 and our understanding of CS&S's requirements. This should be 'good enough' for at least the next 12 months.

Good Enough Automation Process

There are two parts to sending a cloud bill:

  1. Figuring out how much to bill (2i2c Engineering)
  2. Sending the bill (CS&S)

Both of these are currently manual. The goal for the 'good enough automation' is to automate (1), and not touch (2) at all.

Desired Process

The desired process is:

  1. In the first week of each quarter, an engineer is assigned to read a runbook (step by step document) that describes what they should do
  2. This runbook directs them to run a few commands in the infrastructure repository, which will generate a few new Google Sheet links with all the cloud costs to be recovered (both shared and dedicated)
  3. This runbook directs them to share these links with CS&S through a well defined mechanism (email to a known address, or slack)

This epic is going to be about building this runbook, as well as the commands that will generate this information.
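As a rough illustration of step 1, the runbook could include a small helper that works out which months belong to the quarter that just ended. This is only a sketch under stated assumptions: the function name `previous_quarter` is hypothetical, quarters are assumed to be calendar quarters (January to March, and so on), and months are assumed to be written as `YYYY-MM`. None of these are confirmed by the existing deployer code.

```python
from datetime import date


def previous_quarter(today: date) -> tuple[str, str]:
    """Return (start_month, end_month) for the calendar quarter before `today`.

    Hypothetical helper: assumes calendar quarters and a YYYY-MM month format.
    """
    # Index (0..3) of the quarter that `today` falls in.
    current_q = (today.month - 1) // 3
    if current_q == 0:
        # In Q1, the previous quarter is Q4 of the previous year.
        year, prev_q = today.year - 1, 3
    else:
        year, prev_q = today.year, current_q - 1
    start_month = prev_q * 3 + 1
    end_month = start_month + 2
    return f"{year}-{start_month:02d}", f"{year}-{end_month:02d}"


if __name__ == "__main__":
    start, end = previous_quarter(date.today())
    print(f"Bill for months {start} through {end}")
```

An engineer running this in the first week of a quarter would get the start and end months to pass to the cost-generation commands, rather than working them out by hand.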

Current status

Currently, we have some code in https://github.com/2i2c-org/infrastructure/tree/main/deployer/commands/generate/billing that generates billing information for dedicated GCP clusters and shared GCP clusters. This updates an existing Google Sheet. This needs to be cleaned up and better set up.

Subtasks

We have two axes here - cluster type (dedicated vs shared), as well as cloud provider (AWS vs GCP). We can pretend Azure doesn't exist for now.

Dedicated Clusters on GCP

This should be the first task, as we already have a lot of good infrastructure for this.

Definition of done

Running the deployer command deployer generate cost-table gcp --start-month <month> --end-month <month> outputs a link to a Google Sheet. This sheet is accessible to anyone with a 2i2c.org account. It has the following columns:

  1. Name of cluster (fetched from cluster.yaml)
  2. One column for each month starting from start-month to end-month
  3. A Total column
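The column layout above can be sketched as a small pivot over per-month cost records. This is a minimal sketch, not the deployer's actual implementation: the names `month_range` and `build_cost_table`, the `(cluster, month, cost)` record shape, and the `YYYY-MM` month format are all assumptions made for illustration. Fetching costs from the cloud provider and writing to a Google Sheet are left out.

```python
from collections import defaultdict


def month_range(start: str, end: str) -> list[str]:
    """Inclusive list of YYYY-MM strings from `start` to `end` (assumed format)."""
    year, month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    months = []
    while (year, month) <= (end_year, end_month):
        months.append(f"{year}-{month:02d}")
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def build_cost_table(records, start: str, end: str) -> list[list]:
    """Pivot (cluster, 'YYYY-MM', cost) records into rows for a sheet.

    Returns a header row followed by one row per cluster:
    cluster name, one cost per month, then the total.
    """
    months = month_range(start, end)
    costs = defaultdict(lambda: defaultdict(float))
    for cluster, month, cost in records:
        if month in months:  # ignore records outside the requested range
            costs[cluster][month] += cost

    rows = [["Cluster", *months, "Total"]]
    for cluster in sorted(costs):
        monthly = [round(costs[cluster][m], 2) for m in months]
        rows.append([cluster, *monthly, round(sum(monthly), 2)])
    return rows
```

Months with no recorded cost for a cluster show up as 0.0, so every row has the same number of columns, which keeps the resulting sheet rectangular.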

Work to be done

We mostly have this working. I think we'll need to change this to output a new Google Sheet instead of re-using the same one.

Dedicated Clusters on AWS

This should be the second task, as it's equivalent to what we have for GCP. But it's more work, as we haven't done this at all.

Definition of done

Running the deployer command deployer generate cost-table aws --start-month <month> --end-month <month> outputs a link to a Google Sheet. This sheet is accessible to anyone with a 2i2c.org account. It has the following columns:

  1. Name of cluster
  2. One column for each month starting from start-month to end-month
  3. A Total column

Work to be done

https://docs.aws.amazon.com/cur/latest/userguide/what-is-data-exports.html is what we'll use to programmatically access this information. I've just enabled it today, but this needs to be investigated and figured out.

Shared clusters on GCP

To be refined.

Shared clusters on AWS

To be refined.

Tasks

- [x] @consideRatio finishing up AWS shared hub costs (with per-nodepool cost split included)
- [x] @consideRatio transfers the final $$$ amounts for the AWS shared hubs to the cloud billing sheet
- [x] @consideRatio spends a timeboxed (30min) amount of time validating all the shared cloud bills (across GCP & AWS) before marking them 'done to our best effort'
- [x] @consideRatio will make a short video describing and demo'ing how shared billing is done for a GCP shared hub *for april*. Since our current 'automation' is a google sheet, a focused video seemed the appropriate path forward.
- [x] @consideRatio will repeat the same ^, but for AWS.
- [x] @consideRatio will add to our docs, links to all the existing co-working sessions, along with the two new videos produced as part of this - https://github.com/2i2c-org/infrastructure/pull/4023
- [x] @consideRatio will schedule a pairing session with someone for mid-June (after team meeting), to do the numbers for May. (Erik has scheduled a loud reminder to schedule a meeting; scheduling something in early May is too complicated)
- [x] @consideRatio will schedule a pairing session that does *not* involve him for **first week of july** to make sure the numbers actually go to CS&S (Erik has scheduled a loud reminder to schedule a meeting; scheduling something in early May is too complicated)
- [x] @sgibson91 will finish #3989
- [ ] https://github.com/2i2c-org/infrastructure/issues/3761
- [ ] https://github.com/2i2c-org/infrastructure/issues/3711

yuvipanda commented 6 months ago

Current progress on this is:

yuvipanda commented 6 months ago

Me and @consideRatio spent more time on this today, and finished the AWS shared hub costs as well.

Next steps here are:

Steps are relocated, see top post of issue.

- [ ] @consideRatio finishing up AWS shared hub costs (with per-nodepool cost split included)
- [ ] @consideRatio transfers the final $$$ amounts for the AWS shared hubs to the cloud billing sheet
- [ ] @consideRatio spends a timeboxed (30min) amount of time validating all the shared cloud bills (across GCP & AWS) before marking them 'done to our best effort'
- [ ] @consideRatio will make a short video describing and demo'ing how shared billing is done for a GCP shared hub *for april*. Since our current 'automation' is a google sheet, a focused video seemed the appropriate path forward.
- [ ] @consideRatio will repeat the same ^, but for AWS.
- [ ] @consideRatio will add to our docs, links to all the existing co-working sessions, along with the two new videos produced as part of this
- [ ] @consideRatio will schedule a pairing session with someone for mid-June (after team meeting), to do the numbers for May.
- [ ] @consideRatio will schedule a pairing session that does *not* involve him for **first week of july** to make sure the numbers actually go to CS&S
- [ ] @sgibson91 will finish #3989

This will allow someone else to pair with @consideRatio to do the $$$ for May, and then that person can pair with yet another person to do that for June. This would allow us to have a semi-automated process that ensures that CS&S can get these numbers within one week of end of quarter.

yuvipanda commented 6 months ago

I've moved the tasklist to the body of the issue.

haroldcampbell commented 6 months ago

Currently blocked and waiting on https://github.com/2i2c-org/infrastructure/issues/3989

sgibson91 commented 5 months ago

#3989 is now complete

haroldcampbell commented 5 months ago

@yuvipanda can we get some insight into how to close this card?

haroldcampbell commented 5 months ago

@yuvipanda what's the action required to close this?

yuvipanda commented 5 months ago

Per sprint planning meeting, @haroldcampbell is going to make choices here, and ask me specific questions if needed.

haroldcampbell commented 4 months ago

Depends on completing issue https://github.com/orgs/2i2c-org/projects/49/views/1?filterQuery=-allocation%3A%22myst-md%22&pane=issue&itemId=55740909

Gman0909 commented 4 months ago

This has now been blocked for three weeks - is there anything we can do to move it forward?

yuvipanda commented 4 months ago

Enough people have now completed this work, and we currently have no further work planned here. With that, I'm going to close this one.