2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

[EPIC] Support attributing costs to individual hubs automatically on Openscapes #4453

Closed yuvipanda closed 2 months ago

yuvipanda commented 4 months ago

As part of [Initiative] Hub Scale Cost Monitoring #4384, we want to support attributing costs to individual hubs on AWS.

We don't want to do this on all hubs on all clusters, but need to pick a cluster that has multiple hubs already in it to attribute costs. Let's pick openscapes - it has a staging and prod hub, but also a workshop hub!

While this EPIC is focused on openscapes, by the end of it we should know exactly what we would need to do to achieve the same on any other cluster.

### Tasks
- [ ] https://github.com/2i2c-org/infrastructure/issues/4461
- [ ] https://github.com/2i2c-org/infrastructure/issues/4475
- [ ] https://github.com/2i2c-org/infrastructure/issues/4464
- [ ] https://github.com/2i2c-org/infrastructure/issues/4465
- [ ] https://github.com/2i2c-org/infrastructure/issues/4482
- [ ] https://github.com/2i2c-org/infrastructure/issues/4485
- [ ] https://github.com/2i2c-org/infrastructure/issues/4486
- [ ] https://github.com/2i2c-org/infrastructure/issues/4473
- [ ] https://github.com/2i2c-org/infrastructure/issues/4525
- [ ] https://github.com/2i2c-org/infrastructure/issues/4502
- [ ] https://github.com/2i2c-org/infrastructure/issues/4546
- [ ] https://github.com/2i2c-org/infrastructure/issues/4648
- [ ] https://github.com/2i2c-org/infrastructure/issues/4667
- [ ] https://github.com/2i2c-org/infrastructure/issues/4544
- [ ] https://github.com/2i2c-org/infrastructure/issues/4510
- [ ] https://github.com/2i2c-org/infrastructure/issues/4511
- [ ] https://github.com/2i2c-org/infrastructure/issues/4523
- [ ] https://github.com/2i2c-org/infrastructure/issues/4524
- [ ] https://github.com/2i2c-org/infrastructure/issues/4670
- [ ] https://github.com/2i2c-org/infrastructure/issues/4671
- [ ] https://github.com/2i2c-org/infrastructure/issues/4672
- [ ] https://github.com/2i2c-org/infrastructure/issues/4673
- [ ] https://github.com/2i2c-org/infrastructure/issues/4677
- [ ] https://github.com/2i2c-org/infrastructure/issues/4714
- [ ] https://github.com/2i2c-org/infrastructure/issues/4711
- [ ] https://github.com/2i2c-org/infrastructure/issues/4710
- [ ] https://github.com/2i2c-org/infrastructure/issues/4713
- [ ] https://github.com/2i2c-org/infrastructure/issues/4712
- [ ] https://github.com/2i2c-org/infrastructure/pull/4739
- [ ] https://github.com/2i2c-org/infrastructure/pull/4740
- [ ] https://github.com/2i2c-org/infrastructure/pull/4741
- [ ] https://github.com/2i2c-org/infrastructure/pull/4742
- [ ] https://github.com/2i2c-org/infrastructure/pull/4744
- [ ] https://github.com/2i2c-org/infrastructure/issues/4789
- [ ] https://github.com/2i2c-org/infrastructure/issues/4788
- [ ] https://github.com/2i2c-org/infrastructure/issues/4787
- [ ] https://github.com/2i2c-org/infrastructure/issues/4784
- [ ] https://github.com/2i2c-org/infrastructure/issues/4785
- [ ] https://github.com/2i2c-org/infrastructure/issues/4786
- [ ] https://github.com/2i2c-org/infrastructure/issues/4668
- [ ] https://github.com/2i2c-org/infrastructure/issues/4790
- [ ] https://github.com/2i2c-org/infrastructure/issues/4850
- [ ] https://github.com/2i2c-org/infrastructure/pull/4863
- [ ] https://github.com/2i2c-org/infrastructure/pull/4864
- [ ] https://github.com/2i2c-org/infrastructure/pull/4865
- [ ] https://github.com/2i2c-org/infrastructure/pull/4867
- [ ] https://github.com/2i2c-org/infrastructure/issues/4791
- [ ] docs: write docs on enabling this
- [ ] chart: Declare resource requests and limits
- [ ] chart: Use gunicorn instead of flask to run the flask application
### To meet the definition of done
- [x] EDIT: I'll document this limitation instead of handling it, due to the complexity it was found to introduce. I've concluded it may never be important enough to get done anyhow. --- For PVCs, connect the costs for dynamically created storage disks in namespace X to a `2i2c:hub-name` value if there is a match
- [x] Avoid error requests to Cost Explorer API stemming from requesting data from a future month
- [x] Avoid error requests to Cost Explorer API stemming from requesting data from a month too long ago
- [x] Either handle pagination, or at least raise an error so that no data is silently omitted
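
A minimal sketch of the pagination item above, assuming the response shape of the Cost Explorer `GetCostAndUsage` call (`ResultsByTime` plus an optional `NextPageToken`). The `fetch_all_results` helper, the `get_page` callable and the `max_pages` cap are hypothetical names for illustration, not what the backend actually uses.

```python
from typing import Any, Callable, Dict, List, Optional


def fetch_all_results(
    get_page: Callable[[Optional[str]], Dict[str, Any]],
    max_pages: int = 50,
) -> List[Dict[str, Any]]:
    """Collect ResultsByTime from every page, following NextPageToken."""
    results: List[Dict[str, Any]] = []
    token: Optional[str] = None
    for _ in range(max_pages):
        page = get_page(token)
        results.extend(page.get("ResultsByTime", []))
        token = page.get("NextPageToken")
        if not token:
            return results
    # Raise instead of silently returning truncated data.
    raise RuntimeError(f"more than {max_pages} pages returned, refusing to truncate")


def fake_get_page(token: Optional[str]) -> Dict[str, Any]:
    """Stand-in for a real API call, only here to exercise the loop."""
    pages = {
        None: {"ResultsByTime": [{"day": 1}], "NextPageToken": "t1"},
        "t1": {"ResultsByTime": [{"day": 2}]},
    }
    return pages[token]


all_results = fetch_all_results(fake_get_page)
```

With a real client, `get_page` would wrap the API call and pass the token along only when it is set.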

Definition of done

yuvipanda commented 4 months ago

For storage costs, we will switch to one EFS per hub. This doesn't particularly have cost implications, because AWS EFS is billed per use.

I was going to suggest we move to multiple nodepools for cost monitoring, but it turns out AWS has actually done a pretty decent job of 'splitting costs' per namespace! https://aws.amazon.com/blogs/aws-cloud-financial-management/improve-cost-visibility-of-amazon-eks-with-aws-split-cost-allocation-data/. I'll have a spike specc'd out soon to determine how to do this.

yuvipanda commented 4 months ago

The spike was completed in https://github.com/2i2c-org/infrastructure/issues/4453, with the outcome that:

  1. We can use AWS Athena for these queries, so yay.
  2. We cannot use the split cost allocation feature, because it doesn't cover a couple of resources important to us (disk, primarily).
  3. For clusters where we want to offer 'per hub cost tracking', this means each hub must be on its own tagged nodepool.

I've refined and added tasks to move each hub to its own dedicated nodepool.

ateucher commented 3 months ago

This is great @yuvipanda - let me know how I can help!

yuvipanda commented 3 months ago

Instead of drilling down into this further, I have written out a more detailed definition of done, and will work with @consideratio in having him do just enough refinement to complete the tasks.

Definition of done

There exists a grafana dashboard that looks like this:

image

Details

Numbers in purple indicate priority ordering, helpful for scoping conversations.

Fixed costs include the core nodepool, any PV needed for the support chart or hub databases, Kubernetes master API costs, and costs for any load balancer services (if they cost money). Note that tagging the EKS cluster itself requires recreating it, which we don’t wanna do. Other active tags can be used to include that information though.

Object storage is all S3 related cost from the scratch and persistent buckets, not counting requestor pays.

"Compute" is all ec2 cost, including root disks, networking and gpu.

Home directory should include home directory and backup costs.

Total cost should include all 2i2c managed infrastructure.

Validation

Each of these graphs needs to be validated so we can trust them and find pieces we have missed, as well as spot bugs in the Athena query.

  1. Sum of time series in graphs 1 and 2 should equal graph 4, since summing the cost of each hub + fixed cost, or of each component, should yield the total cost of 2i2c managed infrastructure
  2. Sum of time series in graph 3 for each hub should equal the hub’s value in graph 1.
  3. Each graph should have a written description of how the AWS cost reporting UI can be used to get the same values we have here
  4. For openscapes, graph 4 should mostly match total cloud spend, although they do have some coiled usage.
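
The first validation check could be automated roughly like this. This is only a sketch, assuming graph 1 is the per-hub series, graph 2 the fixed cost series and graph 4 the total, each available as a `{date: cost}` mapping. The function name, the tolerance and the numbers are made up for illustration.

```python
import math
from typing import Dict, List


def find_mismatched_days(
    per_hub: Dict[str, Dict[str, float]],
    fixed: Dict[str, float],
    total: Dict[str, float],
    rel_tol: float = 0.01,
) -> List[str]:
    """Return the days where sum(per-hub costs) + fixed cost deviates from the total."""
    mismatches = []
    for day, total_cost in total.items():
        hub_sum = sum(series.get(day, 0.0) for series in per_hub.values())
        parts = hub_sum + fixed.get(day, 0.0)
        if not math.isclose(parts, total_cost, rel_tol=rel_tol, abs_tol=0.01):
            mismatches.append(day)
    return mismatches


# Made-up numbers: the second day is missing 3.0 USD of attributed cost.
per_hub = {
    "prod": {"2024-08-01": 10.0, "2024-08-02": 12.0},
    "staging": {"2024-08-01": 1.0, "2024-08-02": 1.0},
}
fixed = {"2024-08-01": 4.0, "2024-08-02": 4.0}
total = {"2024-08-01": 15.0, "2024-08-02": 20.0}

mismatched = find_mismatched_days(per_hub, fixed, total)
```

Any day it reports is a place where either a cost is not tagged or a query is double counting.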

Timeline

I would like this to be done within the next 3 sprints (so 2 full sprints with Erik available). We can cut scope as needed.

Next steps

yuvipanda commented 3 months ago

@ateucher today pointed me to https://docs.aws.amazon.com/cost-management/latest/userguide/ce-api.html, which I had totally missed while doing https://github.com/2i2c-org/infrastructure/issues/4465. I think the lesson for me is that I should hand off at the level in https://github.com/2i2c-org/infrastructure/issues/4453#issuecomment-2298076415 earlier, and rely on others to do such spikes.

Regardless, I think it's early enough that we should investigate this alternative to Athena.

It would involve:

  1. https://docs.aws.amazon.com/cost-management/latest/userguide/ce-api.html as the source of data.
  2. An intermediate python web server, that talks to the Cost Explorer API
  3. https://grafana.com/grafana/plugins/yesoreyeram-infinity-datasource/ for connecting this from Grafana. This is recommended by grafana as the replacement for https://github.com/grafana/grafana-json-datasource
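
To make the second item a bit more concrete, here is a rough sketch of the request such a web server could send to the Cost Explorer API. The parameter names follow the `GetCostAndUsage` operation. The `build_cost_request` helper, the choice of daily granularity and `UnblendedCost`, and grouping by the `2i2c:hub-name` tag are assumptions for illustration rather than a settled design.

```python
from datetime import date
from typing import Any, Dict


def build_cost_request(start: date, end: date) -> Dict[str, Any]:
    """Build keyword arguments for a GetCostAndUsage call grouped by hub tag."""
    return {
        # End is exclusive in the Cost Explorer API.
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        # Assumption: hubs are distinguished by this cost allocation tag.
        "GroupBy": [{"Type": "TAG", "Key": "2i2c:hub-name"}],
    }


request = build_cost_request(date(2024, 8, 1), date(2024, 9, 1))
# With boto3 installed and credentials configured, this could be sent with:
#   boto3.client("ce").get_cost_and_usage(**request)
```

Keeping the request building separate from the API call makes it easy to compare against what we click together in the Cost Explorer UI.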

There are a few major advantages over using Athena:

  1. Much easier to validate, as we aren't writing complex SQL queries but translating what we can visually do in the cost explorer into API calls.
  2. Athena is not per AWS account but at the AWS organization level, so we would have needed an intermediate layer anyway for cases when we use the 2i2c AWS organization. We wouldn't have needed this for Openscapes, but trying to use it for any of our other AWS accounts would've required an intermediate python layer for access control (so different communities can't see each other's data).

So if possible, we should prefer this method.

We can reuse all the work we had done, except for some parts of https://github.com/2i2c-org/infrastructure/issues/4546.

The next step here is to design a spike to validate this (instead of https://github.com/2i2c-org/infrastructure/issues/4544). The Athena-specific issues that are subtasks of this can be closed if we are going to take this approach.

Instead of doing the refinement work myself, I'm going to take a slightly different approach here, and not write out the spike myself. Instead I'll work with @consideRatio in helping him both scope out and accomplish this work.

yuvipanda commented 3 months ago

It does have this limitation:

The Cost Explorer API can access up to 13 months of historical data and data for the current month. It can also provide 3 months of cost forecast data at the daily level of granularity and 12 months of cost forecast data at the monthly level of granularity.

Athena does not have this limitation.
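
A sketch of how a backend could clamp a requested date range to stay within this window. It assumes my reading of the limit (history starts on the first day of the month 13 months before the current one) and conservatively caps the exclusive end date at today. The exact boundaries the API enforces are not confirmed here, and `clamp_period` is a hypothetical helper.

```python
from datetime import date
from typing import Tuple


def clamp_period(start: date, end: date, today: date) -> Tuple[date, date]:
    """Clamp [start, end) to the range assumed to be available from Cost Explorer."""
    # Assumption: earliest data is the 1st of the month 13 months before this month.
    month_index = today.year * 12 + (today.month - 1) - 13
    earliest = date(month_index // 12, month_index % 12 + 1, 1)
    # Assumption: cap the exclusive end at today to avoid asking for the future.
    latest = today

    clamped_start = max(start, earliest)
    clamped_end = min(end, latest)
    if clamped_start >= clamped_end:
        raise ValueError("requested period is entirely outside the available range")
    return clamped_start, clamped_end


period = clamp_period(date(2022, 1, 1), date(2025, 1, 1), today=date(2024, 10, 15))
```

Raising on an empty range keeps a dashboard panel from quietly showing nothing when the time picker is set too far back.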

consideRatio commented 2 months ago

While working on #4713 and #4712, I've taken these notes:

Summary

Notes

Wanted accounting details

Overview of tags

Use of the AWS tag editor helped figure these things out: https://us-west-2.console.aws.amazon.com/resource-groups/tag-editor/find-resources.

aws:eks:cluster-name=<cluster-name>

kubernetes.io/cluster/<cluster-name>=owned

alpha.eksctl.io/cluster-name=<cluster-name>

2i2c:hub-name=<namespace>

2i2c:node-purpose=<any value>

2i2c.org/cluster-name=<cluster-name>

ManagedBy=2i2c

Accounting for known 2i2c infra total

Based on a given cluster name, such as openscapeshub, the known 2i2c infra total can be calculated using the tag filter:

Still not accounted costs

These costs for openscapes in August 2024, each greater than 1 USD, aren't accounted for yet:

USW2-PublicIPv4:InUseAddress: $11.68

We have public IPs from three sources:

Public IPs cost $0.005/hour, so this becomes 24*0.005 == 0.12 USD per public IP constantly used during a day. I saw that the cost for a recent Sunday was 0.36 USD, so it seems three IPs aren't accounted for.
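
The same back-of-the-envelope calculation, written out. The hourly rate and the 0.36 USD daily figure are the ones quoted above, and the variable names are just for illustration.

```python
HOURLY_RATE_USD = 0.005          # price per public IPv4 address per hour
OBSERVED_DAILY_COST_USD = 0.36   # unaccounted public IP cost seen for one day

# Cost of one public IP that is in use for a full day.
daily_cost_per_ip = 24 * HOURLY_RATE_USD

# Number of constantly used public IPs that would explain the observed cost.
unaccounted_ips = round(OBSERVED_DAILY_COST_USD / daily_cost_per_ip)
```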

My guess is that we aren't attributing costs for the NAT Gateway IP, or for the public IPs associated with k8s Service resources of type LoadBalancer.

eksctl config doesn't help us get the network interface for the NAT gateway tagged, and I'm not sure how to make the AWS specific k8s controller running in the EKS managed control plane tag the network interfaces associated with the public IPs either.

USW2-WarmStorage-ByteHrs-EFS: $3.76

This seems associated with backup, because there is a distinction between warm and cold storage there.

We have an automated backup vault, but it isn't tagged by anything. At the same time, we didn't create this vault and it can be used by other people. We did create a job to schedule backups. The "restore point" resources in the vault are tagged.

Anyhow, I think this isn't worth further investigation.

Accounting for hub attributed costs

  1. Filter by known 2i2c infra costs, and group by 2i2c:hub-name tag

    NOTE: We could also try to group by kubernetes.io/created-for/pvc/namespace, but for now we avoid this complexity and treat all storage volumes as shared costs. Almost all storage costs stem from the prometheus server though, which lives in the support namespace anyhow and not a hub specific namespace.

  2. Track the remaining hub unattributed costs separately
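
A sketch of these two steps applied to a Cost Explorer style response grouped by the `2i2c:hub-name` tag. It assumes group keys come back as `<tag-key>$<tag-value>`, with an empty value for untagged resources, and that `UnblendedCost` is the metric. `split_hub_costs` and the sample numbers are made up for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

TAG_PREFIX = "2i2c:hub-name$"  # assumed key format: "<tag-key>$<tag-value>"


def split_hub_costs(
    results_by_time: List[Dict[str, Any]],
) -> Tuple[Dict[str, float], float]:
    """Sum costs per hub, and sum costs without a hub tag separately."""
    attributed: Dict[str, float] = defaultdict(float)
    unattributed = 0.0
    for result in results_by_time:
        for group in result.get("Groups", []):
            key = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            hub = key[len(TAG_PREFIX):] if key.startswith(TAG_PREFIX) else ""
            if hub:
                attributed[hub] += amount
            else:
                unattributed += amount
    return dict(attributed), unattributed


def _group(key: str, amount: str) -> Dict[str, Any]:
    """Build one response group, only used for the sample data below."""
    return {"Keys": [key], "Metrics": {"UnblendedCost": {"Amount": amount, "Unit": "USD"}}}


sample = [
    {"Groups": [_group("2i2c:hub-name$prod", "10.5"), _group("2i2c:hub-name$", "4.0")]},
    {"Groups": [_group("2i2c:hub-name$prod", "2.5"), _group("2i2c:hub-name$staging", "1.0")]},
]
hub_costs, shared_costs = split_hub_costs(sample)
```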

Accounting for hub and component attributed costs

Like for hub attributed costs, but also grouping by service types and then combining various service types into user friendlier categories.
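
Combining service types into friendlier categories could look roughly like this. The category names follow the definition of done (compute is all EC2, object storage is S3, home storage is home directories plus backups). The service name strings in the mapping are assumptions about what the API returns and would need checking against real data.

```python
from typing import Dict

# Assumed service names, to be verified against actual Cost Explorer output.
SERVICE_TO_COMPONENT = {
    "Amazon Elastic Compute Cloud - Compute": "compute",
    "EC2 - Other": "compute",
    "Amazon Simple Storage Service": "object storage",
    "Amazon Elastic File System": "home storage",
    "AWS Backup": "home storage",
}


def costs_by_component(costs_by_service: Dict[str, float]) -> Dict[str, float]:
    """Fold per-service costs into user friendlier component categories."""
    totals: Dict[str, float] = {}
    for service, amount in costs_by_service.items():
        # Anything not explicitly mapped ends up in a catch-all category.
        component = SERVICE_TO_COMPONENT.get(service, "other")
        totals[component] = totals.get(component, 0.0) + amount
    return totals


component_costs = costs_by_component(
    {
        "Amazon Elastic Compute Cloud - Compute": 20.0,
        "EC2 - Other": 5.0,
        "Amazon Simple Storage Service": 3.0,
        "AWS Backup": 1.5,
        "Amazon Elastic File System": 2.5,
        "AmazonCloudWatch": 0.25,
    }
)
```

The catch-all category makes unmapped services visible instead of dropping them from the totals.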

consideRatio commented 2 months ago

I think this is now functional enough for openscapes people to start looking at. It can be viewed at https://grafana.openscapes.2i2c.cloud/d/edw06h7udjwg0b/cloud-cost-attribution?orgId=1.

openscapes-cost-attribution-is-up

consideRatio commented 2 months ago

Closing as completed, as this is functional for openscapes. When asked to provide a definition of done for this, I considered documentation on scaling this to other hubs not to be part of the openscapes focused epic. Such future steps are now tracked in #4872.