Closed yuvipanda closed 2 months ago
For storage costs, we will switch to one EFS per hub. This has no particular cost implications, because AWS EFS is billed per use.
I was going to suggest we move to multiple nodepools for cost monitoring, but turns out AWS actually has done a pretty decent job of 'splitting costs' per namespace! https://aws.amazon.com/blogs/aws-cloud-financial-management/improve-cost-visibility-of-amazon-eks-with-aws-split-cost-allocation-data/. I'll have a spike specc'd out soon to determine how to do this.
The spike was completed in https://github.com/2i2c-org/infrastructure/issues/4453, with the outcome that:
I've refined and added tasks to move each hub to its own dedicated nodepool.
This is great @yuvipanda - let me know how I can help!
Instead of drilling down this further, I have written out a more detailed definition of done, and will work with @consideratio in having him do just enough refinement to complete the tasks.
There exists a grafana dashboard that looks like this:
Numbers in purple indicate priority ordering, helpful for scoping conversations.
Fixed costs include the core nodepool, any PV needed for the support chart or hub databases, Kubernetes master API costs, and the cost of any load balancer services (if they cost money). Note that tagging the EKS cluster itself requires recreating it, which we don’t wanna do. Other active tags can be used to include that information though.
Object storage is all S3 related cost from the scratch and persistent buckets, not counting requester pays.
"Compute" is all EC2 cost, including root disks, networking and GPU.
Home directory should include home directory and backup costs.
Total cost should include all 2i2c managed infrastructure.
Each of these graphs needs to be validated so we can trust them and find pieces we have missed, as well as spot bugs in the Athena query.
I would like this to be done within the next 3 sprints (so 2 full sprints with Erik available). We can cut scope as needed.
@ateucher today pointed me to https://docs.aws.amazon.com/cost-management/latest/userguide/ce-api.html, which I had totally missed while doing https://github.com/2i2c-org/infrastructure/issues/4465. I think the lesson for me is that I should hand off at the level in https://github.com/2i2c-org/infrastructure/issues/4453#issuecomment-2298076415 earlier, and rely on others to do such spikes.
Regardless, I think it's early enough that we should investigate this alternative to Athena.
It would involve:
There are a few major advantages over using Athena:
So if possible, we should prefer this method.
We can reuse all the work we had done, except for some parts of https://github.com/2i2c-org/infrastructure/issues/4546.
Next step here is to design a spike to validate this (instead of https://github.com/2i2c-org/infrastructure/issues/4544). The Athena-specific issues that are subtasks of this can be closed if we take this approach.
Instead of doing the refinement work myself, I'm going to take a slightly different approach here, and not write out the spike myself. Instead I'll work with @consideRatio in helping him both scope out and accomplish this work.
It does have this limitation:
The Cost Explorer API can access up to 13 months of historical data and data for the current month. It can also provide 3 months of cost forecast data at the daily level of granularity and 12 months of cost forecast data at the monthly level of granularity.
Athena does not have this limitation.
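To make the Cost Explorer alternative concrete, here is a minimal sketch of what a daily cost query could look like from Python. It assumes `boto3` with its `ce` client and the `UnblendedCost` metric; the helper names `build_daily_cost_request` and `fetch_daily_costs` are made up for illustration, and pagination via `NextPageToken` is not handled. The request builder is kept separate so it can be inspected without AWS credentials.

```python
from datetime import date


def build_daily_cost_request(start: date, end: date) -> dict:
    """Build keyword arguments for Cost Explorer's GetCostAndUsage call.

    The end date is exclusive, as in the Cost Explorer API.
    """
    if start >= end:
        raise ValueError("start must be before end")
    return {
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "DAILY",
        # Assumption: unblended cost is the metric we want to chart.
        "Metrics": ["UnblendedCost"],
    }


def fetch_daily_costs(start: date, end: date) -> list:
    """Run the query. Needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the request builder works without boto3

    client = boto3.client("ce")
    response = client.get_cost_and_usage(**build_daily_cost_request(start, end))
    return response["ResultsByTime"]


# Request covering August 2024.
request = build_daily_cost_request(date(2024, 8, 1), date(2024, 9, 1))
```

Requests reaching further back than the roughly 13 months quoted above would need another data source.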
While working on #4713 and #4712, I've taken these notes:
Accounting for known 2i2c infra total
below for that.
2i2c:hub-name tag.
Use of the AWS tag editor helped figure these things out: https://us-west-2.console.aws.amazon.com/resource-groups/tag-editor/find-resources.
aws:eks:cluster-name=<cluster-name>
kubernetes.io/cluster/<cluster-name>=owned, because that includes all costs captured by this tag as well.
kubernetes.io/cluster/<cluster-name>=owned
This is a critical tag, because we won't have other tags for dynamically created resources such as EBS storage volumes, ELB load balancers, and potentially other things.
If we aren't to use this, it would make sense to try configure the aws-ebs-csi-driver addon to provide extra tags for the volumes (https://github.com/kubernetes-sigs/aws-ebs-csi-driver/tree/master), but this fails to capture the load balancers for example.
It seems like a good call to instead lean on this tag to capture dynamically created resources by various AWS specific k8s controllers.
alpha.eksctl.io/cluster-name=<cluster-name>
2i2c:hub-name=<namespace>
kubernetes.io/cluster/<cluster-name>=owned, but does not fully cover it.
2i2c:hub-name tags cost incurring resources entirely untagged by kubernetes.io/cluster/<cluster-name>=owned, such as:
2i2c:node-purpose=<any value>
alpha.eksctl.io/cluster-name=<cluster-name> for example, so we only capture them via tags applied to our node groups. Due to that, we need to include 2i2c:node-purpose as well for now to capture 2i2c infra costs.
2i2c.org/cluster-name=<cluster-name>
2i2c:node-purpose which is narrowly scoped.
ManagedBy=2i2c
2i2c.org/cluster-name will be used for new hubs.
Based on a given cluster name, such as openscapeshub, the known 2i2c infra total can be calculated using the tag filter:
alpha.eksctl.io/cluster-name=<cluster-name>
kubernetes.io/cluster/<cluster-name>=owned
2i2c.org/cluster-name (for openscapes this needs to be 2i2c:node-purpose=<any value> until k8s upgrades re-creates all nodes)
2i2c:hub-name=<any value>
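The tag filter above could be expressed as a Cost Explorer filter expression roughly like this. This is a sketch under assumptions: that the listed tags are meant to be combined with OR, that "<any value>" can be written as "not ABSENT", and that `Values` may be left out together with the `ABSENT` match option (worth checking against the API reference). The helper names and the `nodes_recreated` switch are hypothetical.

```python
def tag_equals(key: str, value: str) -> dict:
    """Match resources whose tag `key` equals `value`."""
    return {"Tags": {"Key": key, "Values": [value], "MatchOptions": ["EQUALS"]}}


def tag_present(key: str) -> dict:
    """Match resources carrying tag `key` with any value ("not absent")."""
    return {"Not": {"Tags": {"Key": key, "MatchOptions": ["ABSENT"]}}}


def known_infra_filter(cluster_name: str, nodes_recreated: bool = False) -> dict:
    """Build the 'known 2i2c infra' filter for one cluster.

    nodes_recreated=False mirrors the openscapes situation, where nodes are
    still only captured via the 2i2c:node-purpose tag.
    """
    if nodes_recreated:
        node_filter = tag_equals("2i2c.org/cluster-name", cluster_name)
    else:
        node_filter = tag_present("2i2c:node-purpose")
    return {
        "Or": [
            tag_equals("alpha.eksctl.io/cluster-name", cluster_name),
            tag_equals(f"kubernetes.io/cluster/{cluster_name}", "owned"),
            node_filter,
            tag_present("2i2c:hub-name"),
        ]
    }


infra_filter = known_infra_filter("openscapeshub")
```

The resulting dict would be passed as the `Filter` argument of a `get_cost_and_usage` request.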
The following openscapes costs for August 2024, each greater than 1 USD, aren't accounted for yet:
USW2-PublicIPv4:InUseAddress: $11.68
We have public IPs from three sources:
alpha.eksctl.io/cluster-name.
kubernetes.io/cluster/<cluster-name>=owned, but network interfaces of that LB aren't tagged. I expect this to incur cost we fail to track.
Public IPs cost $0.005/hour, so this becomes 24*0.005 == 0.12 per public IP constantly used during a day. I saw that the cost for a recent Sunday was 0.36, so it seems three IPs aren't accounted for.
My guess is that we aren't attributing costs for the NAT Gateway IP, or for the public IPs associated with k8s Service resources of type LoadBalancer.
eksctl config doesn't help us get the network interface tagged for the NAT gateway, and I'm not sure how to make the AWS specific k8s controller running in EKS managed control plane provide a tag for the Public IP associated network interfaces either.
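The public IP arithmetic above can be checked with a few lines of Python. All figures come from the notes in this comment; `Decimal` is used only to avoid floating point rounding noise, and the August average is an extra sanity check, assuming a 31 day month and a constant hourly rate.

```python
from decimal import Decimal

# USD per public IPv4 address per hour, as quoted above.
HOURLY_RATE = Decimal("0.005")
daily_cost_per_ip = HOURLY_RATE * 24

# Cost seen for a recent Sunday.
observed_daily_cost = Decimal("0.36")
untracked_ips = observed_daily_cost / daily_cost_per_ip

# USW2-PublicIPv4:InUseAddress for August 2024, spread over 31 days.
monthly_unaccounted = Decimal("11.68")
avg_ips_august = monthly_unaccounted / (daily_cost_per_ip * 31)

print(daily_cost_per_ip, untracked_ips, round(avg_ips_august, 2))
```

The monthly figure works out to a little over three addresses on average, which lines up with the three IPs estimated from the single Sunday.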
USW2-WarmStorage-ByteHrs-EFS: $3.76
This seems associated with backup, because backups have a concept of warm versus cold storage.
We have an automated backup vault, but it isn't tagged by anything. At the same time, we didn't create this vault and it can be used by other people. We did create a job to schedule backups to get done etc. The "restore point" resources in the vault are tagged.
Anyhow, I think this isn't worth further investigation.
Filter by known 2i2c infra costs, and group by 2i2c:hub-name tag
NOTE: We could also try to group by kubernetes.io/created-for/pvc/namespace, but for now we avoid this complexity and treat all storage volumes as shared costs. Almost all storage costs stem from the prometheus server though, which lives in the support namespace anyhow and not a hub specific namespace.
Track the remaining hub unattributed costs separately
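As a rough sketch of the grouping described above, the snippet below sums a Cost Explorer style response per hub and collects everything without a hub tag under a separate key. It assumes the request used `GroupBy=[{"Type": "TAG", "Key": "2i2c:hub-name"}]` and the `UnblendedCost` metric, and that tag group keys come back as `<tag key>$<tag value>` with an empty value for untagged resources. The sample data and the `unattributed` label are made up for illustration.

```python
from collections import defaultdict

HUB_TAG = "2i2c:hub-name"
GROUP_BY_HUB = [{"Type": "TAG", "Key": HUB_TAG}]


def costs_per_hub(results_by_time: list) -> dict:
    """Sum costs per hub across all time periods in a grouped response."""
    totals = defaultdict(float)
    for period in results_by_time:
        for group in period.get("Groups", []):
            # Keys look like "2i2c:hub-name$prod"; an empty value means untagged.
            _, _, hub = group["Keys"][0].partition("$")
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[hub or "unattributed"] += amount
    return dict(totals)


def _group(hub: str, amount: str) -> dict:
    """Build one hypothetical response group for the sample below."""
    return {
        "Keys": [f"{HUB_TAG}${hub}"],
        "Metrics": {"UnblendedCost": {"Amount": amount, "Unit": "USD"}},
    }


# Hypothetical two day response, not real openscapes numbers.
sample = [
    {"Groups": [_group("prod", "1.5"), _group("staging", "0.25"), _group("", "0.5")]},
    {"Groups": [_group("prod", "2.0")]},
]
totals = costs_per_hub(sample)
```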
Like for hub attributed costs, but also grouping by service types and then combining various service types into user friendlier categories.
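Combining service types into friendlier categories could look something like the following sketch. The category names follow the definition of done earlier in this issue; the service name strings and which category each belongs to are assumptions for illustration and should be checked against what the billing data actually reports.

```python
from collections import defaultdict

# Hypothetical mapping from billing service names to dashboard categories.
SERVICE_TO_CATEGORY = {
    "Amazon Elastic Compute Cloud - Compute": "compute",
    "EC2 - Other": "compute",
    "Amazon Simple Storage Service": "object storage",
    "Amazon Elastic File System": "home directory",
    "AWS Backup": "home directory",
}


def combine_by_category(service_costs: dict) -> dict:
    """Sum per-service costs into categories; unmapped services go to 'other'."""
    totals = defaultdict(float)
    for service, amount in service_costs.items():
        totals[SERVICE_TO_CATEGORY.get(service, "other")] += amount
    return dict(totals)


# Made-up daily costs in USD.
example = combine_by_category(
    {
        "Amazon Elastic Compute Cloud - Compute": 4.0,
        "EC2 - Other": 1.5,
        "Amazon Simple Storage Service": 0.25,
        "Some Unmapped Service": 0.5,
    }
)
```

Keeping an explicit "other" bucket makes services that the mapping has missed show up in the dashboard instead of silently disappearing.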
This is now in a sufficiently functional state for openscapes people to start looking at I think. It can be viewed at https://grafana.openscapes.2i2c.cloud/d/edw06h7udjwg0b/cloud-cost-attribution?orgId=1.
Closing as completed: this is functional for openscapes. When asked to provide a definition of done, I considered documentation on scaling this to other hubs to be outside the openscapes focused epic. Such future steps are now tracked in #4872.
As part of [Initiative] Hub Scale Cost Monitoring #4384, we want to support attributing costs to individual hubs on AWS.
We don't want to do this on all hubs on all clusters, but need to pick a cluster that has multiple hubs already in it to attribute costs. Let's pick openscapes - it has a staging and prod hub, but also a workshop hub!
While this EPIC is focused on openscapes, at the end of it, it would allow us to know exactly what we would need to do to do the same on any other cluster.
Definition of done