The definition of done looks good to me, @consideRatio.
If we go for the Cost Explorer API, further work to define/refine the remaining tasks will be needed.
If this isn't part of the spike, once the spike is done can you create another issue to track this? Thanks.
Picking it up now with some initial reading at the end of my day, to be continued tomorrow.
I've arrived at what I consider sufficient grounds for a decision to move ahead with the Cost Explorer API.
It seems technically very viable, and the motivation by Yuvi for using the Cost Explorer API over Athena is sufficient in my mind.
There are a few major advantages over using Athena:
- Much easier to validate, as we aren't writing complex SQL queries but translating what we can visually do in the Cost Explorer console into API calls.
- Athena is not per AWS account but at the AWS organization level, so we would have needed an intermediate layer anyway for cases where we use the 2i2c AWS organization. We wouldn't have needed this for Openscapes, but using it for any of our other AWS accounts would have required an intermediate Python layer for access control (so different communities can't see each other's data).
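To illustrate the access-control point, the intermediate Python layer could pin each community's queries to its own tag filter. This is a minimal sketch of the idea; the tag key, community names, and mapping are invented for illustration and not taken from this issue:

```python
# Hypothetical mapping from community name to the Cost Explorer tag
# filter it is allowed to query; anything else is rejected.
COMMUNITY_TAG_FILTERS = {
    "openscapes": {"Tags": {"Key": "2i2c:community", "Values": ["openscapes"]}},
    "other-community": {"Tags": {"Key": "2i2c:community", "Values": ["other-community"]}},
}


def filter_for_community(community: str) -> dict:
    """Return the Cost Explorer Filter for a community, so that one
    community can never query another community's costs."""
    try:
        return COMMUNITY_TAG_FILTERS[community]
    except KeyError:
        raise PermissionError(f"Unknown community: {community}")
```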
Another positive conclusion is that it seems we can avoid much complexity within the Python intermediary, and can put that complexity in the Grafana queries instead. This is because the infinity plugin seems to allow notable post-processing of the JSON responses. Due to this, we can probably iterate more quickly and responsively on the cost dashboards and improve them, letting the Python intermediary be a slim project with relatively low complexity, which also makes it more viable for re-use by others.
Given that we'll be working on https://2i2c.productboard.com/roadmap/7803626-product-delivery-flow/features/27195081 in the future, as well as possibly needing to extend this work to GCP, and given the recommendations in https://docs.aws.amazon.com/cost-management/latest/userguide/ce-api-best-practices.html#ce-api-best-practices-optimize-costs, I'd like most of the complexity to actually be in the Python layer, and not in the Grafana layer. Fixing issues in Python code is also far more accessible to team members and other open source contributors than fixing it in jsonnet plus the filtering languages that the Grafana plugin uses. So let's use the Grafana plugin primarily as a visual display layer, and keep most of the complexity in the Python code.
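The best-practices page linked above recommends keeping the number of billed Cost Explorer requests down, since the API charges per request. One natural way to honor that in the Python layer is to cache responses. A minimal sketch, assuming a simple in-memory TTL cache; all names here are illustrative, not from any existing repo:

```python
import time
from functools import wraps


def ttl_cache(ttl_seconds=3600):
    """Cache results keyed by arguments, expiring after ttl_seconds.

    Cost Explorer bills per API request, so re-serving an identical
    Grafana query from memory avoids repeat charges.
    """
    def decorator(func):
        cache = {}

        @wraps(func)
        def wrapper(*args, **kwargs):
            key = (args, tuple(sorted(kwargs.items())))
            now = time.monotonic()
            if key in cache:
                value, stored_at = cache[key]
                if now - stored_at < ttl_seconds:
                    return value
            value = func(*args, **kwargs)
            cache[key] = (value, now)
            return value

        return wrapper
    return decorator
```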
This task is blocking tasks towards attributing costs using Athena, as Yuvi learned about another approach that should be evaluated first. This is described in https://github.com/2i2c-org/infrastructure/issues/4453#issuecomment-2301947867:
Practical spike steps
I think this has to be updated continuously as part of the spike, but the goal is to clarify and verify that it's reasonable to move towards using the Cost Explorer API.
- [x] Understand and clarify details of Yuvi's step "2. An intermediate python web server, that talks to the Cost Explorer API". My preliminary understanding is that we would opt in to deploy something from the support chart for this, and that it may need credentials set up via Terraform to access the Cost Explorer API.
- [x] Could `aws/aws-sdk-pandas` (aka `awswrangler`) be worth using? It's not clear if we should use `awswrangler`, but I think the path is to assume we don't until we have a known need to manipulate the response from the Cost Explorer API before serving it back to Grafana. Probably not much data transformation is needed, but the server needs to expose a bridge API so that Grafana requests to populate our wanted dashboards can be translated into Cost Explorer API responses (see the sketch below, after these checklist items).
- [x] What kind of queries are expected to come from Grafana, and what kind of responses are expected? We need to consider time ranges etc., right?
  I think the responses from the Python server should be JSON, respect time ranges passed as query parameters, and be able to filter on the relevant dimensions; the sketch below illustrates one way to handle this.
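To make the bridge idea and the time-range handling concrete, here is a minimal sketch of such an intermediary, assuming FastAPI and boto3; the framework choice, endpoint path, and parameter names are illustrative assumptions rather than decisions from this issue. Grafana's `${__from}`/`${__to}` variables interpolate as epoch milliseconds, while `get_cost_and_usage` wants `YYYY-MM-DD` strings with an exclusive end date, so the endpoint converts between the two:

```python
from datetime import datetime, timedelta, timezone

import boto3
from fastapi import FastAPI, Query

app = FastAPI()
ce = boto3.client("ce")  # credentials resolved from the environment / IAM role


def to_time_period(from_ms: int, to_ms: int) -> dict:
    """Convert Grafana's epoch-millisecond range into the TimePeriod
    dict that get_cost_and_usage expects. End is exclusive, so round
    it up one day to include the last partial day of the range."""
    start = datetime.fromtimestamp(from_ms / 1000, tz=timezone.utc).date()
    end = datetime.fromtimestamp(to_ms / 1000, tz=timezone.utc).date() + timedelta(days=1)
    return {"Start": start.isoformat(), "End": end.isoformat()}


@app.get("/costs/total")
def total_costs(from_ms: int = Query(...), to_ms: int = Query(...)):
    """Translate a Grafana request into one Cost Explorer query and
    return the JSON response mostly as-is (the passthrough approach)."""
    response = ce.get_cost_and_usage(
        TimePeriod=to_time_period(from_ms, to_ms),
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
    )
    return response["ResultsByTime"]
```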
Exploration of existing things
- Python dependencies: relies on `boto3` and `botocore`, not `awswrangler`
- Cost API used: `get_cost_and_usage`, as seen here
- [x] Should the Python webserver mostly pass requests through to the Cost Explorer API, or should it map requests in a more hardcoded way? Ideally, we can avoid hardcoding a mapping between a Python web API returning JSON and the Cost Explorer API, and instead pass requests through to the relevant endpoints of the Cost Explorer API.
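For reference, a minimal sketch of what a `get_cost_and_usage` call looks like through boto3; the dates, metric, and grouping dimension below are illustrative assumptions rather than choices made in this issue:

```python
import boto3

ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-08-01", "End": "2024-09-01"},  # End is exclusive
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    # Group daily costs by AWS service; grouping by "TAG" instead is the
    # likely route for attributing costs per hub/community later on.
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

# Each entry in ResultsByTime covers one Granularity bucket and lists
# one Groups entry per service with its cost for that bucket.
for bucket in response["ResultsByTime"]:
    for group in bucket["Groups"]:
        service = group["Keys"][0]
        amount = group["Metrics"]["UnblendedCost"]["Amount"]
        print(bucket["TimePeriod"]["Start"], service, amount)
```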
It seems that we can do quite a bit with raw JSON data by crafting queries against the infinity datasource that post-process the JSON response. Due to this, I think the key thing we should ensure is that the Python intermediary provides relevant JSON responses for post-processing by the infinity datasource queries.
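One way to keep those responses friendly to the infinity datasource is to flatten Cost Explorer's nested `ResultsByTime` structure into one record per time bucket and group, which is the shape a table or time-series panel consumes most naturally. A sketch, with a hypothetical function name:

```python
def flatten_results(results_by_time: list) -> list:
    """Flatten Cost Explorer's nested response into one flat record per
    (time bucket, group), a shape Grafana's infinity datasource can
    consume directly as a table or time series."""
    records = []
    for bucket in results_by_time:
        for group in bucket.get("Groups", []):
            records.append(
                {
                    "date": bucket["TimePeriod"]["Start"],
                    "name": group["Keys"][0],
                    "cost": float(group["Metrics"]["UnblendedCost"]["Amount"]),
                }
            )
    return records
```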
Definition of done
Potential follow-up work not part of the spike