Green-Software-Foundation / real-time-cloud

How can energy data collection and reporting be implemented by a cloud provider? #4

Open adrianco opened 1 year ago

adrianco commented 1 year ago

Cloud providers may not currently collect energy use at the system level across their fleet of machines, so there could be a development and deployment cost to provide this information. Raw energy data can't be provided per virtual machine instance because it is only collected at the full-system level. There are also security implications; see Intel's security advisory guidance on RAPL energy reporting: https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/running-average-power-limit-energy-reporting.html. This issue provides a place to discuss workarounds and solutions for this problem.

adrianco commented 1 year ago

Workaround - Cloud providers may be able to supply what are known as "bare metal" instances that are a complete machine, with no hypervisor and no partitioning. On those instance types it may be ok to allow access to interfaces such as Intel RAPL that would allow energy monitoring for the whole instance. Questions: Which cloud providers supply bare metal instances, and do they currently allow or block RAPL?

adrianco commented 1 year ago

How is energy collected in datacenters? The PDUs (power distribution units) instrument power usage for each outlet, and the API differs depending on which vendor is used. APC is a common vendor. I was talking to Rob Hirschfeld of RackN, who knows these APIs well and may be able to help us figure out how to collect the data.

adrianco commented 1 year ago

It appears that AWS EC2 bare metal instances do not block RAPL. One next step is to make a list of those bare metal instance types and see if Kepler's model can be calibrated based on real bare metal data.
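For anyone who wants to check this on a bare metal instance, here is a minimal sketch of reading RAPL through the Linux powercap sysfs interface. It assumes an Intel machine with the `intel_rapl` driver loaded and a package domain at `/sys/class/powercap/intel-rapl:0`; that path may be absent, named differently, or readable only by root on a given instance. The helper names are made up for illustration.

```python
import time
from pathlib import Path

# Assumed location of the first CPU package domain; verify on your instance.
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def read_counter(name: str, domain: Path = RAPL_DOMAIN) -> int:
    """Read an integer powercap attribute such as energy_uj."""
    return int((domain / name).read_text().strip())


def average_watts(start_uj: int, end_uj: int, seconds: float, max_range_uj: int) -> float:
    """Average power between two cumulative energy readings in microjoules.

    The counter wraps around at roughly max_range_uj, so a negative
    difference is treated as a single wraparound.
    """
    delta_uj = end_uj - start_uj
    if delta_uj < 0:
        delta_uj += max_range_uj
    return delta_uj / 1_000_000 / seconds


def measure(seconds: float = 1.0, domain: Path = RAPL_DOMAIN) -> float:
    """Sample the package energy counter twice and return average watts."""
    max_range = read_counter("max_energy_range_uj", domain)
    start = read_counter("energy_uj", domain)
    t0 = time.monotonic()
    time.sleep(seconds)
    end = read_counter("energy_uj", domain)
    elapsed = time.monotonic() - t0
    return average_watts(start, end, elapsed, max_range)
```

Calling `measure(5.0)` would return the average package power over roughly five seconds. This covers the CPU package only, not the whole server, so it is a starting point for calibration rather than a full-system measurement.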

ArneTR commented 1 year ago

Hey @adrianco, just stumbled over this post as we were writing an overview post for ourselves lately.

Did you know that Teads has a list with RAPL data which also includes machines from AWS, Scaleway, Equinix etc. that supposedly allow RAPL access? This could prove very helpful: https://docs.google.com/spreadsheets/d/1DqYgQnEDLQVQm5acMAhLgHLD8xXCG9BIrk-_Nv6jF3k/edit#gid=985503428

Also, as mentioned, we have written up a short piece looking into which MSRs are available on some cloud vendors and which hypervisors they are running. Maybe also helpful: https://www.green-coding.berlin/blog/cloud-energy-usage-data/

I also linked the awesome project you are leading here :)

adrianco commented 7 months ago

We discussed this a bit and decided that we need to investigate the Redfish API in more detail: it is more general than RAPL, it's a DMTF standard, and Kepler has figured out how to use it. The next step is to coordinate with the Kepler team to see if we can share what they have learned.
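As a starting point for that investigation, here is a rough sketch of pulling power readings out of a Redfish `Power` resource, which a BMC typically serves at `/redfish/v1/Chassis/{ChassisId}/Power`. The sample payload is invented for illustration, and this is not a description of how Kepler does it. Newer Redfish implementations may expose `PowerSubsystem` and `EnvironmentMetrics` instead, so treat the resource and property names as assumptions to verify against the target hardware.

```python
import json


def power_readings(power_resource: dict) -> dict:
    """Map each PowerControl entry's Name to its PowerConsumedWatts value.

    Entries where the property is missing or null are skipped.
    """
    readings = {}
    for index, control in enumerate(power_resource.get("PowerControl", [])):
        watts = control.get("PowerConsumedWatts")
        if watts is not None:
            name = control.get("Name", f"PowerControl[{index}]")
            readings[name] = float(watts)
    return readings


# Illustrative payload only; real BMC responses carry many more fields.
SAMPLE_POWER = json.loads("""
{
  "@odata.id": "/redfish/v1/Chassis/1/Power",
  "PowerControl": [
    {"Name": "System Power Control", "PowerConsumedWatts": 344},
    {"Name": "Unreported Control", "PowerConsumedWatts": null}
  ]
}
""")
```

With the sample above, `power_readings(SAMPLE_POWER)` yields one reading for "System Power Control". In practice the JSON would come from an authenticated HTTPS GET against the BMC, which is the part a cloud provider would need to wrap or proxy.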

adrianco commented 7 months ago

Cloud providers may not currently be logging energy data for all their machines, in which case the additional cost of providing it as an API would be high. On-demand logging of energy data would have less overhead, but could still be a significant engineering project to implement. A lighter-weight alternative would be for each cloud provider to publish a calibration curve that maps utilization to power consumption. This works fairly well for simple CPUs, has issues with Hyper-Threading, and doesn't work for GPUs, which are of particular interest now that they are becoming common and use a lot more power than CPUs. Calibration curves are available for CPU types that map to datacenter usage or bare metal instances, but there are a lot of custom chips in use at cloud providers: special versions of Intel and AMD parts, fully custom ARM designs, and GPU/TPU accelerators.
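To make the calibration curve idea concrete, here is a minimal sketch that linearly interpolates between published (utilization, watts) points. The points in `EXAMPLE_CURVE` are hypothetical numbers chosen for illustration, not measurements from any provider or CPU, and linear interpolation is just one simple way to read such a curve.

```python
from bisect import bisect_right

# Hypothetical calibration points: (CPU utilization in percent, watts).
EXAMPLE_CURVE = [(0, 100.0), (10, 150.0), (50, 250.0), (100, 350.0)]


def estimate_watts(utilization: float, curve=EXAMPLE_CURVE) -> float:
    """Estimate power at a given utilization by linear interpolation.

    Values outside the curve's range are clamped to the end points.
    The curve must be sorted by utilization.
    """
    utils = [u for u, _ in curve]
    if utilization <= utils[0]:
        return curve[0][1]
    if utilization >= utils[-1]:
        return curve[-1][1]
    i = bisect_right(utils, utilization)
    (u0, w0), (u1, w1) = curve[i - 1], curve[i]
    return w0 + (w1 - w0) * (utilization - u0) / (u1 - u0)
```

For example, `estimate_watts(30)` falls between the 10% and 50% points. A curve like this gives whole-machine power, so attributing it to a single instance would still need a separate rule for sharing it across tenants.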