Green-Software-Foundation / real-time-cloud


Carbon Free Energy estimates as an Impact Framework dataset #14

Closed: adrianco closed this issue 2 months ago

adrianco commented 11 months ago

Google publishes a table of CFE%, which is a key piece of the information needed for this project; however, the data is only disclosed for 2021 [update: there is GCP data for 2019-2022 on GitHub] and, as far as I can tell, has not been disclosed by AWS or Azure.

For the RTC project, we need to define the data schema we would use to obtain data from any cloud provider, and find a mechanism that can manage the uncertainty in a current estimate, based on data from previous years.

https://cloud.google.com/sustainability/region-carbon - current content of this page is pasted below


adrianco commented 11 months ago

Given data from a previous year, the estimate could be improved by updating the grid carbon intensity for that specific region with a current value; this sets a maximum carbon level. The CFE% from private power purchases is only available for a previous year. Estimating forward to today: the CFE% will improve whenever new private generation capacity comes online, will change up or down whenever the grid changes its carbon intensity, and could get worse as growth in the region increases energy demand. We could estimate a wide CFE% range based on old data to bracket the possible outcomes, or the cloud provider could publish a narrower CFE% range based on their internal knowledge of growth in consumption vs. PPA projects and REC purchases.
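As a rough illustration of the bracketing idea above, the following sketch computes a CFE% range from last year's published value and turns it into an effective carbon-intensity range, with the unadjusted grid intensity as the ceiling. The 5% margin and the numbers in the example are hypothetical, not published figures:

```python
# Sketch only: bracket this year's CFE% from last year's published value.
# The 0.05 margin is a hypothetical uncertainty chosen for illustration.

def cfe_bracket(last_year_cfe: float, margin: float = 0.05) -> tuple:
    """Return (cfe_low, cfe_high), clamped to the valid [0, 1] range."""
    return (max(0.0, last_year_cfe - margin),
            min(1.0, last_year_cfe + margin))

def effective_intensity_range(grid_ci: float, last_year_cfe: float,
                              margin: float = 0.05) -> tuple:
    """Effective gCO2e/kWh range; grid_ci (i.e. CFE = 0) is the maximum."""
    cfe_low, cfe_high = cfe_bracket(last_year_cfe, margin)
    return (grid_ci * (1.0 - cfe_high), grid_ci * (1.0 - cfe_low))
```

For example, with a current grid intensity of 400 gCO2e/kWh and a prior-year CFE of 90%, the effective range works out to roughly 20-60 gCO2e/kWh.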

The Google data is based on a 24x7 hourly algorithm. Carbon data is currently published by all the cloud providers on a monthly basis, so CFE% or a range could also be published monthly. Final CFE% would be published annually a few months after the year ends, and the range would converge to a single value at this point.

So the request to the cloud providers would be to publish interim CFE% estimates on a monthly basis during the year, with past months closing the range to zero when data is final, and with a two month forward estimate.

adrianco commented 11 months ago

Clicking through the link to live data shows that this web page is out of date, and there is annual data for 2019-2022 available, updated a few months ago. https://github.com/GoogleCloudPlatform/region-carbon-info

Year on year, this shows some regions slipping by a few % and some improving a lot.

adrianco commented 11 months ago

The Bigtable view of the data has more details, but seems to omit the 2022 dataset.

(Screenshot: Bigtable view of the dataset, 2023-10-23)

It does however provide an initial schema definition that looks like a useful basis to start with.

(Screenshot: Bigtable schema definition, 2023-10-23)

adrianco commented 11 months ago

Proposed schema for GSF RTC CFE: year, month, hour (optional), resolution, cfe_region, zone_id, grid_carbon_intensity, cloud_region, location, cloud_provider, cfe_low, cfe, cfe_high

This adds monthly and optional hourly resolution, with a flag indicating the resolution of the underlying data; records the grid carbon intensity used as a basis by the cloud provider; abstracts cloud_provider into its own metric; and replaces google_cfe with the low, most probable, and high values. The low and high values are intended to be a 95% confidence interval. Current GCP data would be yearly with 24x7_hourly resolution. Current AWS data would be yearly with yearly resolution. If hour-by-hour data were shared via a real-time API, the hour metric would be provided.
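For illustration only, a single record in this proposed schema might look like the following; every value is invented for the example (the zone identifier format and the CFE numbers are not actual published data):

```python
# Illustrative record shaped like the proposed GSF RTC CFE schema.
# All values below are made up; field names follow the schema above.
record = {
    "year": 2021,
    "month": None,                 # None when resolution is yearly
    "hour": None,                  # optional, only via a real-time API
    "resolution": "24x7_hourly",   # resolution of the underlying data
    "cfe_region": "Iowa",
    "zone_id": "US-MIDW-MISO",     # hypothetical grid zone identifier
    "grid_carbon_intensity": 394,  # gCO2e/kWh used as a basis by the provider
    "cloud_region": "us-central1",
    "location": "Council Bluffs, Iowa",
    "cloud_provider": "Google Cloud",
    "cfe_low": 0.93,               # lower bound of 95% confidence interval
    "cfe": 0.97,                   # most probable value (replaces google_cfe)
    "cfe_high": 1.00,              # upper bound
}
```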

adrianco commented 11 months ago

AWS published a list of regions that are 95% renewable for 2021 and a larger list that are 100% renewable for 2022. The differences between AWS and GCP (as far as I can tell) are that GCP uses carbon offsets to zero out the remainder of its carbon emission on an hourly basis after adding in its PPAs and RECs to the grid mix. AWS uses local grid PPAs and RECs to buy 100% renewable electricity on an annual basis, but doesn't mention carbon offsets. AWS also has a lot more renewable energy generation projects in Asia than GCP.

adrianco commented 11 months ago

I haven't been able to figure out a source for CFE for Azure. Can someone from Microsoft comment?

seanmcilroy29 commented 11 months ago

@tmcclell - are you able to give some insight on this?

adrianco commented 11 months ago

Given the above schema, data would be shared via public BigTable, S3, or Azure Blob objects, and could be updated when the annual report is published, and also whenever a new renewable energy production facility comes online. There's often a PR story around the commitment to build and the final power-on of each facility that could be tied to a data update.

adrianco commented 11 months ago

> I haven't been able to figure out a source for CFE for Azure. Can someone from Microsoft comment?

On our discussion call today, Ritesh confirmed that Azure does not publish CFE%.

adrianco commented 11 months ago

An initial step for the RTC project could be to publish a CFE% table on GitHub based on the data we have available now, but making a best estimate of what the end result would look like for all the cloud providers combined. Later, if the cloud providers supply more or new data, it could be blended in. This data isn't expected to update rapidly.

adrianco commented 9 months ago

Azure CFE% is published for each region (along with PUE) in their Datacenter Facts pages: https://datacenters.microsoft.com/globe/fact-sheets. The CFE data for Google, Azure and AWS has been extracted into a Google Sheet and lined up as much as possible, and it is proposed that this be exported via the Impact Framework: https://github.com/Green-Software-Foundation/if

adrianco commented 9 months ago

The current sheet of raw data is here - this contains guesses for current ranges and is a work in progress at this point https://docs.google.com/spreadsheets/d/1RKjD4CuI5bd7JTj-9Mi1-ZhTIc6OW7TH9hUvLPbLsPA/edit?usp=sharing

seanmcilroy29 commented 9 months ago

#8 (Purchased Renewable Energy is not settled for a year) overlaps with this issue.

adrianco commented 8 months ago

I have restructured the spreadsheet and simplified it a bit. I removed the 2023 estimates that I had made. I moved hourly and annual data to individual columns and removed the column that was tagging the data. Now it contains only the actual published data coming from cloud providers.

There is a column for marginal carbon: should we try to populate it, add our own interpolations, or publish just the pure data from cloud providers?

There is a lot of missing data: should we leave it blank, or populate it with NA so that people don't mistake blank values for zero?

We need to review this and decide if it's ready to publish as an Impact Framework model. https://docs.google.com/spreadsheets/d/1RKjD4CuI5bd7JTj-9Mi1-ZhTIc6OW7TH9hUvLPbLsPA/edit#gid=0

adrianco commented 8 months ago

The Impact Framework uses a hyphenated lower case naming strategy, so I changed all the column headers to match that.

The SCI-o (operational) model requires grid-carbon-intensity as its input, so we need to define required inputs (color-coded red), required outputs (color-coded green), and informational outputs (color-coded blue).

inputs:
  • year ("2019", "2020", "2021", "2022" are currently valid)
  • cloud-provider ("Google Cloud", "Amazon Web Services", "Microsoft Azure" are currently valid)
  • cloud-region ("us-east1", "us-east-1", "eastus"; format unique to the cloud provider)

output:
  • grid-carbon-intensity (numeric, grams of CO2e/kWh)
  • various other informational metrics

I've calculated effective grid-carbon-intensity for Google, given that the data is location-based, which SCI-o requires. We need to decide what, if anything, to output for AWS and Azure.

adrianco commented 8 months ago

For Azure, we could use the annual data and reduce the value given by Electricity Maps (or WattTime, but the Google data is sourced from EM) by the CFE ratio. For Amazon, if we did the same, we would return zero for most of the regions. In both cases, this isn't really the data that SCI-o wants and isn't comparable to the Google data.
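A minimal sketch of the market-method arithmetic described here, assuming the adjustment is simply the location-based intensity scaled by (1 - CFE); as noted above, this is not what SCI-o actually wants:

```python
def market_adjusted_intensity(grid_ci: float, cfe: float) -> float:
    """Scale a location-based intensity (gCO2e/kWh) by the market-based CFE%."""
    return grid_ci * (1.0 - cfe)

# With an annual CFE of 100%, as AWS reports for many regions in 2022,
# this returns zero, which is why the result is not comparable to the
# hourly, location-based Google figures.
```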

adrianco commented 8 months ago

Maybe we have an optional input value "market" which returns the market method numbers for Azure and AWS, and NA for Google.

jawache commented 8 months ago

Hi, @adrianco reviewed the above. Are the following assumptions correct?

Given the above, I believe the goal is to create a model that adjusts (reduces) a grid-carbon intensity value to take into account the CFE% of a region. Downstream models will then adjust the final carbon emissions value with this new grid-carbon-intensity so it represents the investment into 24/7 renewable purchases by that cloud provider.

I propose this is split into several impact framework plugins; we have a philosophy that each plugin does one thing, so we can mix and match plugins in different pipelines.

carbon-free-energy plugin (or real-time-cloud plugin)

If the above is roughly correct, I will spec out a proper plugin spec in the format we use in the impact framework to ensure that nothing is left to assumptions, with clear inputs and expected outputs.

NOTE: I suspect we should do a name change in IF: not use grid-carbon-intensity, and instead use electricity-carbon-intensity. If you are adjusting a grid (location-based) carbon intensity with some market-based measures, then the term grid isn't accurate anymore.

Potential pipeline

pipeline:
  - teads-curve # compute energy from utilization
  - watttime # to get grid-carbon-intensity
  - carbon-free-energy # to adjust grid-carbon-intensity w.r.t. the cfe for that cloud region
  - sci-o # to compute carbon from energy + grid-carbon-intensity
seanmcilroy29 commented 8 months ago

Project members agreed to create a guideline document to explain the headers for the GSF Real-Time Cloud Renewable Energy Percentage data.

Collaboration Google doc to be used for drafting prior to adding to GitHub

adrianco commented 7 months ago

We discussed the document today, and after the meeting I created a first draft of the documentation. I re-ordered and re-named the columns in the sheet to rationalize them and had a first pass at explaining what each metric means and where it comes from. The sheet still needs more data filled in, then missing data marked with NA.

adrianco commented 7 months ago

> Hi, @adrianco reviewed the above. Are the following assumptions correct?

I think this is close, but we are thinking a bit differently about how it would fit in.

>   • You are looking for a way to adjust carbon emissions to consider market-based measures like renewable purchases.
>   • The approach you are proposing is to adjust the grid-carbon-intensity value by the per-region coefficient of CFE% (Something is ringing in my head that we are missing something and CFE% can't be used in this way, I might just need to think it through, and work on an example)

Correct, it can't be used in this way and still be a compliant location-based carbon estimate.

>   • The XL includes some inputs of cloud region, vendor, and outputs of cfe and an adjusted grid-carbon intensity (yearly values I assume)
>   • We are going to maintain these values in a CSV format manually.
>   • I can see in the CSV that some data is quite old. Can we assume that if you request a CFE value, we just return the latest if we don't have data for that year?

Data will always be between 6 and 18 months in the past. A separate IF model step should be used to estimate current data.

> Given the above, I believe the goal is to create a model that adjusts (reduces) a grid-carbon intensity value to take into account the CFE% of a region. Downstream models will then adjust the final carbon emissions value with this new grid-carbon-intensity so it represents the investment into 24/7 renewable purchases by that cloud provider.

The goal is to get all the information about a cloud provider region into a consistent format that can be used for various purposes. We aren't going to invent a new methodology that isn't a valid model.

> I propose this is split into several impact framework plugins; we have a philosophy that each plugin does one thing, so we can mix and match plugins in different pipelines.

This plugin gets all the cloud region data, that's all. It should be at the front of the pipeline for workloads running in the cloud.

> carbon-free-energy plugin (or real-time-cloud plugin)

>   • this is where we maintain the CSV data above.
>   • the inputs it needs are cloud vendor and cloud region, and default inputs are also timestamp

Yes, just these three, and duration, and optionally grid-carbon-intensity. It won't use duration, and it should be earlier in the pipeline, before grid-carbon-intensity is obtained.

>   • it outputs the latest carbon-free-energy figure used by that cloud vendor in that region.

Yes, along with other data about that region.

>   • if the timestamp is for a year for which the data is not present in the CSV: if in the past, it will assume 0% CFE; if in the future, it will use the latest CFE that is in the CSV.

Yes, that works.

>   • If the input already contains grid-carbon-intensity, then it uses that value. Otherwise, it takes the grid-carbon-intensity value from the CSV. (In most use cases in the IF the grid-carbon-intensity would already be provided, we have a wattime plugin and plans for an em plugin etc...)

No, I think it goes before the wattime or em plugin. It outputs the EM or WT key that can be used to pull current data for that grid region, given only the cloud provider and region.

>   • Adjusts the grid-carbon-intensity value so it reflects the cfe value from the CSV.

No.

>   • If there was a previous grid-carbon-intensity value then it would also copy that to another field so we at least have a record of the non-cfe-grid-carbon-intensity.

No need. It would output an annual average grid-carbon-intensity that could be used directly, or could be refined to a more accurate grid-carbon-intensity for a more specific time period by calling a WT or EM plugin.

> If the above is roughly correct, I will spec out a proper plugin spec in the format we use in the impact framework to ensure that nothing is left to assumptions, with clear inputs and expected outputs.

> NOTE: I suspect we should do a name change in IF, not use grid-carbon-intensity and instead use electricity-carbon-intensity. If you are adjusting a grid (location-based) carbon intensity with some market-based measures, then the term grid isn't accurate anymore.

We aren't doing that.
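The lookup behavior settled in the exchange above (years before the earliest CSV entry assume 0% CFE; years after the latest fall back to the most recent entry) could be sketched as follows. The rows, values, and function name are hypothetical, not the actual plugin code:

```python
# Illustrative rows standing in for the maintained CSV data.
ROWS = [
    {"provider": "Google Cloud", "region": "us-central1", "year": 2020, "cfe": 0.93},
    {"provider": "Google Cloud", "region": "us-central1", "year": 2021, "cfe": 0.97},
]

def lookup_cfe(provider: str, region: str, year: int) -> float:
    """Return the CFE fraction for a provider/region/year per the rules above."""
    matches = [r for r in ROWS
               if r["provider"] == provider and r["region"] == region]
    if not matches:
        raise KeyError(f"no data for {provider}/{region}")
    years = sorted(r["year"] for r in matches)
    if year < years[0]:
        return 0.0                      # in the past with no data: assume 0% CFE
    usable = [r for r in matches if r["year"] <= year]
    return max(usable, key=lambda r: r["year"])["cfe"]  # latest year <= requested
```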

> Potential pipeline
>
> pipeline:
>   - teads-curve # compute energy from utilization
>   - watttime # to get grid-carbon-intensity
>   - carbon-free-energy # to adjust grid-carbon-intensity w.r.t. the cfe for that cloud region
>   - sci-o # to compute carbon from energy + grid-carbon-intensity

I think it looks like this:

pipeline:
  - teads-curve # compute energy from utilization
  - cloud-region # look up the cloud region info
  - watttime # optional to get grid-carbon-intensity for now rather than annual data on google (other clouds are NA)
  - sci-o # to compute carbon from energy + grid-carbon-intensity (should also use PUE in the calculation)

There are other uses for the CFE data, perhaps in a tool that picks an optimal region for you.

adrianco commented 7 months ago

Given a tool like https://gcping.com to find the regions that are closest to someone, the cloud-region data could be used to pick the best CFE that is nearby.
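A hypothetical sketch of that idea: combine per-region latency probes (as a tool like gcping provides) with the CFE data to pick the greenest region within a latency budget. The region names and numbers below are invented for illustration:

```python
# Invented probe results: latency from the user plus each region's CFE.
REGIONS = {
    "us-east1":     {"latency_ms": 20,  "cfe": 0.55},
    "us-central1":  {"latency_ms": 45,  "cfe": 0.97},
    "europe-west1": {"latency_ms": 110, "cfe": 0.80},
}

def greenest_region(regions: dict, max_latency_ms: float) -> str:
    """Pick the highest-CFE region whose latency is within the budget."""
    candidates = {name: info for name, info in regions.items()
                  if info["latency_ms"] <= max_latency_ms}
    if not candidates:
        raise ValueError("no region within the latency budget")
    return max(candidates, key=lambda name: candidates[name]["cfe"])
```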

jawache commented 7 months ago

Gotcha thanks @adrianco, that's clear and simple.

The IF team are doing a re-architecture sprint over the next two weeks, so I will hold off on writing a spec till that is complete.

The IF team themselves will be flat out till April. What would you say about speccing this out in some detail and then sharing it with hackathon participants, to see if any of them are interested in taking it up?

jawache commented 6 months ago

@adrianco and @seanmcilroy29, part of the work above has been slated for the next sprint in IF, just the location style fields for now, see https://github.com/orgs/Green-Software-Foundation/projects/26/views/9?pane=issue&itemId=53858077

@adrianco, to be more general purpose I've added a geolocation field; if we can support a lat,lon pair, that would make this data more useful when using services outside of em or wt. The IF team will make the effort to compute this using whatever is the central lat,lon of the location field. Let me know if there is a better alternative.

adrianco commented 5 months ago

Summary document header names rationalized and copied to the spreadsheet. Spreadsheet tidied up, year color coding filled out. NA added to the grid-carbon-intensity output that will be consumed by SCI for AWS and Azure. Still need to fill out some columns with geolocation data, and EM, WT and IEA reference information.

adrianco commented 5 months ago

New detailed issues to complete this data source:
  • https://github.com/Green-Software-Foundation/real-time-cloud/issues/31 - add geolocation data
  • https://github.com/Green-Software-Foundation/real-time-cloud/issues/32 - add WattTime data
  • https://github.com/Green-Software-Foundation/real-time-cloud/issues/33 - add Electricity Maps data
  • https://github.com/Green-Software-Foundation/real-time-cloud/issues/34 - figure out IEC data and add it
  • https://github.com/Green-Software-Foundation/real-time-cloud/issues/35 - cfe region and any other issues

adrianco commented 2 months ago

Data finalized in the cloud region metadata proposal.