bgruening opened this issue 2 years ago
@Renni771 will work on this.
https://mlco2.github.io/codecarbon/ also maybe useful. friendly cli interface and hooks into python too.
> https://mlco2.github.io/codecarbon/ also maybe useful. friendly cli interface and hooks into python too.
Thanks for the link, I'll check it out. This resource also seems interesting, so I'm posting it here for reference: https://www.epa.gov/energy/greenhouse-gas-equivalencies-calculator.
Do we have information about the specific energy mix the University of Freiburg is buying for the compute center? Maybe we could include both the German-average equivalent and the one from our specific energy mix.
The initial carbon emissions reporting is now merged. The next step would be to determine the location of the server that ran each job - the more exact the better. The location influences the carbon intensity and power usage effectiveness variables of the carbon emissions calculation. Currently, we use global average values for both of these. We could also determine the client's location by examining their IP address.
Factoring in the client/server locations would not only make the carbon estimates more accurate, but would also allow us to explore estimating things like data transfer costs along the wire.
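For context, here is roughly where those two variables enter the estimate, assuming a Green Algorithms style model (the symbols are mine, not taken from the implementation):

$$
E = t \cdot P \cdot \mathrm{PUE}, \qquad \text{emissions} = E \cdot \mathrm{CI}
$$

where $t$ is the job runtime in hours, $P$ the hardware power draw in kW, $E$ the energy consumed in kWh, $\mathrm{PUE}$ the data centre's power usage effectiveness, and $\mathrm{CI}$ the location-dependent carbon intensity in gCO2e/kWh. Both $\mathrm{PUE}$ and $\mathrm{CI}$ are properties of where the job physically ran.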
I had a discussion with @bgruening and we decided that allowing admins to set the server location in the config file is a better option than dynamically determining this. The advantage of this approach is that it allows us to maintain a known record of locations and their respective carbon intensity values.
I've been messing around and added location and power usage effectiveness to the Galaxy admin configuration. Here's what I've discovered:

You can set the `geographical_server_location` flag to a valid ISO 3166 code. Why ISO 3166? Because it doesn't just cover countries, but also regions within a country, like `US-TX` for Texas, USA. This way we don't have to make up names for countries/regions, and we get better geographical accuracy. Each location in the reference data is a record like this:
```js
// source: carbonfootprint (March 2022) and Singapore Energy Market Authority (EMA) (data from 2020)
{
  location: "SG",
  name: "Singapore",
  carbonIntensity: 408,
},
// source: carbonfootprint (March 2022) and Energy Policy and Planning Office (EPPO) Thai Government Ministry of Energy (data from 2020)
{
  location: "TH",
  name: "Thailand",
  carbonIntensity: 481,
},
...
```
A record referring to a region within a country would look like this:
```js
// source: carbonfootprint (March 2022) and Canada's submission to UN Framework Convention on Climate Change (2021) (data from 2019, published in 2021)
{
  location: "CA-QC",
  name: "Quebec (Canada)",
  carbonIntensity: 1.5,
},
```
You can set a `power_usage_effectiveness` value in the config as well. This defaults to the global average of 1.67 referenced in the green algorithms tool here. Not sure how many people will actually do this, so I'm debating taking it out. Thoughts?

The `CarbonEmissions` component dynamically updates its footnotes when different PUE and carbon intensity values are set. If you set values for PUE and carbon intensity that aren't the same as the defaults or global averages, you get this:
If no values were set, i.e. the global defaults were used, the UI says that the global averages were used.

It would be great to get carbon intensity data on the fly, but managing the list in house means we don't have to make API calls for each job whenever carbon emissions are calculated. How do you guys feel about this approach so far? Thoughts and input would be much appreciated.
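To make the settings above concrete, here is a sketch of how the two options could look in `galaxy.yml` (option names as described in this thread; the values are illustrative, not defaults):

```yaml
galaxy:
  # ISO 3166 country or region code of the data centre running this server.
  # When unset, the global average carbon intensity is used instead.
  geographical_server_location: DE

  # Power usage effectiveness of the data centre. Defaults to the global
  # average of 1.67 referenced by the green algorithms tool.
  power_usage_effectiveness: 1.2
```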
I converted the file to JSON format and modified the fields so each location has the following format:
Galaxy is using mostly YAML files, so do you think this would be possible here as well? Please also collect all scripts that you need to convert the upstream files under the `/scripts/` folder.
> You can set a `power_usage_effectiveness` value in the config as well. This defaults to the global average of 1.67 referenced in the green algorithms tool here. Not sure how many people will actually do this, so I'm debating taking it out. Thoughts?
I think I have a use case for it :)
Did I understand correctly that `geographical_server_location` is a global default value? If so, this will work for most of the servers and is good for the first iterations. Long term we need to get this value per job, as jobs can run in different locations in the world. I guess we would need a metrics plugin that returns a GeoIP or ISO 3166 location.
> Galaxy is using mostly YAML files, so do you think this would be possible here as well? Please also collect all scripts that you need to convert the upstream files under the `/scripts/` folder.
The source material the Green Algorithms project uses is in raw CSV format. I also made an error: the test reference data I'm using in the initial implementation is a `.js` file. I did this to logically group locations by continent and to allow for code-splitting; the current testing implementation does a single bulk export and would need to be changed. So the file looks something like this:
```js
// countryCarbonIntensity.js
const africa = [
  ...
  // source: carbonfootprint (March 2022) and Climate Transparency (2021 Report) (data from 2020)
  {
    location: "ZA",
    name: "South Africa",
    carbonIntensity: 900.6,
  },
  ...
];

export const countryCarbonIntensity = [
  // source: https://www.iea.org/reports/global-energy-co2-status-report-2019/emissions
  ...africa,
  ...asia,
  ...europe,
];
```
I basically migrated this file 'by hand'. Is YAML recommended for larger data records? If this is the preferred method, I'll go ahead and rewrite the source data in YAML.
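For illustration, the Quebec record from above might look like this in YAML (a sketch; the key naming is my guess, not a settled schema):

```yaml
# source: carbonfootprint (March 2022) and Canada's submission to the
# UN Framework Convention on Climate Change (2021)
- location: CA-QC
  name: Quebec (Canada)
  carbon_intensity: 1.5  # gCO2e/kWh
```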
> I think I have a use case for it :)
In that case, I'll leave it in.
> Did I understand correctly that `geographical_server_location` is a global default value? If so, this will work for most of the servers and is good for the first iterations. Long term we need to get this value per job, as jobs can run in different locations in the world. I guess we would need a metrics plugin that returns a GeoIP or ISO 3166 location.
Yes. You set this once globally in the `galaxy.yml` config file. It would be nice to have a plugin that supports returning a GeoIP or ISO 3166 location per job. Sounds like the next step after this initial groundwork is done. Another question on this point: are there any cases where, for example, an EU Galaxy instance would send some jobs to another location like the US? I ask because already having a location set in the config could help narrow down how we determine location, etc.
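As a sketch of that plugin idea (hypothetical: no such `location` plugin exists yet, and I'm assuming the YAML job metrics config format), the per-job wiring could eventually look something like:

```yaml
# job_metrics_conf.yml
- type: core            # existing plugin: runtime, cores, etc.
- type: location        # hypothetical plugin returning a GeoIP or ISO 3166 code
  default_location: DE  # hypothetical fallback when a destination reports nothing
```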
Galaxy config files are in YAML and that would be preferred, but you are talking about upstream/source data, correct? Maybe we leave them as they are, in the original format? Then admins can easily update them, and the logic to parse them is handled in the backend. The disadvantage is that if upstream changes the format, we need to touch the Galaxy code - not sure how "stable" the format is.
ping @hungmung
In https://github.com/galaxyproject/galaxy/pull/9621 we have added optional AWS cost estimates, to raise awareness about the hidden cost of a Galaxy service.
It's long overdue, but we should also add an indicator of how large the carbon footprint of an analysis or a job is. We could use the same infrastructure as the AWS estimates.
A few interesting links: