Green-Software-Foundation / if

Impact Framework
https://if.greensoftware.foundation/
MIT License

Allow different aggregation methods for time and component aggregation #991

Closed jmcook1186 closed 4 days ago

jmcook1186 commented 2 weeks ago

What

Enable aggregation to use a different method (sum, avg, copy, none) for "horizontal" (time) and "vertical" (component) aggregation for a single parameter. Also rename horizontal and vertical to time and component aggregation respectively.

Why

There are cases where we need to average across a time series but sum across components, and vice versa. A critical example is the SCI score: we have to be able to average many snapshots of SCI taken within a single time series, but then sum across the components in a tree to give an overall SCI value.

Context

The SCI value is a rate.

If we have a functional unit of e.g. visits, and we have values for that per timestep, then we can gather an SCI score per timestep by doing carbon/visits. This is what our SCI plugin does in each timestep, in each component. However, now let’s say we want to do this over three components.

We have per-timestep SCI in units of gCO2e/visit in each of three components to aggregate up to a single value. We don’t want to sum across time, because what we end up with is not SCI: it is an inflated, spuriously high rate that doesn’t represent the actual rate at any point during our time series.

E.g. if you did 60 mph for an hour, you would cover 60 miles and your average speed would be 60 mph and your max speed would also be 60 mph, but if we added up the speed of your car measured every minute for an hour long journey, we’d end up saying you went 3600 mph. We're effectively doing this with SCI.

So instead, we want to set the aggregation method to avg. Alternatively, we could add a normalization step: generate a time-totalled SCI by setting the aggregation method to sum, then divide by the number of timesteps afterwards (which gives exactly the same result).
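
The speed analogy can be sketched in a few lines of TypeScript. The helper names `sumOverTime` and `avgOverTime` are hypothetical and not part of IF; they only illustrate why summing a rate over time inflates it, and why sum followed by division by the number of timesteps equals avg.

```typescript
// Hypothetical helpers for illustration only, not part of the IF codebase.
const sumOverTime = (values: number[]): number =>
  values.reduce((total, value) => total + value, 0);

// Returning 0 for an empty series is an arbitrary choice made for this sketch.
const avgOverTime = (values: number[]): number =>
  values.length === 0 ? 0 : sumOverTime(values) / values.length;

// Speed sampled once per minute over a one-hour journey at a constant 60 mph.
const speedSamples: number[] = Array.from({length: 60}, () => 60);

const summedSpeed = sumOverTime(speedSamples); // 3600: not a meaningful speed
const averageSpeed = avgOverTime(speedSamples); // 60: the actual rate

// Summing and then dividing by the number of timesteps is the same as avg.
const normalizedSum = summedSpeed / speedSamples.length;

console.log({summedSpeed, averageSpeed, normalizedSum});
```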

No problem, then, for calculating the average SCI per component, but now we want to aggregate across components. Now we really DO want to sum the SCI values together to yield one overarching value for the whole tree, but oh dear we already set our aggregation method to avg. So we can only get a spuriously LOW estimate in the top level aggregation because we are forced to average where we want to sum.

So, because we have a single aggregation method that covers both time and component aggregation, and we can’t do operations over values after they are aggregated - we can’t calculate SCI in a multi-component manifest inside IF.

To resolve this, we need to be able to configure the aggregation method for vertical/component and horizontal/time aggregation independently.
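
As a rough sketch of what independent configuration buys us, the snippet below aggregates three components first over time and then across components, with a separate method for each axis. The names `aggregate` and `aggregateTree`, the `Method` type, and the SCI values are all made up for illustration; this is not how IF implements aggregation.

```typescript
// Hypothetical types and names; a sketch of the idea, not IF's implementation.
type Method = 'sum' | 'avg';

const aggregate = (values: number[], method: Method): number => {
  const total = values.reduce((acc, value) => acc + value, 0);
  return method === 'avg' && values.length > 0 ? total / values.length : total;
};

// Illustrative per-timestep SCI values (gCO2e/visit) for three components.
const sciPerComponent: number[][] = [
  [2, 4, 6],
  [1, 1, 1],
  [3, 3, 3],
];

// Aggregate each component over time first, then combine the components.
const aggregateTree = (
  series: number[][],
  timeMethod: Method,
  componentMethod: Method
): number =>
  aggregate(
    series.map(component => aggregate(component, timeMethod)),
    componentMethod
  );

const sumOnly = aggregateTree(sciPerComponent, 'sum', 'sum'); // 24: inflated
const avgOnly = aggregateTree(sciPerComponent, 'avg', 'avg'); // 8 / 3: too low
const independent = aggregateTree(sciPerComponent, 'avg', 'sum'); // 4 + 1 + 3 = 8

console.log({sumOnly, avgOnly, independent});
```

With a single shared method only the first two results are reachable; the third needs avg for time and sum for components.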

While we are here, we should also rename horizontal aggregation --> time aggregation, and vertical aggregation --> component aggregation, so that they are unambiguous.

I propose updating the parameter-metadata config in all plugin source code, along with the parameter-metadata type definition that allows for metadata overwriting, so that aggregation-method is an object with two fields, time and component, each of which accepts the sum, avg, none or copy enum variants.
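
One possible shape for the updated type definition is sketched below. The names `AggregationMethod`, `AggregationOptions`, `ParameterMetadataEntry` and the two type guards are assumptions made for this sketch; IF's actual type names and validation approach may differ.

```typescript
// Hypothetical type names; IF's actual definitions may differ.
const AGGREGATION_METHODS = ['sum', 'avg', 'none', 'copy'] as const;
type AggregationMethod = (typeof AGGREGATION_METHODS)[number];

interface AggregationOptions {
  time: AggregationMethod;
  component: AggregationMethod;
}

interface ParameterMetadataEntry {
  description: string;
  unit: string;
  'aggregation-method': AggregationOptions;
}

const isAggregationMethod = (value: unknown): value is AggregationMethod =>
  typeof value === 'string' &&
  (AGGREGATION_METHODS as readonly string[]).indexOf(value) !== -1;

// Runtime check for metadata read from a manifest, where types are not enforced.
const isAggregationOptions = (value: unknown): value is AggregationOptions => {
  if (typeof value !== 'object' || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return (
    isAggregationMethod(candidate.time) &&
    isAggregationMethod(candidate.component)
  );
};

const sciMetadata: ParameterMetadataEntry = {
  description: 'carbon expressed in terms of the given functional unit',
  unit: 'gCO2e',
  'aggregation-method': {time: 'avg', component: 'sum'},
};

console.log(isAggregationOptions(sciMetadata['aggregation-method']));
```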

In plugin source code:

export const Sci = (
  config: ConfigParams,
  parametersMetadata: PluginParametersMetadata,
  mapping: MappingParams
): ExecutePlugin => {
  const metadata = {
    kind: 'execute',
    inputs: {
      ...({
        carbon: {
          description: 'an amount of carbon emitted into the atmosphere',
          unit: 'gCO2e',
          'aggregation-method': {
            time: 'sum',
            component: 'sum',
          },
        },
        'functional-unit': {
          description:
            'the name of the functional unit in which the final SCI value should be expressed, e.g. requests, users',
          unit: 'none',
          'aggregation-method': {
            time: 'sum',
            component: 'sum',
          },
        },
      } as ParameterMetadata),
      ...parametersMetadata?.inputs,
    },
    outputs: parametersMetadata?.outputs || {
      sci: {
        description: 'carbon expressed in terms of the given functional unit',
        unit: 'gCO2e',
        'aggregation-method': {
          time: 'avg',
          component: 'sum',
        },
      },
    },
  };

In manifest (when setting param metadata)

  sci:
    path: builtin
    method: Sci
    config:
      functional-unit: site-visits
    parameter-metadata:
      outputs:
        sci:
          unit: gCO2 / visit
          description: software carbon intensity
          aggregation-method: 
            time: avg
            component: sum

The aggregation config should also be updated so that its type field accepts both, time and component, rather than both, horizontal and vertical.

Skipping components

While we are updating the aggregation feature, we should also support skipping named components from the aggregation. This is necessary to enable cross-component arithmetic. For example, imagine we have a component that is used to import page-visit data from an API, and we then want to use that as a functional unit in an SCI calculation across our manifest. If we aggregate using our current feature, we'll throw an exception because one of our components (the one with page-visits) won't have carbon values, so aggregation will fail. We don't want that - we just want to ignore that component in our aggregation.

So ideally we'll have aggregation config that supports skip-components, looking something like:

aggregation:
  metrics:
    - carbon
    - sci
  type: both
  skip-components:
    - page-visits # this maps to a component name

Error out if the names given in the aggregation config do not map to component names in the tree.
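
A minimal sketch of the skip-and-validate behaviour follows. The `AggregationConfig` and `TreeNode` shapes and the function names are assumptions made for this example; only the `skip-components` key and the error-on-unknown-name rule come from the description above.

```typescript
// Hypothetical shapes and names; a sketch, not IF's implementation.
interface AggregationConfig {
  metrics: string[];
  type: 'time' | 'component' | 'both';
  'skip-components'?: string[];
}

interface TreeNode {
  children?: Record<string, TreeNode>;
}

// Collect every component name that appears anywhere in the tree.
const collectComponentNames = (
  node: TreeNode,
  names: Set<string> = new Set<string>()
): Set<string> => {
  const children = node.children || {};
  for (const name of Object.keys(children)) {
    names.add(name);
    collectComponentNames(children[name], names);
  }
  return names;
};

// Throw if any skipped name does not match a component in the tree.
const validateSkipComponents = (
  config: AggregationConfig,
  tree: TreeNode
): string[] => {
  const skipped = config['skip-components'] || [];
  const known = collectComponentNames(tree);
  const unknown = skipped.filter(name => !known.has(name));
  if (unknown.length > 0) {
    throw new Error(`skip-components not found in tree: ${unknown.join(', ')}`);
  }
  return skipped;
};

// Names of the direct children that should take part in aggregation.
const componentsToAggregate = (
  config: AggregationConfig,
  tree: TreeNode
): string[] => {
  const skipped = new Set(validateSkipComponents(config, tree));
  return Object.keys(tree.children || {}).filter(name => !skipped.has(name));
};

const exampleTree: TreeNode = {
  children: {'component-1': {}, 'component-2': {}, 'page-visits': {}},
};
const exampleConfig: AggregationConfig = {
  metrics: ['carbon', 'sci'],
  type: 'both',
  'skip-components': ['page-visits'],
};

console.log(componentsToAggregate(exampleConfig, exampleTree));
```

Validating against the whole tree before aggregating means a misspelled component name fails early instead of silently aggregating a component that was meant to be skipped.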

Prerequisites/resources

n/a

SoW (scope of work)

Acceptance criteria

GIVEN the changes are implemented

WHEN I run the following manifest:

name: GSF Website SCI
description: Generates SCI score (gCO2eq/visit) for greensoftware.foundation website
tags:
aggregation:
  metrics:
    - carbon
    - sci
  type: both

initialize:
  plugins:
    sci:
      kind: plugin
      method: Sci
      path: "builtin"
      config:
        functional-unit: site-visits
      parameter-metadata:
        inputs:
          carbon:
            description: carbon emitted in gCO2e
            unit: gCO2e
            aggregation-method: 
              time: 'sum'
              component: 'sum'
          site-visits:
            description: times site was visited
            unit: visit
            aggregation-method: 
              time: 'sum'
              component: 'sum'
        outputs:
          sci:
            description: software carbon intensity
            unit: gCO2 / visit
            aggregation-method: 
              time: 'avg'
              component: 'sum'

tree:
  children:
    component-1:
      pipeline:
        compute:
          - sci
      defaults:
      inputs:
        - timestamp: '2024-07-22T00:00:00'
          duration: 86400   
          site-visits: 228
          carbon: 0.0007
        - timestamp: '2024-07-23T00:00:00'
          duration: 86400
          site-visits: 216
          carbon: 0.0007
        - timestamp: '2024-07-24T00:00:00'
          duration: 86400
          site-visits: 203
          carbon: 0.0007

    component-2:
      pipeline:
        compute:
          - sci
      defaults:
      inputs:
        - timestamp: '2024-07-22T00:00:00'
          duration: 86400
          site-visits: 228
          carbon: 0.0007
        - timestamp: '2024-07-23T00:00:00'
          duration: 86400
          site-visits: 216
          carbon: 0.0007
        - timestamp: '2024-07-24T00:00:00'
          duration: 86400
          site-visits: 203
          carbon: 0.0007

THEN I get the following result:

name: GSF Website SCI
description: Generates SCI score (gCO2eq/visit) for greensoftware.foundation website
tags:
aggregation:
  metrics:
    - carbon
    - sci
  type: both

initialize:
  plugins:
    sci:
      kind: plugin
      method: Sci
      path: "builtin"
      config:
        functional-unit: site-visits
      parameter-metadata:
        inputs:
          carbon:
            description: carbon emitted in gCO2e
            unit: gCO2e
            aggregation-method: 
              time: 'sum'
              component: 'sum'
          site-visits:
            description: times site was visited
            unit: visit
            aggregation-method: 
              time: 'sum'
              component: 'sum'
        outputs:
          sci:
            description: software carbon intensity
            unit: gCO2 / visit
            aggregation-method: 
              time: 'avg'
              component: 'sum'

tree:
  children:
    component-1:
      pipeline:
        compute:
          - sci
      defaults:
      inputs:
        - timestamp: '2024-07-22T00:00:00'
          duration: 86400
          site-visits: 228
          carbon: 0.0007
        - timestamp: '2024-07-23T00:00:00'
          duration: 86400
          site-visits: 216
          carbon: 0.0007
        - timestamp: '2024-07-24T00:00:00'
          duration: 86400
          site-visits: 203
          carbon: 0.0007
      outputs:
        - timestamp: '2024-07-22T00:00:00'
          duration: 86400   
          site-visits: 228
          carbon: 0.0007
          sci: 3.070175438596491e-06
        - timestamp: '2024-07-23T00:00:00'
          duration: 86400   
          site-visits: 216
          carbon: 0.0007
          sci: 3.2407407407407406e-06
        - timestamp: '2024-07-24T00:00:00'
          duration: 86400   
          site-visits: 203
          carbon: 0.0007
          sci: 3.4482758620689654e-06
      aggregated:
        carbon: 0.0021
        sci:  3.2530640138020657e-06

    component-2:
      pipeline:
        compute:
          - sci
      defaults:
      inputs:
        - timestamp: '2024-07-22T00:00:00'
          duration: 86400   
          site-visits: 228
          carbon: 0.0007
        - timestamp: '2024-07-23T00:00:00'
          duration: 86400   
          site-visits: 216
          carbon: 0.0007
        - timestamp: '2024-07-24T00:00:00'
          duration: 86400   
          site-visits: 203
          carbon: 0.0007
      outputs:
        - timestamp: '2024-07-22T00:00:00'
          duration: 86400   
          site-visits: 228
          carbon: 0.0007
          sci: 3.070175438596491e-06
        - timestamp: '2024-07-23T00:00:00'
          duration: 86400   
          site-visits: 216
          carbon: 0.0007
          sci: 3.2407407407407406e-06
        - timestamp: '2024-07-24T00:00:00'
          duration: 86400   
          site-visits: 203
          carbon: 0.0007
          sci: 3.4482758620689654e-06
      aggregated:
        carbon: 0.0021
        sci:  3.2530640138020657e-06
  outputs:
    - timestamp: '2024-07-22T00:00:00'
      duration: 86400   
      site-visits: 228
      carbon: 0.0014
      sci: 6.140350877192982e-06
    - timestamp: '2024-07-23T00:00:00'
      duration: 86400   
      site-visits: 216
      carbon: 0.0014
      sci: 6.481481481481481e-06
    - timestamp: '2024-07-24T00:00:00'
      duration: 86400   
      site-visits: 203
      carbon: 0.0014
      sci: 6.896551724137931e-06
  aggregated:
    carbon: 0.0042
    sci: 6.506128027604131e-06

jmcook1186 commented 2 weeks ago

@narekhovhannisyan please take a look and lmk if this all makes sense

narekhovhannisyan commented 2 weeks ago

@jmcook1186 Seems good to me, moving to in progress

zanete commented 2 weeks ago

If only one aggregation method is given, then apply it to both (shouldn't be a breaking change). @jmcook1186 - please provide per-plugin info on the aggregation methods

jmcook1186 commented 1 week ago

@narekhovhannisyan added detail on component skipping to issue description

zanete commented 5 days ago

one blocking issue to discuss between @narekhovhannisyan and @jmcook1186 before a PR can be produced.