Green-Software-Foundation / if

Impact Framework
https://if.greensoftware.foundation/
MIT License
159 stars 41 forks source link

EPIC - Time Grid #1057

Open jawache opened 1 month ago

jawache commented 1 month ago

Sub of #1025

Tasks:

Terminology

Background

Since close to the inception of Impact Framework, we've ensured time is one of the dimensions that the IF reports impacts. We want to surface where it makes sense how the impacts change over time to make sure we can highlight the moments where it's high and allow people to dive into the why.

The problem is that we need to sync up the start, end, durations, and number of observations across each component to ensure everything snaps to a grid so that we can sum up data across all components for every time bucket.

When is having a synced set of observations essential?

Plugins don't need the data snapped to a grid, it's primarily useful for aggregation only.

This is not easy; it has caused several problems and complexities, making writing manifest files unusually hard.

Problem

What determines the time window and durations of a component?

It's up to the user, component by component, to make sure the observations are snapped to one standard global time grid.

There is a complex relationship between grouping and time-syncing

Grouping is needed for time syncing, and it's not always clear where you have to put the TimeSync plugin, it has to be after grouping, but it might not be immediately after. Time syncing reduces the usefulness of grouping since you can't group it however you want.

Solution

Have the user define a global time grid and make the framework (including plugins) responsible for automatically aligning all the time series to that time grid.

We define a global time-grid setting where we define the start, end, and window. start and end can be hardcoded date times or relative offsets, e.g., start = now, end can be 30 mins ago, and window can be 60.

We start by making sure everything is snapped to the globally configured time grid:

Plugins are responsible for ensuring that any returning observations are snapped to that global time grid:

We provide plugin authors some support in helping them align their time series to the time grid:

How does aggregation work in a world where all observations are guaranteed to snap to a time-grid?

For example, a component does not have a unique time series.

6bca1e2c3bf018391f0969b43f108244

E.g. Component has a sparse time series

The component is missing some data in cell, that's ok - we are just summing up every cell it does have, up the tree.

E.g. Component represents servers spinning up and then down.

9bcd1a52235099b0999adc39a3696cb8

91ddf05480c83bd4cb3b1f27b5187fe9

Advantages of this approach

jawache commented 1 month ago

FYI @jmcook1186 and @zanete I've added the design doc here for the time-grid proposal.

jmcook1186 commented 1 month ago

@jawache @zanete @narekhovhannisyan

Some additional context on time-sync and time-grid - eventually getting around to the potential pain points that we ought to discuss in refinement:

Ok, agree wholeheartedly that the current time-sync is not perfect because of the need to position it correctly in each pipeline.

what I understand from your design ticket is that the actual functionality of time-sync is ok, but it needs refactoring so that it runs as a part of the framework execution flow and NOT as a plugin, which I also agree would be an improvement.

So, now let's get into the nitty gritty, which is the mechanics of "snapping to the grid", which is a phrase hiding a lot of complexity.

When we say "snap to grid" we are proposing to execute all, or a subset of the following operations:

Let's first look at the "right" way to do each of these before examining whats right for IF manifest files. It's logical to start wth resampling, because unless your initial time series already has an interval equal to that defined in the time-grid config, you can't just trim from the start and end and have timestamps that synchronize - correcting the interval is a necessary precondition.

resampling

Resampling can mean coarsening data to a lower time resolution (e.g. seconds to minutes) - this is "downsampling" - or adding finer resolution ticks between the existing ones (e.g. minutes to seconds) - this is "upsampling".

Both require the creation of new information, which is typically done by applying some function to the existing data and then evaluating that function at the new set of points. A common example is using a least square regression. You find a function f(x) that minimizes the squared error between the known values at a set of x values and the estimates yielded by the function at those same x values. Depending on the type of data the evaluation might be at a point or it might be a definite integral between t and t+duration.

Why don't we do this in IF? Several reasons:

What do we do instead?

Today we assign each parameter a method that is used for both aggregation and time sync. The method reveals whether a value is additive, constant or proportional, and we dealwith each type individually.

For all time series, we divide up time into intervals of 1s duration. Then, we recompose a time series to the desired interval according to the method for each value:

The resampling works as long as the time series is continuous, meaning tn+1 === tn + duration_n + 1. The complication comes when we have gaps in the time series. Then, we have to decide how to fill them. Today, we infill with "zeroish objects" which have the desired timestamps and durations required by the time-syncconfig but the other values are all set to zero, if they have method==sum or method==avg but set to their constant values if they have method==none. This isn't a perfect solution - for example, cpu-util == 0 can be interpreted as a server that is shut down (i.e. power==0.) or a server that is idle (and using ~50-70% of the power at 100% utilization).

Some combination of the above procedures is used to adjust the intervals of each time series so they match the interval of the time-sync config.

In summary, we decompose every time series to 1s resolution, then batch those 1s observations into new timestamps with the desired interval using an aggregation method defined for each parameter.

trimming

Once we have resampled, trimming is easy - we just discard any data whose timestamp is earlier than the defined start time, or later than the defined end time.

extrapolating

We don't actually extrapolate any data to extend time series - we just fill with our aforementioned zeroish objects until the first timestamp matches the defined start, and the last timestamp matches the define end. There are many ways we could choose to extrapolate the dataset, but we decided that it's best not to create data we don't have any knowledge of outside of the time range of the given observations.

How this works with aggregation

The reason this time-sync protocol works well with our time and component aggregation is because the methods used to resample each parameter are the same as the methods used to aggregate those parameters over time, or across components. It's very important to be able to configure those methods, because as we have seen for SCI vs PCF, sometimes it's right to average observations within a time series but sum them across components, or vice-versa, and some parameters only make sense to hold constant. We need a way to label these choices for each parameter and the right choice is context-dependent so it must be easy to change. It will also break things if you change the method for aggregation but not for time-sync or the other way around, so it really does make sense for both time-sync and aggregation to refer to the same variable to determine their method, rather than being defined separately.

Implications for time-grid

We need to persist the coupling between time series resampling and aggregation, because:

All the logic I described above amounts to a lot of complicated code. It doesn't feel great to offload this to plugins unless we can move it into if-core and expose it in a simple way to plugins, similar to how we provide error handling.

The ability to use "fills" as well as interpolations and copies is also very important, as we have to be able to maintain the "spikiness" of time series after resampling, for example, if in a month's work of daily observations, a server is only spun up once for 30 mins, our time series must be flat except for a burst coincident with the real moment of activity for that server, and not artificially smooth, redistribute or extent the values elsewhere in the time series.

There are nuances to how we treat duration too that need to be more tightly specified as we refine this ticket. Today it's not completely clear how to use duration correctly, especially when we are using static time series hardcoded into the manifest. For example, say we have daily timesteps covering a month, and on five days in the month we hold zoom calls that last 3 minutes. We might set the duration to 1800 seconds to indicate that the call lasted that long, but since there is no energy used outside of that 30 minutes at all, and our data is at daily resolution, it's acceptable to have duration set to 86400 seconds (1 day), with the advantage that this yields a continuous time series. However, when time-grid is used to resample the time series by default, this will be problematic, as it will often create artefacts that distort the original distribution.

Non-unique time series might pose some challenges for resampling, as we'll have to be accepting of repeated timestamps but also redistribute values over time appropriately. There will need to be some logic to separate out time series with repeating timestamps into multiple unique time series first, resample them all individually, and then recombine. Otherwise, I'm struggling to imagine what the resampling algorithm looks like.

Since we can have non-unique time series in time-grid we need to be very robust in handling zeros and ensure we never error out because of zero values for any parameter during aggregation, as this situation will probably be more common.

We will also have to be able to provide custom fill values for certain named parameters. A good example is cpu-util which could be infilled with 0, off or an interpolate value, each having very different outcomes for the time series that can make the difference between realistic and plain wrong.

Integration issues with some plugins

Some plugins such as Watt-time, Github and others have limits on the time range and interval served by their API. This is a problem if we resample before making requests, but ok if we resample after making requests, assuming the initial time series is in range for the target APIs. We will need some strategy to gracefully handle failing API requests in those cases.

jawache commented 1 month ago

Good points @jmcook1186, I agree we need to think through all these points - regardless of where we end up, we need to document the snapping process and the complex nuanced issues that arise from different choices.

This isn't a perfect solution - for example, cpu-util == 0 can be interpreted as a server that is shut down (i.e. power==0.) or a server that is idle (and using ~50-70% of the power at 100% utilization).

Very interesting point. I agree; if we are sampling, then we are making assumptions, and there are many ways those assumptions can be wrong.

The only people who know for sure what data to use is the plugin or the manifest builder, IF as a framework doesn't know and will need to be configured to act correctly.

Some plugins such as Watt-time, Github and others have limits on the time range and interval served by their API. This is a problem if we resample before making requests, but ok if we resample after making requests, assuming the initial time series is in range for the target APIs. We will need some strategy to gracefully handle failing API requests in those cases.

The above also implies that the responsibility for returning data snapped correctly lies with the plugin/data source itself. We do quite a "rough" sampling, but maybe the WattTime folks have strong opinions that any sampling our framework does wouldn't do the data any good. I also think both those APIs will return data in the grid you want (I think, not dived into them myself).

There are nuances to how we treat duration too that need to be more tightly specified as we refine this ticket. Today it's not completely clear how to use duration correctly, especially when we are using static time series hardcoded into the manifest. For example, say we have daily timesteps covering a month, and on five days in the month we hold zoom calls that last 3 minutes. We might set the duration to 1800 seconds to indicate that the call lasted that long, but since there is no energy used outside of that 30 minutes at all, and our data is at daily resolution, it's acceptable to have duration set to 86400 seconds (1 day), with the advantage that this yields a continuous time series. However, when time-grid is used to resample the time series by default, this will be problematic, as it will often create artefacts that distort the original distribution.

This is a super interesting point, and I hadn't considered it.

There is an expectation, say for a time-series of CPU util, that the CPU util is the average for the duration. There might be lots of variability inside the duration, but by the time we get the data, it's been averaged out over the duration.

When it comes to a time series of the number of visits to a website, this is the number of visits for the duration. If gathering manually via Google Analytics, GA does the sampling/aggregation to get the data in the granularity you want.

Regarding the zoom call, something for which "duration" has a special meaning, we perhaps mean "number of minutes of Zoom call for that duration", so in a day, we might have 160 minutes of zoom calls?

So in a way, we are already getting some form of aggregated data, with their own logic for how to aggregate depending on the type of data. CPU util does an average, visits and zoom-minutes does a sum. So even before we get the data, there is already some timeseries maths going on.

Who's responsible?

All the logic I described above amounts to a lot of complicated code. It doesn't feel great to offload this to plugins.

I agree it's like a hot potato and we're figuring out where it should land.

If we put the responsibility on plugin authors, they just need to do this once properly for each plugin at code time. If we put the responsibility on the end user then they will each need to build up expertise in time snapping to know what to chose for each situation. If we put the responsibility on ourselves then we will need a framework which works for all use cases and my worry is that to do this properly we'll end up building a competitor, or a very poor cousin, to one of the existing time series databases. (Also putting this on ourselves, also means putting this on the end user, since we are are just making ourselves configurable).

So basically what I'm proposing is not to move TimeSyncing to the Framework from a Plugin, but to remove it entirely from the Framework and into Plugins.

In "Grid Mode" all we would do is:

My assumption is that most existing importer plugins either don't care, and already return data pre-snapped to a grid or can be easily converted to do so.

My second assumption is that anyone who needs to do some complex snapping, can either use a library version of TimeSync or perhaps another solution like an in-mem DB or maybe another 3rd party library.

I suppose my third assumption is that snapping to a grid is a rare activity, it's the first plugins job and the rest of the plugins just care about individual observations. The only plugin I know of that cares about snapping post-importing, is WattTime and their new API already does the snapping for you (I think) but I could be wrong.

There is also the more manual use case. If you can't gather data pre-snapped we could create a helper generic csv-ts-importer plugin which imports data, but also snaps it to a grid via config rules and TimeSync. But importantly, the plugin snaps the data to the grid and the framework itself doesn't do the snapping.

jawache commented 1 month ago

There is also the option one just dropping time-series support, it's so complex with so many edge cases.

As in calculating a single value is very easy, spreading that over a time-series seems to add an order of magnitude complexity and i'm not 100% it's worth it.

(I know I've been the one banging my drum about timeseires/grids for a while).

jawache commented 1 month ago

@jmcook1186 I can't stop thinking about what if we just dropped snapping/syncing all together for now, maybe it's a 2.0 feature once we've got many more users.

I write all the above with the full knowledge I've been championing time series all along, but it's hitting me that if we just drop it for now, we make a simpler product that's much easier to adopt.

jawache commented 1 month ago

And sorry, you're getting a stream of consciousness now 😅 One more thing, I just realised what if we treated time as a continuous variable we sampled instead of a discrete variable we bucketed?

I still think this is 2.0, I like dropping time completely for now.

What if instead of snapping data to discrete buckets and aggregating up, what if grouping nodes just sampled data from underlying components?

So for the website SCI example, it's 31 discrete day buckets, but if you want you could just say "what was the carbon at midnight every Monday" and we just take carbon from whatever observations that match.

That would result in the grouping nodes having 4 observations instead of 31, and the grouping node observations would not have a duration, it's just the data from that instance of time, sampled from the sub components.

The totals for each component would aggregate up, so the overall total would be accurate.

The sampled time series for each grouping node would not be accurate, but they would be indicators of how the data is changing underneath.

So we basically stop treating time series as the source of truth, but just as a way to visualise how the underlying data is changing.

Sorry, again I still think this is 2.0 but didn't want to lose this thought.

jmcook1186 commented 1 month ago

@jawache

My gut reaction is that it will be a real shame to drop time series support. I understand that, if you are using IF as an alternative route to a PCF or LCA, then our treatment of time can be simplified a lot. But, we are not just a reporting tool - we are perhaps most useful as a tool for introspection, decision making and mitigation analysis. In particular, we prioritise the SCI as a computable metric because of its appropriateness as a KPI. A useful KPI is addressable in near real time. If we don't handle data that's granular in time, we don't support real time course corrections or the ability to model mitigation strategies into the future.

So here's a counter-proposal with some simplification, better UX for users, and visualizer support but without dropping time series altogether:

Confine time-sync to the observe phase

If you already have static data you can just use it as normal, as you skip observe anyway. This also frees people that are already skilled to do their own preprocessing of their time series externally to IF, e.g. in Pandas or whatever, and then add it to their manifest, as they can just skip observe.

To support cases where people have data that's not naturally time-series have incomplete or uneven time series and they are fine with that, we can revisit aggregate to handle those data with clear caveats (i.e., aggregate just naively does sum/avg/copy of whatever it finds in each component's output array according to each parameter's method) and then aggregates the across components, without trying to do time slot by time slot aggregation - just don't configure time-sync and we'll still do what we can.

If you have patchy or incomplete time series that you do want to synchronize, then we do this in the observe phase as it is part of the data gathering and preprocessing stage of executing the manifest, not the compute phase. In this case, you "set the scene" early with correctly configured time series - even if we have to fall back to fill values because of failing importers etc, guaranteeing that if time sync is toggled on, then the compute phase will receive identically distributed data across all components and aggregate can work as it does today.

This has the additional benefit of syncing fewer parameters as most of the parameters that exist in a computed manifest are intermediates and results from compute plugins. As well as reducing the computational load on time-sync it also reduces the frequency of weird edge cases, because there are fewer parameters to handle.

The only config needed is start, stop and interval. This is applied globally so that by the end of observe we always have synchronized time series and we error out if that's somehow not true. Assuming the time stamps match across all components, it's fine to retroactively add a time index to each observation to support the visualizer if necessary.

Deeper dive into time-sync in observe

There's two ways this can happen.

sync first, then import

One is to do time-sync on the time series skeleton - i.e. before any data exists. it only generates an array of objects containing timestamp and duration. Then, the importers and other observe plugins have to generate input data for each of those moments. The benefit of this is that time-sync is easy and fast and it doesn't have to deal with any edge cases - the plugins have to return the right value given the timestamp and duration.

The downside is that we'd have to handle cases where the plugin can't return valid results for every timestamp - maybe its an API that has a fixed range that doesn't match, or an API failure. Somewhere, we have to code some logic to adjust the API return value to a new time slot, or return a fill value. maybe this happens in the importer plugin - we make it a requirement that an importer can handle any time range or interval, even if it's not a native feature of the API the plugin wraps. So this does add some burden to importer plugins to ensure they can handle most start, stop and interval values gracefully.

This makes some sense, as each plugin will typically only have to impose logic for a small subset of parameters, which is a much smaller lift than writing a universal time-sync protocol that generalizes to everything.

Another significant benefit is that this is robust to any weirdness with plugins migrating in and out of observe and compute (e.g. we have had situations where plugins we would expect to be in observe have had to be moved into compute so they can use data yielded from other plugins in the pipeline).

import first, then sync

The alternative is to run the importers first, then sync the time series after it is enriched with input data. This is effectively what happens now, and I think we are basically lucky that we don't have known examples of use-cases where there are multiple conflicting importers that have to sync within a single component. We have a time range and interval that we pass to each importer and capture the results in our inputs array. The time series can still be gappy, incomplete or uneven at this point, but we pass the data to compute anyway and let each plugin operate over the data it sees. Then later, we pass everything to time-sync to make sense of and redistribute into correct time slots to support aggregate. There's a plausible halfway-house where we run our observe plugins and time sync them before running any compute plugins.

The benefit of this is good UX and DX as it's all abstracted away from plugin builders and users. The downside is we have to maintain a universal time-syncer as we don't know what importers people will build and use.

summary

I could be convinced to drop time series support, but it will be a heavy lift to add it again later on, and I think it's an important differentiator that is worth the work to maintain. I disagree that the time series hasn't surfaced important observations - in your Vienna talk you identified a day with anomalously high netlify builds explaining a peak in our website's SCI score. That's a great example of time series yielding information that can lead to behavioural changes that can immediately reduce SCI!

jawache commented 1 month ago

@jmcook1186 so based on our conversation yesterday, my position now is that the base case use of IF should be one that doesn't need time-syncing/snapping but I agree that time-syncing does add a lot of value, but also a lot of complexity for the end user.

So basically there at least needs to be some steps in the journey, and if we have the attitude that time-syncing is mandatory, the current first step is really high. One way we've made time-syncing mandatory is the visualizer is mostly useless without time-synced data.

The conversation we've been having in this ticket was about making that first step easier, by making time-syncing easier, because time-syncing was seen by me as mandatory to getting any usefulness out of IF. But the other way to make the first step easier is just to say that the first step is without time-syncing, and time-syncing is an advanced feature, a second step in the journey.

It's subtle but it changes the priority of things and gives us a path. For instance it means that we really should make the visualizer more useful even without time-synced data, it also helps us to prioritize the training material, but it also makes me feel ok about time-syncing just being fundamentally a complex activity. It's step 2, more advanced, as a user you're going to have to think through some complex scenarios.

Also, BTW, reading through your last response "sync first, then import" feels to me the same as what I was proposing above - I suspect we're actually much closer aligned and need to just whiteboard it together.

jmcook1186 commented 4 weeks ago

@jawache @zanete

To summarise the outcomes of our conversation yesterday:

Assuming this only yields minor refinements to the existing time-sync plugin, we will maintain the current status quo, with small tweaks to the behaviour to ensure we support simpler manifests.

narekhovhannisyan commented 2 weeks ago

@jmcook1186 it seems the first point is fixed, tried to move the time sync to different positions in the pipeline (pipeline-with-mocks file), got the same result (checked by online yaml diff).

jmcook1186 commented 1 day ago

The current status of this is that we have confirmed time-sync can now run anywhere in a pipeline, but the position can lead to non-negligible differences in the final aggregated SCI value because it changes exactly what data is operated on by the time-sync resampling. We just need to decide what we want to recommend as the accepted best practise. My base case is that we ougt to recommend executing time-sync as the first step in the compute pipeline, but am open to counter-suggestions.

@jawache @narekhovhannisyan @manushak any opinions?

zanete commented 14 hours ago

all that remains is a decision on what we want to recommend as the best practice, due to the side effects. Confirm with @jawache that we should recommend it running first (more https://github.com/Green-Software-Foundation/if/issues/1057#issuecomment-2491413459) Then it becomes a sentence update in the docs