Account for job start/end time not exactly matching forecast data points

tlestang commented 1 year ago

Currently, the average carbon intensity for a job is estimated over a time interval spanning $N + 1$ data points, where $N$ is the ceiling of the ratio between the job duration and the timestep between data points. Moreover, cats approximates the job start time as the time of the first available data point. For carbonintensity.org.uk, this is the previous top of the hour or half hour.

Example: the current time is 12:48 and my job is 194 minutes long. The job start time will be approximated as 12:30, and the job duration will be approximated as $210\,\mathrm{min} = 7 \times 30\,\mathrm{min}$, where $30\,\mathrm{min}$ is the interval between two data points.
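The rounding in this example can be reproduced in a few lines of Python. This is only a sketch of the behaviour described above, not the cats code: the helper name approximate_job, its parameters, and the calendar date are made up for illustration.

```python
import math
from datetime import datetime, timedelta


def approximate_job(now, duration_min, step_min=30):
    """Round the start down to the previous data point and the duration
    up to a whole number of timesteps (hypothetical helper)."""
    step = timedelta(minutes=step_min)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # timedelta // timedelta gives the number of whole steps since midnight
    start = midnight + ((now - midnight) // step) * step
    n_steps = math.ceil(duration_min / step_min)
    return start, n_steps * step_min


# 12:48 on an arbitrary date, 194-minute job
start, duration = approximate_job(datetime(2023, 7, 24, 12, 48), 194)
print(start.time(), duration)  # 12:30:00 210
```

The job is therefore treated as starting 18 minutes early and running 16 minutes longer than it really does.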

The approach followed in this PR is similar to the changes included in #43, but the implementation fits within the current structure by only enhancing the WindowedForecast class. These changes are covered by a new test, test_average_intensity_with_offset.

Implementation: The average carbon intensity over a given time window is estimated by averaging the intensity values at the midpoints between each pair of consecutive data points in the window. The first midpoint (midpt[0]) is overridden to account for the fact that the first point (used to compute midpt[0]) should be located at the job start time. The corresponding intensity value is interpolated between the first and second data points. Similarly, the last midpoint (midpt[-1]) is overridden to account for the fact that the last point should be located at the job end time. The corresponding intensity value is interpolated between the penultimate and last data points in the window.

For a given candidate window, the value at the job start is interpolated between the available data points directly before and after the start time. Likewise, the value at the job end is interpolated between the data points directly before and after the end time.

edit 2023-07-24 13:20: Changed the implementation to handle short jobs correctly.

tlestang commented 1 year ago

Looks like this only works with Python 3.10+, because I'm relying on being able to pass a sorting key to bisect. It's probably worth working around this to be able to support 3.9 as well.
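One possible workaround, sketched below, is to bisect on a precomputed list of timestamps instead of passing key=, since the key parameter of the bisect functions only exists from Python 3.10 onwards. The data list and the index_after helper are hypothetical, and nothing here assumes which bisect function the PR actually calls.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

# Hypothetical forecast: (timestamp, intensity) every 30 minutes from 12:30
first = datetime(2023, 7, 24, 12, 30)
data = [(first + i * timedelta(minutes=30), 100.0 + 10 * i) for i in range(4)]

# Python 3.10+ only: bisect_right(data, t, key=lambda point: point[0])
# Python 3.9 compatible: extract the sort keys once, then bisect on them
timestamps = [point[0] for point in data]


def index_after(t):
    """Index of the first data point located strictly after time t."""
    return bisect_right(timestamps, t)


i = index_after(datetime(2023, 7, 24, 12, 48))
print(data[i][0].time())  # 13:00:00
```

The extra list costs one pass over the data, which is negligible for a forecast of a few dozen points.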

tlestang commented 1 year ago

what happens if somebody submits a task that lasts less than 30 mins?

Well that's a very good question, and I don't think this case was handled correctly.

In response to this, I changed the implementation slightly so it 'just works' for short jobs; in other words, it shouldn't be a corner case anymore. I'm adding a new test with a 6-minute job to check that this is correct.

Details: instead of computing the mid-points over the (over-estimated) window and then fixing them, it now builds the window, interpolating both ends, and computes the midpoints as if nothing happened. Besides handling short jobs naturally, I think this actually makes the code a lot easier to understand.

Example: You run cats at 12:48 for a 6-minute job. The second candidate window is 13:18 to 13:24, located between data points at 13:00 (data[1]) and 13:30 (data[2]). Both carbon intensity values at 13:18 and 13:24 are interpolated by drawing a straight line between the data points at 13:00 and 13:30.
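As a rough sketch of this example in Python: the interpolate helper, the calendar date, and the two intensity values (100 and 130) are invented for illustration and are not taken from the PR or from real forecast data.

```python
from datetime import datetime, timedelta


def interpolate(t, p0, p1):
    """Linear interpolation at time t between two data points,
    each given as a (datetime, intensity) tuple."""
    (t0, v0), (t1, v1) = p0, p1
    frac = (t - t0) / (t1 - t0)  # timedelta / timedelta gives a float
    return v0 + (v1 - v0) * frac


# Hypothetical intensity values at 13:00 (data[1]) and 13:30 (data[2])
p1 = (datetime(2023, 7, 24, 13, 0), 100.0)
p2 = (datetime(2023, 7, 24, 13, 30), 130.0)

start = datetime(2023, 7, 24, 13, 18)
end = start + timedelta(minutes=6)

ci_start = interpolate(start, p1, p2)  # about 118
ci_end = interpolate(end, p1, p2)      # about 124
# The window holds only these two points, so there is a single midpoint
average = 0.5 * (ci_start + ci_end)    # about 121
```

With both ends interpolated, a short job needs no special handling: the window simply contains two points instead of several.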

I guess we could start fitting some smooth function to the CI data

I think linear interpolation is enough. If you look at the carbon intensity timeseries, it's already very smooth. In other words the signal doesn't exhibit large variations within 30 min. I guess forecast providers provide data points at an interval ensuring the timeseries is well resolved.

colinsauze commented 1 year ago

Example: You run cats at 12:48 for a 6-minute job. The second candidate window is 13:18 to 13:24, located between data points at 13:00 (data[1]) and 13:30 (data[2]). Both carbon intensity values at 13:18 and 13:24 are interpolated by drawing a straight line between the data points at 13:00 and 13:30.

I guess we could start fitting some smooth function to the CI data

I think linear interpolation is enough. If you look at the carbon intensity timeseries, it's already very smooth. In other words the signal doesn't exhibit large variations within 30 min. I guess forecast providers provide data points at an interval ensuring the timeseries is well resolved.

I'd agree that linear interpolation is good enough here. Getting things within the right 30-minute period will achieve what we need in terms of emissions minimisation.

Part of me is tempted to say the best strategy for sub-30-minute jobs is to randomly determine when they run within the lowest-intensity 30-minute window available. This way, if many people were launching small jobs, they wouldn't all try to run at the same time (even when distributed across multiple unrelated systems).
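A minimal sketch of that idea, assuming the best 30-minute window has already been found; random_start and its parameters are hypothetical names, not part of cats.

```python
import random
from datetime import datetime, timedelta


def random_start(window_start, window_len, job_len, rng=random):
    """Pick a start time uniformly at random such that the whole job
    still fits inside the chosen window (hypothetical helper)."""
    slack = (window_len - job_len).total_seconds()
    if slack < 0:
        raise ValueError("job is longer than the window")
    return window_start + timedelta(seconds=rng.uniform(0, slack))


# 6-minute job inside a 30-minute window starting at 13:00
window_start = datetime(2023, 7, 24, 13, 0)
start = random_start(window_start, timedelta(minutes=30), timedelta(minutes=6))
```

Each submission then lands somewhere between 13:00 and 13:24, so independent users do not all start at the same instant.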

tlestang commented 1 year ago

Part of me is tempted to say the best strategy for sub-30-minute jobs is to randomly determine when they run within the lowest-intensity 30-minute window available. This way, if many people were launching small jobs, they wouldn't all try to run at the same time (even when distributed across multiple unrelated systems).

Smart. I don't really want to add more features to this PR but that's probably a good starting point for a new one. Or an issue.

colinsauze commented 1 year ago

Yes, it's definitely something for a different PR/issue and not critical for getting a minimum viable product ready.

tlestang commented 1 year ago

Thanks for testing, @Llannelongue, because I think you've exposed a mistake! In __getitem__:

        return CarbonIntensityAverageEstimate(
            start=window_start,
            end=window_end,
            value=sum(midpt) / (self.ndata - 1),
        )

Dividing by self.ndata - 1 assumes all midpoints are regularly spaced. This is true for the inner points but not for the first and last ones, which are based on interpolated data points. I believe I've also made this assumption in the tests... :face_with_head_bandage:
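To make the difference concrete, here is a small sketch comparing the equal-weight mean with one weighted by interval width (a trapezoidal rule). The function names and the window values are hypothetical, and the weighted version is only one way to account for the irregular spacing, not necessarily what ends up in the PR.

```python
def naive_average(points):
    """Mean of the midpoint values, all weighted equally."""
    mid = [0.5 * (a[1] + b[1]) for a, b in zip(points, points[1:])]
    return sum(mid) / (len(points) - 1)


def weighted_average(points):
    """Midpoint values weighted by the width of their interval,
    i.e. the trapezoidal rule divided by the window length."""
    area = sum(
        0.5 * (a[1] + b[1]) * (b[0] - a[0])
        for a, b in zip(points, points[1:])
    )
    return area / (points[-1][0] - points[0][0])


# Hypothetical window as (minutes, intensity). The first and last points
# stand for values interpolated at the job start and end, so the outer
# intervals (15 and 25 minutes) are shorter than the 30-minute inner one.
window = [(15, 50.0), (30, 60.0), (60, 60.0), (85, 35.0)]

naive = naive_average(window)        # about 54.17
weighted = weighted_average(window)  # about 54.46
```

The two results agree only when every interval has the same width, which is not the case once the end points are interpolated.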

tlestang commented 1 year ago

I changed the average calculation to account for the difference in weights. @Llannelongue I added a test case based on your test above.

I believe the integral between 9:15 and 10:25 should be 49.9... (with the left bound 45 and right bound 30.83...)

Agreed. I think it is now ;)

tlestang commented 1 year ago

Thank you @andreww @colinsauze and @Llannelongue for reviewing this. It really helped make this better!

GreenScheduler / cats

Account for job start/end time not exactly matching forecast data points #54