PyPSA / atlite

Atlite: A Lightweight Python Package for Calculating Renewable Power Potentials and Time Series
https://atlite.readthedocs.io
255 stars 87 forks source link

ERA5 Solar Position Time Shift Broken for Certain Time Spans #256

Closed zoltanmaric closed 1 year ago

zoltanmaric commented 1 year ago

Description

When building an ERA5 cutout for a period that doesn't cover entire months (e.g. for time=slice("2022-06-30 00:00", "2022-07-01 23:00")), the solar position is shifted by half a day instead of half an hour - effectively inverting the solar position (maximum at local midnight, minimum at local noon).

Expected Behavior

Solar position should be shifted by 30 minutes.

Actual Behavior

Solar position is shifted by 12 hours.

Example

import atlite
import matplotlib.pyplot as plt
from atlite.pv import solar_position

cutout = atlite.Cutout("/tmp/era5", module="era5", bounds=(-4, 56, 1.5, 62), time=slice("2022-06-30", "2022-07-01"))
cutout.prepare()

solar_altitude = solar_position.SolarPosition(cutout.data).sel(x=0, y=56).altitude.load()
solar_altitude.plot()
plt.savefig("/tmp/solar_altitude_wrong.png")

image

Your Environment

Related to #158 and #199

zoltanmaric commented 1 year ago

The culprit is this attempt to infer the frequency of the time index: https://github.com/PyPSA/atlite/blob/a0bd4b06c51269b87c90252bdf6951556f645dac/atlite/datasets/era5.py#L157-L174

An ERA5 CDS request spanning June 30 and July 1st looks like this:

{
  'product': 'reanalysis-era5-single-levels',
  'year': '2022',
  'month': [6, 7],
  'day': [30, 1],
  'time': ['00:00', '01:00', '02:00', '03:00', '04:00', '05:00', '06:00', '07:00', '08:00', '09:00', '10:00', '11:00', '12:00', '13:00', '14:00', '15:00', '16:00', '17:00', '18:00', '19:00', '20:00', '21:00', '22:00', '23:00']
}

which returns results covering June 1st, June 30th, July 1st, and July 30th:

Time index contents

From [`get_data_influx`](https://github.com/PyPSA/atlite/pull/199/files#diff-0558b068b3e760408c41c929b3de7a0d12939032de82f7de3a6aa7120733cea7R171) ```python print(ds.time) ``` Output: ``` array(['2022-06-01T00:00:00.000000000', '2022-06-01T01:00:00.000000000', '2022-06-01T02:00:00.000000000', '2022-06-01T03:00:00.000000000', '2022-06-01T04:00:00.000000000', '2022-06-01T05:00:00.000000000', '2022-06-01T06:00:00.000000000', '2022-06-01T07:00:00.000000000', '2022-06-01T08:00:00.000000000', '2022-06-01T09:00:00.000000000', '2022-06-01T10:00:00.000000000', '2022-06-01T11:00:00.000000000', '2022-06-01T12:00:00.000000000', '2022-06-01T13:00:00.000000000', '2022-06-01T14:00:00.000000000', '2022-06-01T15:00:00.000000000', '2022-06-01T16:00:00.000000000', '2022-06-01T17:00:00.000000000', '2022-06-01T18:00:00.000000000', '2022-06-01T19:00:00.000000000', '2022-06-01T20:00:00.000000000', '2022-06-01T21:00:00.000000000', '2022-06-01T22:00:00.000000000', '2022-06-01T23:00:00.000000000', '2022-06-30T00:00:00.000000000', '2022-06-30T01:00:00.000000000', '2022-06-30T02:00:00.000000000', '2022-06-30T03:00:00.000000000', '2022-06-30T04:00:00.000000000', '2022-06-30T05:00:00.000000000', '2022-06-30T06:00:00.000000000', '2022-06-30T07:00:00.000000000', '2022-06-30T08:00:00.000000000', '2022-06-30T09:00:00.000000000', '2022-06-30T10:00:00.000000000', '2022-06-30T11:00:00.000000000', '2022-06-30T12:00:00.000000000', '2022-06-30T13:00:00.000000000', '2022-06-30T14:00:00.000000000', '2022-06-30T15:00:00.000000000', '2022-06-30T16:00:00.000000000', '2022-06-30T17:00:00.000000000', '2022-06-30T18:00:00.000000000', '2022-06-30T19:00:00.000000000', '2022-06-30T20:00:00.000000000', '2022-06-30T21:00:00.000000000', '2022-06-30T22:00:00.000000000', '2022-06-30T23:00:00.000000000', '2022-07-01T00:00:00.000000000', '2022-07-01T01:00:00.000000000', '2022-07-01T02:00:00.000000000', '2022-07-01T03:00:00.000000000', '2022-07-01T04:00:00.000000000', '2022-07-01T05:00:00.000000000', '2022-07-01T06:00:00.000000000', '2022-07-01T07:00:00.000000000', '2022-07-01T08:00:00.000000000', '2022-07-01T09:00:00.000000000', '2022-07-01T10:00:00.000000000', '2022-07-01T11:00:00.000000000', '2022-07-01T12:00:00.000000000', '2022-07-01T13:00:00.000000000', '2022-07-01T14:00:00.000000000', '2022-07-01T15:00:00.000000000', '2022-07-01T16:00:00.000000000', '2022-07-01T17:00:00.000000000', '2022-07-01T18:00:00.000000000', '2022-07-01T19:00:00.000000000', '2022-07-01T20:00:00.000000000', '2022-07-01T21:00:00.000000000', '2022-07-01T22:00:00.000000000', '2022-07-01T23:00:00.000000000', '2022-07-30T00:00:00.000000000', '2022-07-30T01:00:00.000000000', '2022-07-30T02:00:00.000000000', '2022-07-30T03:00:00.000000000', '2022-07-30T04:00:00.000000000', '2022-07-30T05:00:00.000000000', '2022-07-30T06:00:00.000000000', '2022-07-30T07:00:00.000000000', '2022-07-30T08:00:00.000000000', '2022-07-30T09:00:00.000000000', '2022-07-30T10:00:00.000000000', '2022-07-30T11:00:00.000000000', '2022-07-30T12:00:00.000000000', '2022-07-30T13:00:00.000000000', '2022-07-30T14:00:00.000000000', '2022-07-30T15:00:00.000000000', '2022-07-30T16:00:00.000000000', '2022-07-30T17:00:00.000000000', '2022-07-30T18:00:00.000000000', '2022-07-30T19:00:00.000000000', '2022-07-30T20:00:00.000000000', '2022-07-30T21:00:00.000000000', '2022-07-30T22:00:00.000000000', '2022-07-30T23:00:00.000000000'], dtype='datetime64[ns]') Coordinates: * time (time) datetime64[ns] 2022-06-01 ... 2022-07-30T23:00:00 Attributes: long_name: time ```

Because the time index is not continuous, pd.infer_freq returns None, resulting in a time shift of minus 12 hours.

zoltanmaric commented 1 year ago

I think the fix should be as simple as just always setting the time shift to 30 minutes (proposed diff).

The get_data function of era5.py already has the reanalysis-era5-single-levels hard-coded as the product: https://github.com/PyPSA/atlite/blob/a0bd4b06c51269b87c90252bdf6951556f645dac/atlite/datasets/era5.py#L352-L359

To my understanding, this product's resolution is always hourly (see docs), so there's no reason to attempt to infer a different frequency.

image

What do you think @euronion @FabianHofmann ?

zoltanmaric commented 1 year ago

Comparing ERA5 Values Requested at Different Time Samplings

Here's a comparison of values received from ERA5 at 1h, 2h, 3h, 4h, and 6h sampling:

image
Code to Generate the Above Table

```python import xarray import atlite.datasets.era5 as era5 from dask.utils import SerializableLock import functools # Create lists of hours like [00:00, 01:00, 02:00, ...], sampled # every hour, every 2 hours, etc. time_sampling = {} for rate in [1, 2, 3, 4, 6]: time_sampling[rate] = [f"{hour:02}:00" for hour in range(0, 24, rate)] retrieval_params = { 'product': 'reanalysis-era5-single-levels', 'area': [57.0, -0.5, 56.0, 0.5], 'chunks': {'time': 100}, 'grid': [0.25, 0.25], 'tmpdir': '/tmp', 'lock': SerializableLock(), 'year': '2013', 'month': [1], 'day': [1] } param_sets = {hour: {**retrieval_params, **{'time': time}} for hour, time in time_sampling.items()} def retrieve_data_for_single_raster(params: dict) -> "xarray.DataSet": variable = [ "surface_net_solar_radiation", "surface_solar_radiation_downwards", "toa_incident_solar_radiation", "total_sky_direct_solar_radiation_at_surface", ] ds = era5.retrieve_data(variable=variable, **params) return ds.sel(latitude=56, longitude=0).load().to_dataframe()[["ssr", "ssrd", "tisr", "fdir"]] # Retrieve ERA5 data for each different time sampling raw_ds = {hour: retrieve_data_for_single_raster(params) for hour, params in param_sets.items()} def join_dfs(left_sampling_and_df, right_sampling_and_df): left_sampling, left_df = left_sampling_and_df right_sampling, right_df = right_sampling_and_df suffix = f"_{right_sampling}h" return left_sampling, left_df.join(right_df, on='time', how='left', rsuffix=suffix) # Merge all data into a single dataframe to show differences in values # for each hour next to each other _, merged = functools.reduce(join_dfs, raw_ds.items()) samplings_compared = merged.sort_index(axis='columns')\ .query('time.dt.hour > 7 and time.dt.hour < 18')\ .astype(float)\ .round(decimals=2) # Remove date part samplings_compared.index = samplings_compared.index.time samplings_compared.to_csv('/tmp/samplings_compared.csv') ```

While this does show that the time shift is not proportional to the sampling, there's an additional weird twist to it. The values for 2h, 3h, 4h, and 6h sampling seem to be equal - but the values for 1h sampling are slightly different (by less than 1% at noon).

It would be interesting to find out why that is, but as far as this issue is concerned - I think it's still more appropriate to always shift the time by 30 minutes, than by half of the sampling interval.

FabianHofmann commented 1 year ago

@zoltanmaric I am impressed how fast you are catching up with the atlite code! Definitely makes sense!

FabianHofmann commented 1 year ago

Interesting, but I would not bother with the value differences for different time resolutions, since it is related to ERA5 internals only.