Resampling in `HistoricalSignal.forecast` off by 1 if outside available forecast data

Impelon commented 3 months ago

I'm back again with some more (from a user perspective) unexpected behaviour :relaxed:

Take any HistoricalSignal that includes separate forecast information, e.g. one of the solcast datasets.

>>> start = datetime.datetime.now()
>>> s = vessim.signal.HistoricalSignal.from_dataset("solcast2022_germany", params={"start_time": start})
>>> # The following works fine, it returns as much of the forecast as available, which is less than 70 minutes, but that is ok.
>>> s.forecast(start, start + datetime.timedelta(minutes=70), column="Berlin")
>>> # The following raises an exception even before checking for resample_method.
>>> s.forecast(start, start + datetime.timedelta(minutes=70), frequency="10min", column="Berlin")
Traceback (most recent call last):
    [...]
    if not np.array_equal(new_times, times[new_times_indices]) and resample_method != "bfill":
IndexError: index 13 is out of bounds for axis 0 with size 13
>>> # Same if resample_method is specified, naturally.
>>> s.forecast(start, start + datetime.timedelta(minutes=70), frequency="10min", resample_method="bfill", column="Berlin")
Traceback (most recent call last):
    [...]
    if not np.array_equal(new_times, times[new_times_indices]) and resample_method != "bfill":
IndexError: index 13 is out of bounds for axis 0 with size 13

The IndexError thus happens if a forecast of length longer than the one available is requested and frequency is specified. (Ok, the frequency also needs to be such that the resampled forecast would exceed the original in length - i.e. a frequency of 1h would work in the above, because it will just output one data point for the next hour and skip the next. But those frequency values that work are rather uncommon.) This can probably also happen without separate forecast data, just using the actual signal. It is just easier to trigger it here, as the length of the forecasts available is usually more limited.

If one looks into the new_times_indices, one notices that this is because the highest indices will be off by 1, because np.searchsorted rightly says the elements would be at an index later than the highest element available. For example

# For
x.forecast(start, start + datetime.timedelta(minutes=7000), frequency="1h", resample_method="bfill", column="Berlin")
# one gets `new_times_indices` of
[11 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13]

One could easily mask the confusing IndexError, for instance by dropping any indices that are above the length of the times array (here: 13).

kilianp14 commented 3 months ago

For the examples you provided, there actually are no wrong results (except for the fact that the error message are bad/confusing)

s.forecast(start, start + datetime.timedelta(minutes=70), column="Berlin") works as described in the docstring: "If frequency is not specified, all existing data in the window will be returned."
s.forecast(start, start + datetime.timedelta(minutes=70), frequency="10min", column="Berlin") should result in an error because there is no available data at start + 70 minutes (only 60min forecast)
s.forecast(start, start + datetime.timedelta(minutes=70), frequency="10min", resample_method="bfill", column="Berlin")should also result in an error because the missing data point at start + 70min can also not be determined using back-fill, as there are no data points after start + 70min

There is, however, unexpected behavior that becomes visible, when looking at a function call like: s.forecast(start, start + datetime.timedelta(minutes=70), frequency="10min", resample_method="ffill", column="Berlin") This should not result in an error, as the missing point can be forward-filled using available data. Here, the bad handling of the IndexError becomes a problem.

I will try to fix this in a few days. Thank you for notifying :)

Impelon commented 3 months ago

s.forecast(start, start + datetime.timedelta(minutes=70), column="Berlin") works as described in the docstring: "If frequency is not specified, all existing data in the window will be returned."

Agreed.

s.forecast(start, start + datetime.timedelta(minutes=70), frequency="10min", column="Berlin") should result in an error because there is no available data at start + 70 minutes (only 60min forecast)

As a user I'd find it a bit confusing that I would get an error when I specify frequency, but it runs fine if I don't. At the very least the docstring is not clear on this matter. Other than the missing documentation regarding the inconsistency with the normal behaviour, I think it would be fine to return a custom error as well, yes.

So if I read correctly that your preferred solution would be to raise a custom Exception in the cases that IndexError is raised now (except when resample_method="ffil"). As I said, I think that is an ok solution, too, but then in my opinion you should also add a way to programmatically get the length of the available forecast. Either through a new method that can be called with (start_time, end_time) before making a call to forecast, or with a custom Exception class that includes a max_end_time attribute or similar, so that if a call to forecast fails, it can be repeated with the correct end_time. Otherwise, it would be impossible to guarantee correct usage of HistoricalSignal.forecast in code, having no access to the underlying data. (I.e. if I built a component on top of a user-specified HistoricalSignal, I won't be able to tell what end_time is safe so that forecast does not fail.)

I will try to fix this in a few days.

Thank you :) Just as a side note: This issue is not urgent for me. I'm still using vessim 0.6.0, so I've built a workaround for this anyway.

dos-group / vessim

Resampling in `HistoricalSignal.forecast` off by 1 if outside available forecast data #229