SolarArbiter / solarforecastarbiter-core

Core data gathering, validation, processing, and reporting package for the Solar Forecast Arbiter
https://solarforecastarbiter-core.readthedocs.io

probabilistic persistence forecasts are broken #639

Closed: wholmgren closed this issue 2 years ago

wholmgren commented 3 years ago

The hour ahead probabilistic persistence forecasts are not working correctly. The graphs below show that the forecast is the same for all constant values, and this extends through the entire forecast period:

[plot: hour-ahead probabilistic forecast, identical for all constant values]

[plot: constant value = 0%]

[plot: constant value = 100%]

We added the probabilistic persistence forecast generation function in #490. I suspect it worked at that time. If something broke it in the meantime, look at #497, #529, and #617.

wholmgren commented 3 years ago

The issue is that these reference forecasts ultimately use persistence.persistence_probabilistic with a single resampled data point. The function is passed 1 hour of 1 minute data, which is immediately resampled into a single 1 hour data point. For each constant_value, we call np.percentile on that single point, so we get the same number again and again.
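Here's a minimal sketch of the failure mode (synthetic data, not the actual run_persistence workflow):

import numpy as np
import pandas as pd

# one hour of 1 minute "observations"
index = pd.date_range('2021-01-05 12:00', periods=60, freq='1min')
obs = pd.Series(np.random.rand(60), index=index)

# resampling to the 1 hour forecast interval leaves a single point
resampled = obs.resample('1h').mean()
print(len(resampled))  # 1

# np.percentile of one value returns that value for every percentile,
# so all constant_values get the same forecast
for cv in (0, 25, 50, 75, 100):
    print(cv, np.percentile(resampled, cv))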

Instead, I think we want to use persistence_probabilistic_timeofday and provide 20+ days of data to the function. I believe this can be done by hacking utils.get_data_start_end with something like

# new branch inside utils.get_data_start_end
elif isinstance(forecast, datamodel.ProbabilisticForecast):
    data_start, data_end = _prob_timeofday_start_end(
        issue_time, forecast)
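For concreteness, a hypothetical sketch of that helper (the name, signature, and 20 day window are assumptions from the suggestion above, not existing code):

import pandas as pd

def _prob_timeofday_start_end(issue_time, forecast, history_days=20):
    # hypothetical: end the data window at the issue time and reach back
    # history_days so persistence_probabilistic_timeofday can build a CDF
    # from samples at the same time of day
    data_end = issue_time
    data_start = data_end - pd.Timedelta(days=history_days)
    return data_start, data_end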

I don't see any discussion of that in #490, so maybe just an oversight because we had a lot going on then. @alorenzo175 do you recall any choice to use persistence_probabilistic over persistence_probabilistic_timeofday?

Also note the reference probabilistic persistence forecasts' extra_parameters wrongly contains "index_persistence": true.

Finally, there's some nuance when we have mismatched observation and forecast interval labels. Say observation.interval_label='ending' and forecast.interval_label='beginning'. The workflow in run_persistence leads to observation values from, say, 12:00:00 through 13:00:00 being passed to the resample method. So now you have two resampled points (one from 12:00 and one from 12:01-13:00) passed to np.percentile. This tripped me up when I was trying to understand why my test forecasts appeared to work while the production forecasts did not: switching my test forecasts to interval_label='ending' for consistency with the obs led to reproducible behavior. The non-probabilistic persistence functions also end up with an extra bit of data in their results, but there it's a minor rather than major error. I don't have a good idea for what to do about that at this time and it can probably be dealt with separately.
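A rough reproduction of that mismatch (synthetic data; the closed='right' resampling stands in for how 'ending'-labeled observations are binned):

import numpy as np
import pandas as pd

# 61 values stamped 12:00:00 through 13:00:00 inclusive
index = pd.date_range('2021-01-05 12:00', '2021-01-05 13:00', freq='1min')
obs = pd.Series(np.random.rand(len(index)), index=index)

# right-closed bins: (11:00, 12:00] holds only the 12:00 point and
# (12:00, 13:00] holds the 12:01-13:00 points
resampled = obs.resample('1h', closed='right', label='right').mean()
print(len(resampled))  # 2 -> np.percentile now sees two points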

alorenzo175 commented 3 years ago

Instead, I think we want to use persistence_probabilistic_timeofday and provide 20+ days of data to the function

I'm pretty sure I chose persistence_probabilistic to avoid having to fetch 20+ days of data vs a single day, and persistence_probabilistic may be a better "reference" for arbiter purposes.

I think the immediate resampling came about from https://github.com/SolarArbiter/solarforecastarbiter-core/pull/434#issuecomment-637776892. While I agree that the statistics of 1 minute data and 1 hour data are different, I'm not sure I agree that resampling to the forecast interval makes sense. Any persistence forecast will depend on how the data is measured, e.g. 1 hour standard persistence as the mean of 5 minute instantaneous data vs 15 minute average data.

wholmgren commented 3 years ago

Doubleday et al. section 4.3 has a nice summary of the problem plus reasonable recommendations. I don't see a non-technical reason that the arbiter's purposes should differ from these recommendations (I'm not aware of any end users that implement their own persistence ensembles).

If it's a problem to fetch 20 days of data, could we make new Observations with 1 hour intervals and base the persistence forecasts on those? Since the operational practice is to process a day of data at a time, we'd actually end up with ~3x smaller fetches (480 hourly values covering 20 days vs 1440 one minute values covering a single day).

Attachment: Doubleday et al. 2020, "Benchmark probabilistic solar forecasts: Characteristics and recommendations" (PDF)

alorenzo175 commented 3 years ago

From my reading of Doubleday et al. 4.3, we should remove the resampling from persistence_probabilistic (primarily for intra-hour PeEn).

The reference prob. pers. forecasts are currently described as hour-ahead, which I think somewhat fits what Doubleday et al. describe as an intra-hourly PeEn and should use persistence_probabilistic without resampling, but the problem is that the 1 hour interval doesn't really make sense.

I now agree we should probably have a reference PeEn using persistence_probabilistic_timeofday with 1 hour interval and 24 h run length. Then I'm less concerned with fetching the previous 20 days of data since you only need to do it once per day vs every hour. We should delete the current "hour-ahead" reference prob. persistence forecasts and find a new suitable name. I would also follow Doubleday et al. and ignore any missing data. I guess lead_time_to_start = 0?
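To pin down that proposal, the parameters would look something like this (a sketch of the values discussed above, not an implemented reference forecast; the issue time of day is an assumption):

import datetime
import pandas as pd

# day-ahead time-of-day PeEn, per the proposal in this comment
proposed = dict(
    interval_length=pd.Timedelta('1h'),
    run_length=pd.Timedelta('24h'),
    lead_time_to_start=pd.Timedelta('0h'),
    issue_time_of_day=datetime.time(0, 0),  # assumed: issued once per day
)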

wholmgren commented 3 years ago

Hm, I'm worried I'm missing something...

Doubleday et al. Table 2 says "intra-hourly" forecasts had 5 minute intervals, 5 minute lead time, and 1 hour run length. I didn't see it in the paper, but the preprocess_SURFRAD.R file makes it clear that the raw SURFRAD .dat files with 1 minute observations are resampled into 5 minute averages for the intra-hour forecasts. Then section 4.3 describes how the last 2 hours of 5 minute averages of CSI "observations" (n=24) form the forecast CDF of CSI for all intervals within the hour.
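In code, that intra-hourly PeEn is roughly (a sketch with synthetic CSI values, not the paper's implementation):

import numpy as np
import pandas as pd

# last 2 hours of 5 minute average CSI "observations" (n=24)
index = pd.date_range('2021-01-05 10:05', periods=24, freq='5min')
csi = pd.Series(np.clip(np.random.normal(0.8, 0.1, 24), 0, 1.2), index=index)

# the empirical percentiles of those 24 samples form the forecast CDF
# for every 5 minute interval in the next hour
constant_values = [5, 25, 50, 75, 95]
cdf = {cv: np.percentile(csi, cv) for cv in constant_values}
print(cdf)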

For the "hourly" forecasts, Doubleday et al. first resamples the 1 minute observations into hourly averages. These forecasts have a 1 hour lead time and 6 hour run length, though for the "time of day" persistence, the details of the lead time and run length are unimportant (at least within a day).

So I think that is all consistent with the idea that persistence_probabilistic should resample the input data to match the forecast interval length. But I don't really follow your 2nd paragraph so maybe I misunderstood.

I now agree we should probably have a reference PeEn using persistence_probabilistic_timeofday with 1 hour interval and 24 h run length... I guess lead_time_to_start = 0?

I think this is a good way to describe it. Wouldn't it also be accurate to describe it as lead_time_to_start = 24 hr, run_length = 1 hr? If SURFRAD data is only fetched once per day, and we check if new forecasts can be made based on the results of API.get_observation_time_range, then ultimately I don't think there's an operational difference. For better or worse, there would be a difference if SURFRAD data could be made available more frequently.

I would also follow Doubleday et al. and ignore any missing data.

I agree. I'm not sure about possible complications for a day or two of expected data lag.

alorenzo175 commented 3 years ago

Ah, I missed that the data was not initially in 5 minute intervals, so I assumed there was no resampling. Without the code, I also didn't think much about resampling for "time of day" vs using the value at the hour.

I think this is a good way to describe it. Wouldn't it also be accurate to describe it as lead_time_to_start = 24 hr, run_length = 1 hr?

I think so, but we've defined intraday as run_length < 1d, and I wouldn't call this an intraday forecast. Unfortunately you really can't understand the persistence forecasts without looking at how run_persistence behaves. What a mess.

I agree. I'm not sure about possible complications for a day or two of expected data lag.

I think the persistence forecasts are only made once data is available at the end of the period required for persisting. So if data is two days behind today (say an endpoint of 2021-01-05T12:00), the forecasts would only have been made up to 2021-01-06T12:00.
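In other words (just the arithmetic, assuming the 24 h time-of-day window):

import pandas as pd

# data available through 2021-01-05T12:00; persisting time-of-day values
# forward one day means forecasts extend one day past the data endpoint
data_end = pd.Timestamp('2021-01-05T12:00')
max_forecast_end = data_end + pd.Timedelta('1d')
print(max_forecast_end)  # 2021-01-06 12:00:00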

wholmgren commented 3 years ago

I don't see any issue with changing the definition of intraday to forecast.lead_time_to_start < pd.Timedelta('1d'). Might make a similar change to create_persistence_forecasts or just use the reference_forecasts.utils._is_intraday function... or make that function public in utils (what a mess).

Of course, it doesn't solve the problem of easily understanding how the reference persistence forecasts work, but I think it would be slightly more clear.
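A sketch of that proposed check (the current function classifies on run_length; this is the lead-time version suggested above):

import pandas as pd

def _is_intraday(forecast):
    # proposed: intraday means the forecast starts within a day of issue
    return forecast.lead_time_to_start < pd.Timedelta('1d')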

At risk of scope creep... what if we add more extra parameters to the Forecast objects that make the processing more explicit? We already have "index_persistence": true|false. I think something like "model" with values persistence_probabilistic, persistence_probabilistic_timeofday, persistence_scalar_index, persistence_scalar, or persistence_interval. Easy enough for the new prob forecasts, but I'm guessing that requires a database migration for the rest of the persistence forecasts. run_nwp includes a model argument, so I think we'd do the same for run_persistence.
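For example, the extra parameters might then look like this (hypothetical; per above, only "index_persistence" exists today and "model" is the proposed addition):

# hypothetical extra_parameters with an explicit "model" key
extra_parameters = {
    "index_persistence": False,
    "model": "persistence_probabilistic_timeofday",
}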

Ugh now I also want to remove persistence_ from all of those function names in persistence.py.