SolarArbiter / solarforecastarbiter-core

Core data gathering, validation, processing, and reporting package for the Solar Forecast Arbiter
https://solarforecastarbiter-core.readthedocs.io
MIT License

option to fill missing forecasts with other values #423

Closed wholmgren closed 4 years ago

wholmgren commented 4 years ago

We need a way to penalize missing forecasts to fully support trials and other operational evaluations.

One implementation could look like this:

  1. Reports get a parameter like missing_forecasts = 'ignore'|'clearsky'|0. Needs to be exposed on dashboard too.
  2. missing_forecasts is passed to preprocessing.process_forecast_observations, along with the report start and end.
  3. Before the call to resample_and_align, if missing_forecasts != 'ignore': fx_series_filled = fill_values(forecast, fx_series, missing_forecasts, start, end). The forecast metadata, start, and end determine the full DatetimeIndex. If we don't want to support fill with clearsky then this could just be a call to reindex and fillna. Repeat for the reference forecast. Store some information about the number of filled points in the validation_results dictionary.
  4. resample_and_align remains unchanged (I think) and excludes forecast values for which observations are missing/flagged.
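Step 3 above could be sketched with pandas reindex/fillna. This is only an illustration of the static-value case; the `fill_values` name and the `freq` argument are assumptions, not the final API:

```python
import pandas as pd

def fill_values(fx_series, missing_forecasts, start, end, freq):
    """Reindex fx_series onto the full report index and fill gaps.

    missing_forecasts: a static numeric fill value (typically 0).
    Returns the filled series and the number of filled points.
    """
    # Full index implied by the report start/end and forecast interval
    full_index = pd.date_range(start=start, end=end, freq=freq)
    reindexed = fx_series.reindex(full_index)
    n_filled = int(reindexed.isna().sum())
    filled = reindexed.fillna(missing_forecasts)
    return filled, n_filled

# Forecast missing the 01:00 and 03:00 points
fx = pd.Series(
    [10.0, 12.0],
    index=pd.to_datetime(['2020-01-01 00:00', '2020-01-01 02:00']))
filled, n_filled = fill_values(
    fx, 0.0, '2020-01-01 00:00', '2020-01-01 03:00', '1h')
```

The count of filled points is what would land in the validation/preprocessing results mentioned in step 3.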

This seems almost too doable, so I'm afraid I'm missing something.

wholmgren commented 4 years ago

Probabilistic and event forecasts will need some kind of fill too.

For event forecasts, I suggest fill with all 0/False or 1/True (depending on the data type where the fill occurs). Which value to choose might depend on the situation. For rare events you might want to fill with 1 for a bigger missing penalty.
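For a boolean event series, that fill is essentially a one-liner (a sketch; whether to fill with True or False is the report-level choice discussed above):

```python
import pandas as pd

# Hypothetical event forecast with a missing value at 01:00
idx = pd.date_range('2020-01-01 00:00', '2020-01-01 02:00', freq='1h')
event_fx = pd.Series([True, None, False], index=idx, dtype='object')

# Fill missing events with True/1 to impose the larger penalty for rare events
filled = event_fx.fillna(True).astype(bool)
```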

The strategy for probabilistic forecasts might depend on exactly what you're evaluating.

For a ProbabilisticForecast(axis='x'), maybe 100% probability in the smallest variable value (presumably but not necessarily 0); and for a ProbabilisticForecast(axis='y'), maybe 0 for all percentiles.

For a ProbabilisticForecastConstantValue(axis='x'), maybe always 0%; and for ProbabilisticForecastConstantValue(axis='y') always 0.

wholmgren commented 4 years ago

@awig

awig commented 4 years ago

@wholmgren Working on this now. I think the plan for the deterministic forecasts sounds good. Event forecasts seem pretty straightforward as well. I'm not sure I understand the proposal for the probabilistic forecasts.

Some questions/thoughts:

wholmgren commented 4 years ago

Order of Operations: I'm thinking of applying the forecast fill after the observation validation (apply_validation). I don't think it should matter either way, but let me know if you have thoughts.

I agree that I don't think it matters.

Reference Forecasts: If there is a reference, I assume we should apply the same fill method? Should any logic be done to make sure they are filled for the same timestamps with the same values?

Yeah, I think it makes sense to apply the same fill method to the reference forecast. Maybe there are cases where you wouldn't want to do that, but I don't think we can make it an option without a lot of work.

What fill types to support: let me know how much you want to stick with pandas terminology. With name and action:

'fill_forward' (pandas ffill/pad) - propagate the last valid value forward in time until the next valid value
'back_fill' (pandas bfill/backfill) - propagate the next valid value backward in time until the last valid value
'ignore' - drop missing values, similar to the exclude handling used for quality flags right now
'clearsky' - fill with the corresponding clear-sky values from pvlib
(string or float) - fill all missing values with a static numeric value (typically 0)

I can focus on all the others first, leave clearsky to the end, and decide whether to incorporate it in case there are problems.

I don't have much preference for 'ffill' vs 'fill_forward' at this layer of the code. I don't think backfill makes sense in a forecasting scenario. Perhaps 'ignore' should be 'drop' for clarity? Let's skip clear sky for now, but make sure the design is extensible enough to support it later.
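The three retained behaviors map directly onto pandas operations; a minimal sketch, with the option names as agreed in this thread:

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=5, freq='1h')
fx = pd.Series([1.0, None, None, 4.0, None], index=idx)

forward_filled = fx.ffill()     # 'fill_forward': propagate last valid value
dropped = fx.dropna()           # 'drop': exclude missing timestamps entirely
static_filled = fx.fillna(0.0)  # numeric fill: replace all gaps with 0
```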

EventForecast: why not let the same fill methods apply, except clearsky, and restrict numeric values to only 0 or 1?

That's fine with me.

ProbabilisticForecast/ProbabilisticForecastConstantValue: I'm not following your ideas here. I would think we apply the same as above, except clearsky. I'm certain I'm missing something.

Yes for fill forward, ignore/drop. For fill with float for ProbabilisticForecastConstantValue, it's the same at a low level (it's just a float), but the meaning of the float is different depending on the axis, so the defaults are up for debate. For ProbabilisticForecast you have to worry about filling one or more (or always all?) of the constituent ProbabilisticForecastConstantValue. One might argue that if any elements of a ProbabilisticForecast at a given time are missing then all elements at that time must be replaced. We could also just not support the fill with float option for probabilistic forecast types.

awig commented 4 years ago

Perhaps 'ignore' should be 'drop' for clarity?

Sounds good to me

so the defaults are up for debate.

I see; where should the defaults be set, in metrics or in the reports?

One might argue that if any elements of a ProbabilisticForecast at a given time are missing then all elements at that time must be replaced.

I guess there could be another option for ProbabilisticForecast that would fill in between quantiles (say, linear interpolation between the closest quantiles for simplicity). But that seems error prone and more difficult if values are missing for multiple quantiles. But that makes me think of other options.

What do you think would be useful, and what kind of phenomenon are we targeting: short sporadic missing values, comparing forecasts where one has large gaps relative to the other, or just that the resolution of one forecast doesn't match the other?

wholmgren commented 4 years ago

where should the defaults be set in metrics or in the reports?

Good question. Maybe let's leave it for the reports or dashboard.

As for probabilistic forecasts, I am leaning more towards only allowing fill forward and drop. I had some similar thoughts to yours but thought it would be too much processing to explain to a user.

awig commented 4 years ago

Great. I've started working on getting the fills working in their own function preprocessing.apply_fill first and then figure out where to check for the forecast type and integrate into preprocessing.process_forecast_observations.

awig commented 4 years ago

Oh, and on the forward fill: if the data is missing prior to the start timestamp, I was going to fill with the default. I'll try adding it to the report, but we can replace it if desired.
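A sketch of that leading-gap handling, assuming a static default (here 0) for points that forward fill cannot reach:

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=4, freq='1h')
fx = pd.Series([None, None, 3.0, None], index=idx)

# Forward fill, then fall back to the default for the gap before the
# first valid value, which ffill alone leaves as NaN
filled = fx.ffill().fillna(0.0)
```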

awig commented 4 years ago

Do you have a preference for whether missing_forecast in ReportParameters has a default or not? If so, which should it be?

awig commented 4 years ago

@wholmgren Just want to confirm that drop for ProbabilisticForecast needs to drop across "columns"/quantiles. For fill forward, should the same fill apply across "columns"/quantiles? Right now I'm not doing this. What do you think?

wholmgren commented 4 years ago

For drop, if any value in a ProbabilisticForecast is missing then all values should be dropped at that time. I think it should be safe to apply fill forward on a column by column basis.
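With the constant values held in DataFrame columns, the two rules could look like this (an illustrative sketch; column names are hypothetical):

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=3, freq='1h')
prob_fx = pd.DataFrame(
    {'p25': [1.0, None, 3.0], 'p75': [2.0, 4.0, None]}, index=idx)

# 'drop': remove any timestamp where at least one quantile is missing
dropped = prob_fx.dropna(how='any')

# 'fill_forward': applied independently, column by column
forward_filled = prob_fx.ffill()
```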

awig commented 4 years ago

For drop, if any value in a ProbabilisticForecast is missing then all values should be dropped at that time. I think it should be safe to apply fill forward on a column by column basis.

Sounds good. Thanks

awig commented 4 years ago

Do you mind if the PreprocessingResult for ProbabilisticForecast reports per ProbabilisticForecastConstantValue or just overall? Right now I'm just working on reporting the overall total of fills/drops. This would also be consistent with what is reported from resample_and_align.

wholmgren commented 4 years ago

If the input forecast type is ProbabilisticForecast then it's ok to report the total rather than separating by the constituent constant values.
