Closed: wholmgren closed this issue 4 years ago
Probabilistic and event forecasts will need some kind of fill too.
For event forecasts, I suggest filling with all `0`/`False` or all `1`/`True` (depending on the data type where the fill occurs). Which value to choose might depend on the situation. For rare events you might want to fill with 1 for a bigger missing penalty.
The strategy for probabilistic forecasts might depend on exactly what you're evaluating.
For a `ProbabilisticForecast(axis='x')`, maybe 100% probability in the smallest variable value (presumably, but not necessarily, 0); and for a `ProbabilisticForecast(axis='y')`, maybe 0 for all percentiles. For a `ProbabilisticForecastConstantValue(axis='x')`, maybe always 0%; and for a `ProbabilisticForecastConstantValue(axis='y')`, always 0.
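A rough sketch of what the event-forecast fill could look like with pandas (the function and parameter names here are hypothetical, not from the codebase):

```python
import pandas as pd

def fill_event_forecast(fx_series, full_index, fill_value=0):
    """Fill missing event forecast values with a constant 0/False or 1/True.

    For rare events, fill_value=1 imposes a bigger missing penalty,
    per the discussion above.
    """
    return fx_series.reindex(full_index).fillna(fill_value).astype(int)

index = pd.date_range('2020-06-01', periods=4, freq='h')
events = pd.Series([0, 1, 1], index=index[:3])
# the missing fourth timestamp is filled with 1
filled = fill_event_forecast(events, index, fill_value=1)
```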
@awig
@wholmgren Working on this now. I think the plan for the deterministic forecasts sounds good. Event forecasts seem pretty straightforward as well. I'm not sure I understand the proposition on the probabilistic forecasts.
Some questions/thoughts:
**Order of operations**: I'm thinking of applying the forecast fill after the observation validation `apply_validation`. I don't think it should matter either way but let me know if you have thoughts.
**Reference forecasts**: If there is a reference, I assume we should apply the same fill method? Should any logic be done to make sure they are filled for the same timestamps with the same values?
**What fill types to support**: let me know how much you want to stick with pandas terminology with name and action:

- `fill_forward` (pandas `ffill`/`pad`): propagate the last valid value forward in time until the next valid value
- `back_fill` (pandas `bfill`/`backfill`): fill backward in time with the next valid value
- `ignore`: drop missing values, similar to the exclude handling used for quality flags right now
- `clearsky`: fill with the corresponding clear-sky values from pvlib
- static value (string or float): fill all missing values with a static numeric value (typically 0)

I can focus on all the others, leave `clearsky` to the end, and decide whether to incorporate it in case there are problems.
**EventForecast**: why not let the same fill methods apply, except `clearsky`, and restrict numeric values to only 0 or 1?
**ProbabilisticForecast/ProbabilisticForecastConstantValue**: not following your ideas. I would think to apply the same as above, except `clearsky`. I'm certain I'm missing something.
Work plan
> **Order of operations**: I'm thinking of applying the forecast fill after the observation validation `apply_validation`. I don't think it should matter either way but let me know if you have thoughts.
I agree that I don't think it matters.
> **Reference forecasts**: If there is a reference, I assume we should apply the same fill method? Should any logic be done to make sure they are filled for the same timestamps with the same values?
Yeah, I think it makes sense to apply the same fill method to the reference forecast. Maybe there are cases where you wouldn't want to do that, but I don't think we can make it an option without a lot of work.
> **What fill types to support**: let me know how much you want to stick with pandas terminology with name and action:
>
> - `fill_forward` (pandas `ffill`/`pad`): propagate the last valid value forward in time until the next valid value
> - `back_fill` (pandas `bfill`/`backfill`): fill backward in time with the next valid value
> - `ignore`: drop missing values, similar to the exclude handling used for quality flags right now
> - `clearsky`: fill with the corresponding clear-sky values from pvlib
> - static value (string or float): fill all missing values with a static numeric value (typically 0)
>
> I can focus on all the others, leave `clearsky` to the end, and decide whether to incorporate it in case there are problems.
I don't have much preference for `ffill` vs. `fill_forward` at this layer of the code. I don't think backfill makes sense in a forecasting scenario. Perhaps `'ignore'` should be `'drop'` for clarity? Let's skip clear sky for now, but make sure the design is extensible enough to support it later.
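One hedged way to keep that extensibility: a small registry of fill strategies, so a `clearsky` entry could be added later without touching callers (all names here are illustrative, not the project's API):

```python
import pandas as pd

# registry of supported fill strategies; a future 'clearsky'
# strategy would just be another entry here
FILL_METHODS = {
    'forward': lambda fx, index: fx.reindex(index).ffill(),
    'drop': lambda fx, index: fx.reindex(index).dropna(),
}

def apply_fill(fx_series, method, full_index):
    """Apply a named fill strategy, or a static numeric fill value."""
    if method in FILL_METHODS:
        return FILL_METHODS[method](fx_series, full_index)
    # otherwise interpret `method` as a static fill value
    return fx_series.reindex(full_index).fillna(float(method))
```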
> **EventForecast**: why not let the same fill methods apply, except `clearsky`, and restrict numeric values to only 0 or 1?
That's fine with me.
> **ProbabilisticForecast/ProbabilisticForecastConstantValue**: not following your ideas. I would think to apply the same as above, except `clearsky`. I'm certain I'm missing something.
Yes for fill forward and ignore/drop. Fill with a float is the same at a low level for `ProbabilisticForecastConstantValue` (it's just a float), but the meaning of the float is different depending on the axis, so the defaults are up for debate. For `ProbabilisticForecast` you have to worry about filling one or more (or always all?) of the constituent `ProbabilisticForecastConstantValue` objects. One might argue that if any elements of a `ProbabilisticForecast` at a given time are missing then all elements at that time must be replaced. We could also just not support the fill-with-float option for probabilistic forecast types.
> Perhaps `'ignore'` should be `'drop'` for clarity?
Sounds good to me
> so the defaults are up for debate.
I see. Where should the defaults be set: in the metrics or in the reports?
> One might argue that if any elements of a `ProbabilisticForecast` at a given time are missing then all elements at that time must be replaced.
I guess there could be another option for `ProbabilisticForecast` that would fill in between quantiles (say, linearly between the closest quantiles for simplicity). But that seems error prone, and more difficult if values are missing for multiple quantiles. That makes me think of other options, though:
- `average`: I don't really like it, but what about filling with the average, obviously tailored to get certain statistical properties for normal distributions?
- `interpolation`: maybe just linear or spline
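If the interpolation idea were pursued, a per-timestamp linear interpolation across quantile columns might look like this sketch (the column layout is assumed for illustration, not the actual `ProbabilisticForecast` storage):

```python
import numpy as np
import pandas as pd

# rows are timestamps, columns are quantile levels
df = pd.DataFrame(
    {10: [1.0, 1.2], 50: [np.nan, 2.0], 90: [3.0, 3.4]},
    index=pd.date_range('2020-06-01', periods=2, freq='h'),
)
# linear interpolation between neighboring quantiles at each timestamp
filled = df.interpolate(axis=1)
```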
What do you think would be useful, and what kind of phenomenon are we targeting: short sporadic missing values, comparing forecasts where one has large gaps relative to the other, or just that the resolution of one forecast doesn't match the other?
> Where should the defaults be set: in the metrics or in the reports?
Good question. Maybe let's leave it for the reports or dashboard.
As for probabilistic forecasts, I am leaning more towards only allowing fill forward and drop. I had some similar thoughts to yours but thought it would be too much processing to explain to a user.
Great. I've started working on getting the fills working in their own function `preprocessing.apply_fill` first, and then I'll figure out where to check for the forecast type and integrate it into `preprocessing.process_forecast_observations`.
Oh, and on the forward fill: if the data is missing prior to the `start` timestamp, I was going to fill with the default value. I'll try adding it to the report, but we can replace it if desired.
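A small sketch of that leading-gap behavior (0.0 stands in for whatever default is chosen):

```python
import numpy as np
import pandas as pd

full_index = pd.date_range('2020-06-01 00:00', periods=5, freq='h')
# forecast data only begins at the third report timestamp
fx = pd.Series([2.0, np.nan, 3.0], index=full_index[2:])

# forward fill leaves the leading gap, which then gets the default
filled = fx.reindex(full_index).ffill().fillna(0.0)
```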
Any preference on whether `missing_forecast` in `ReportParameters` has a default or not? If so, which should it be?
@wholmgren Just want to confirm that `drop` for `ProbabilisticForecast` needs to drop across all "columns"/quantiles. For the fill forward method, should the fill apply across "columns"/quantiles too? Right now I'm not doing this. What do you think?
For drop, if any value in a `ProbabilisticForecast` is missing then all values should be dropped at that time. I think it should be safe to apply fill forward on a column-by-column basis.
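As a sketch, with the constituent constant values laid out as DataFrame columns (layout assumed for illustration):

```python
import numpy as np
import pandas as pd

index = pd.date_range('2020-06-01', periods=3, freq='h')
df = pd.DataFrame(
    {'p25': [1.0, np.nan, 1.5], 'p75': [2.0, 2.2, np.nan]},
    index=index,
)
# drop: if any constant value is missing, drop the whole timestamp
dropped = df.dropna(how='any')
# fill forward: safe to apply independently, column by column
filled = df.ffill()
```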
> For drop, if any value in a `ProbabilisticForecast` is missing then all values should be dropped at that time. I think it should be safe to apply fill forward on a column-by-column basis.
Sounds good. Thanks
Do you mind if the `PreprocessingResult` for `ProbabilisticForecast` reports per `ProbabilisticForecastConstantValue` or just overall? Right now I'm just working on reporting the overall total of fills/drops. This would also be consistent with what is reported from `resample_and_align`.
If the input forecast type is `ProbabilisticForecast`, then it's ok to report the total rather than separating by the constituent constant values.
We need a way to penalize missing forecasts to fully support trials and other operational evaluations.
One implementation could look like this:
- Report parameter `missing_forecasts = 'ignore'|'clearsky'|0`. Needs to be exposed on the dashboard too.
- `missing_forecasts` is passed to `preprocessing.process_forecast_observations`, along with the report `start` and `end`.
- Before `resample_and_align`: `if missing_forecasts != 'ignore': fx_series_filled = fill_values(forecast, fx_series, missing_forecasts, start, end)`.
- The `forecast` metadata plus `start` and `end` determine the full `DatetimeIndex`. If we don't want to support fill with clearsky then this could just be a call to `reindex` and `fillna`.
- Repeat for the reference forecast.
- Store some information about the number of filled points in the `validation_results` dictionary.
- `resample_and_align` remains unchanged (I think) and excludes forecast values for which observations are missing/flagged.

Seems too doable so I'm afraid I'm missing something.
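The reindex/fillna step from the outline could be sketched roughly like this (the `fill_values` name comes from the outline above; the signature and `freq` handling are assumptions, and the `'clearsky'` option is omitted):

```python
import pandas as pd

def fill_values(fx_series, missing_forecasts, start, end, freq):
    """Reindex onto the full report index and fill with a static value.

    Returns the filled series and the number of points that were filled,
    so that count can be stored in the validation_results dictionary.
    """
    full_index = pd.date_range(start, end, freq=freq)
    reindexed = fx_series.reindex(full_index)
    n_filled = int(reindexed.isna().sum())
    return reindexed.fillna(missing_forecasts), n_filled
```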