Closed: nikosbosse closed this issue 1 year ago
picture upload seems slightly broken...
I've modified some of the language (https://github.com/epiforecasts/simplified-forecaster-evaluation/commit/80001790a93cba64fb42a496dbff05a9ef051e6b) to avoid drawing strong conclusions. I think this is still somewhat useful. Some pushback on your points above:
> a) I think you would generally observe this pattern regardless of which models you compare. I tried it for epinow2 vs. the ensemble and got the same results.
This is a poor example, as these models show exactly the dynamics we discuss in the paper: the ensemble under-covers at wide intervals, epinow2 doesn't, and epinow2 gets penalised for it.
> b) values get downweighted with increasing interval range. This means that even if relative scores diverge at wide intervals, this doesn't meaningfully influence the overall relative score (where you first sum across all ranges and then divide).
This is a plot of the relative weighted interval score, not the relative interval score. I agree that the values are small in the tails, so you cannot sum the relative weighted interval score by interval to get the overall relative score, but I still think it contains information about where the relative difference between the models arises.
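To make the downweighting point concrete, here is a toy sketch (Python for illustration, not the scoringutils implementation; the forecast distribution, observation, and all numbers are made up). It computes the interval score for a hypothetical Normal forecast at several central interval ranges and multiplies each by its WIS weight alpha / 2, showing why wide (tail) intervals contribute little to the overall weighted interval score even when their raw scores are large.

```python
# Toy demonstration of why tail intervals are downweighted in the WIS.
# All values here are hypothetical and chosen only for illustration.
from statistics import NormalDist

def interval_score(lower, upper, y, alpha):
    """IS_alpha: interval width plus (2/alpha) penalties if y falls outside."""
    return ((upper - lower)
            + (2 / alpha) * max(lower - y, 0)
            + (2 / alpha) * max(y - upper, 0))

forecast = NormalDist(mu=10, sigma=2)  # hypothetical predictive distribution
y = 13                                 # hypothetical observation

contributions = {}
for coverage in (0.2, 0.5, 0.8, 0.98):
    alpha = 1 - coverage
    lower = forecast.inv_cdf(alpha / 2)
    upper = forecast.inv_cdf(1 - alpha / 2)
    weight = alpha / 2  # WIS weight: shrinks as the interval widens
    contributions[coverage] = weight * interval_score(lower, upper, y, alpha)
    print(f"{coverage:.0%} interval: "
          f"weighted IS contribution = {contributions[coverage]:.2f}")
```

Running this shows the weighted contribution of the 98% interval is a small fraction of that of the narrow intervals, even though its raw interval score is the widest, which is exactly why per-range ratios in the tails barely move the overall relative WIS.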
If you can think of a better way of exploring the role of central vs. tail forecasts, I'm very happy to switch to that. At the moment I'm trying to achieve this by comparing point forecasts, using this relative comparison, and then drawing some fairly vague conclusions.
Potentially, for clarity, and if no other way of thinking about this comes up, we could include a plot of the interval score as you have done above and then highlight that it is downweighted as the interval range increases?
How about plotting absolute instead of relative differences in the weighted interval score to avoid this issue?
> How about plotting absolute instead of relative differences in the weighted interval score to avoid this issue?
I don't think it avoids the issue if we want to discuss relative differences. I think plotting both might be useful.
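A quick sketch of why plotting both could be informative (the per-interval scores below are invented numbers, not real results from the analysis): the same two sets of per-interval weighted interval scores are compared as absolute differences and as ratios. The tail ratio can look dramatic while the absolute tail difference barely moves the overall relative score, which is computed by summing across ranges first and dividing after.

```python
# Hypothetical per-interval weighted interval scores for two models.
# These numbers are made up purely to illustrate the absolute-vs-relative point.
wis_a = {0.2: 2.9, 0.5: 2.3, 0.8: 1.0, 0.98: 0.10}  # "model A"
wis_b = {0.2: 2.8, 0.5: 2.2, 0.8: 0.8, 0.98: 0.05}  # "model B"

for coverage in wis_a:
    diff = wis_a[coverage] - wis_b[coverage]
    ratio = wis_a[coverage] / wis_b[coverage]
    print(f"{coverage:.0%}: absolute diff = {diff:+.2f}, ratio = {ratio:.2f}")

# Overall relative score: sum across all ranges first, then divide.
overall_ratio = sum(wis_a.values()) / sum(wis_b.values())
print(f"overall ratio = {overall_ratio:.2f}")
```

Here the 98% interval has a ratio of 2.0 but an absolute difference of only 0.05, so the overall ratio stays close to 1; showing both views side by side makes that distinction visible to the reader.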
I'm leaving as is for now. Can discuss more based on the next comment round.
I think this is better handled now. Feel free to reopen if you disagree after reading the new version.
The current figure stratifies scores by interval range and then plots score(surrogate) / score(ensemble).
I argue that this is misleading because: a) I think you would generally observe this pattern regardless of which models you compare (I tried it for epinow2 vs. the ensemble and got the same results); b) values get downweighted with increasing interval range, so even if relative scores diverge at wide intervals, this doesn't meaningfully influence the overall relative score (where you first sum across all ranges and then divide).
Replication for epinow2 vs. the ensemble is below. My last plot filters for horizon 2 and shows the scores for the ensemble vs. epinow2 without any division.
*(image failed to upload)*