epiforecasts / scoringutils

Utilities for Scoring and Assessing Predictions
https://epiforecasts.io/scoringutils/
Other
48 stars 20 forks source link

853: add CRPS decomposition #854

Closed sbfnk closed 3 months ago

sbfnk commented 3 months ago

Description

This PR closes #853.

library("scoringutils")
#> Note: scoringutils is currently undergoing major development changes (with an update planned for the first quarter of 2024). We would very much appreciate your opinions and feedback on what should be included in this major update: https://github.com/epiforecasts/scoringutils/discussions/333
score(as_forecast(example_sample_discrete))
#> ℹ Some rows containing NA values may be removed. This is fine if not
#>   unexpected.
#>      location location_name target_end_date target_type forecast_date
#>        <char>        <char>          <Date>      <char>        <Date>
#>   1:       DE       Germany      2021-05-08       Cases    2021-05-03
#>   2:       DE       Germany      2021-05-08       Cases    2021-05-03
#>   3:       DE       Germany      2021-05-08       Cases    2021-05-03
#>   4:       DE       Germany      2021-05-08      Deaths    2021-05-03
#>   5:       DE       Germany      2021-05-08      Deaths    2021-05-03
#>  ---                                                                 
#> 883:       IT         Italy      2021-07-24      Deaths    2021-07-12
#> 884:       IT         Italy      2021-07-24      Deaths    2021-07-05
#> 885:       IT         Italy      2021-07-24      Deaths    2021-07-12
#> 886:       IT         Italy      2021-07-24      Deaths    2021-07-05
#> 887:       IT         Italy      2021-07-24      Deaths    2021-07-12
#>                      model horizon  bias       dss        crps overprediction
#>                     <char>   <num> <num>     <num>       <num>          <num>
#>   1: EuroCOVIDhub-ensemble       1  0.55 20.601389  7482.96188        2352.35
#>   2: EuroCOVIDhub-baseline       1  0.95 22.443381 20371.25500       15823.10
#>   3:  epiforecasts-EpiNow2       1  0.80 22.310571 24810.45937       17328.65
#>   4: EuroCOVIDhub-ensemble       1  0.20 11.112022    67.53063          11.20
#>   5: EuroCOVIDhub-baseline       1 -0.05 11.541083    86.47312           0.00
#>  ---                                                                         
#> 883: EuroCOVIDhub-baseline       2  0.45 12.913523    66.54000          14.75
#> 884:       UMass-MechBayes       3  0.10  6.357686     6.64750           0.45
#> 885:       UMass-MechBayes       2  0.80  8.769176    29.52313          21.30
#> 886:  epiforecasts-EpiNow2       3  0.10 10.283440    28.84250           0.55
#> 887:  epiforecasts-EpiNow2       2  0.65 10.563036    49.95750          27.40
#>      underprediction  dispersion log_score        mad ae_median      se_mean
#>                <num>       <num>     <num>      <num>     <num>        <num>
#>   1:             0.0 5130.611875 10.989569 17640.7161    9414.0 2.303967e+08
#>   2:             0.0 4548.155000 11.941139 19341.9996   29379.0 9.128526e+08
#>   3:             0.0 7481.809375 12.007010 32348.8494   36513.0 1.603612e+09
#>   4:             0.0   56.330625  6.516994   266.8680      77.0 8.588656e+03
#>   5:             0.2   86.273125  6.913983   397.3368      27.5 2.118301e+03
#>  ---                                                                        
#> 883:             0.0   51.790000  6.067690   168.2751      58.5 6.063906e+04
#> 884:             0.0    6.197500  4.246340    25.2042       5.0 6.084000e+01
#> 885:             0.0    8.223125  5.326167    40.0302      42.5 2.159926e+03
#> 886:             0.0   28.292500  5.708785   108.2298       8.0 3.214890e+03
#> 887:             0.0   22.557500  5.735026    93.4038      73.5 1.262252e+04

Created on 2024-07-11 with reprex v2.1.0

Checklist

sbfnk commented 3 months ago

if we call it overprediction_sample here then maybe it should also be called overprediction_quantile for the WIS?

Wouldn't the better solution within the existing package framework be to define them as S3 methods for the different forecast types?

nikosbosse commented 3 months ago

Wouldn't the better solution within the existing package framework be to define them as S3 methods for the different forecast types?

the problem is that this function is called on a vector/matrix so that's not directly amenable to S3.

Are the CRPS overprediction/underprediction values equivalent to those of the WIS? I.e. if I took a sample-based forecast and converted it into a sample-based forecast, should I expect the overprediction/underprediction/dispersion values to roughly match? (I was wondering whether it would make sense to call the output column something like underprediction_wis/underprediction_crps (or maybe wis_underprediction...)).

sbfnk commented 3 months ago

My suggest would be:

As for your question, yes, they should be the same subject to the information reduction from converting to quantiles, and this seems largely the case:

library("scoringutils")
#> Note: scoringutils is currently undergoing major development changes (with an update planned for the first quarter of 2024). We would very much appreciate your opinions and feedback on what should be included in this major update: https://github.com/epiforecasts/scoringutils/discussions/333
library("tidyr")
library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library("ggplot2")

nreplicates <- 30
nsamples <- 1000

observed <- rnorm(nreplicates, mean = seq_len(nreplicates))
predicted <- replicate(
  nsamples, rnorm(n = nreplicates, mean = seq_len(nreplicates))
)
crps <- crps_sample(
  observed = observed,
  predicted = predicted
)

dcrps <- dispersion_sample(observed, predicted)
ocrps <- overprediction_sample(observed, predicted)
ucrps <- underprediction_sample(observed, predicted)

levels <- seq(0.01, 0.99, by = 0.01)
quantiles <- t(apply(predicted, 1, quantile, probs = levels))

dwis <- dispersion(observed, quantiles, levels)
owis <- overprediction(observed, quantiles, levels)
uwis <- underprediction(observed, quantiles, levels)

df <- bind_rows(
  data.frame(
    metric = "CRPS",
    dispersion = dcrps,
    overprediction = ocrps,
    underprediction = ucrps
  ),
  data.frame(
    metric = "WIS",
    dispersion = dwis,
    overprediction = owis,
    underprediction = uwis
  )
) |>
  group_by(metric) |>
  mutate(replicate = seq_len(n())) |>
  ungroup() |>
  pivot_longer(c(-metric, -replicate)) |>
  pivot_wider(names_from = metric)

ggplot(df, aes(x = CRPS, y = WIS, colour = name)) +
  geom_point() +
  scale_colour_brewer(palette = "Set1") +
  theme_minimal() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed")

Created on 2024-07-16 with reprex v2.1.0

nikosbosse commented 3 months ago

My suggest would be:

  • rename functions to underprediction_quantile etc.
  • keep columns as they are

Sounds good to me! Also I think it would be nice to mention in the docs that the components correspond to the WIS components.

Let me know if you're fine with making these changes - alternatively I'm also happy to take over and do the leg work.

sbfnk commented 3 months ago

Over to you!

seabbs commented 3 months ago

the problem is that this function is called on a vector/matrix so that's not directly amenable to S3.

I don't understand this point? the input arg of an s3 method (i.e x etc) can change type to be whatever you want. If it couldn't how would summary etc work

I'd much prefer this to be done via s3 vs adding custom functionality like this but as there are other lots of specific sample methods I guess that can be a battle for another day.

I think we would ideally modularise the current implementation as as it stands its quite compute wasteful and duplicative

seabbs commented 3 months ago

https://github.com/epiforecasts/scoringutils/pull/854#issuecomment-2230209168

Seems like a very good unit/integration test

nikosbosse commented 3 months ago

the problem is that this function is called on a vector/matrix so that's not directly amenable to S3.

I don't understand this point? the input arg of an s3 method (i.e x etc) can change type to be whatever you want. If it couldn't how would summary etc work

But this is a standalone scoring function that can in principle be called on a vector/matrix (and within score(), we do just pass a vector/matrix to it). That vector/matrix doesn't know anything about the class system we implemented for data.frames.

We could think about changing that, but this would be quite a big design change for the entire package.

seabbs commented 3 months ago

We could think about changing that, but this would be quite a big design change for the entire package.

Got you. Is there any issue to address? I think rather than a design decision I would call it technical debt!

nikosbosse commented 3 months ago

Got you. Is there any issue to address? I think rather than a design decision I would call it technical debt!

Well I think to a large extent the fact that you can call the scoringutils functions in other packages and that you can use functions from other packages is a feature, not a bug!

seabbs commented 3 months ago

Well I think to a large extent the fact that you can call the scoringutils functions in other packages and that you can use functions from other packages is a feature, not a bug!

technical debt isn't a bug. My point is more that I don't think a active decision as to the right approach has been taken here it is more that this is just where we are right now.

nikosbosse commented 3 months ago

@sbfnk I made the suggested changes in #869.

nikosbosse commented 3 months ago

Agree with Seb and would suggest addressing the code duplication in a separate issue (https://github.com/epiforecasts/scoringutils/issues/870). Other than that, any objections to this getting merged?

nikosbosse commented 3 months ago

wohoo! Thanks a lot @sbfnk, and Johannes! This is a really cool PR.

seabbs commented 3 months ago

I appreciate outvoted here and its only minor but I really don't think we should be merging duplicated code etc with features at this stage of scoringutils lifecycle. In many ways, some of these practices are what led to us needing such a large scale refactor (though that has ended up having lots of benefits).

nikosbosse commented 3 months ago

Appreciate the point. My thinking at the moment is that it would be good to finally get this out on CRAN to avoid the co-existence of an outdated version and an obscure dev version, so I'd tend to prioritise public-facing code changes over internal refactoring for now and then circle back to another refactor/clean-up.

But then again you're right and according to this logic we maybe should not have introduced this feature at all.