eco4cast / neon4cast

A helper R package for the neon4cast challenge

Patch scoring speeds #7

Closed cboettig closed 2 years ago

cboettig commented 2 years ago

@rqthomas this is more or less a complete refactor of score.R, so I would greatly appreciate a review here. I've tried to keep the code to short functions that each do one simple, discrete task, so hopefully it's not too difficult to read, but please flag any areas that look dodgy, as they may also be a source of bugs!

# A tibble: 280 × 16
# Groups:   theme, team, issue_date, siteID, time, target [280]
   theme     team  issue_date siteID time       target  mean     sd observed    crps   logs upper95 lower95 interval forecast_start_time horizon
   <chr>     <chr> <chr>      <chr>  <date>     <chr>  <dbl>  <dbl>    <dbl>   <dbl>  <dbl>   <dbl>   <dbl> <drtn>   <date>              <drtn> 
 1 phenology PEG   2021-03-02 BART   2021-03-03 gcc_90  0.34 0.0007    0.346 0.00587  33.6    0.341   0.339 1 days   2021-03-02           1 days
 2 phenology PEG   2021-03-02 BART   2021-03-04 gcc_90  0.35 0.001     0.344 0.00543  12.0    0.352   0.348 1 days   2021-03-02           2 days
 3 phenology PEG   2021-03-02 BART   2021-03-05 gcc_90  0.34 0.002     0.345 0.00348  -2.66   0.344   0.336 1 days   2021-03-02           3 days
 4 phenology PEG   2021-03-02 BART   2021-03-06 gcc_90  0.34 0.0006    0.344 0.00405  20.3    0.341   0.339 1 days   2021-03-02           4 days
 5 phenology PEG   2021-03-02 BART   2021-03-07 gcc_90  0.35 0.0007    0.344 0.00581  32.9    0.351   0.349 1 days   2021-03-02           5 days
 6 phenology PEG   2021-03-02 BART   2021-03-08 gcc_90  0.34 0.001     0.344 0.00372   3.17   0.342   0.338 1 days   2021-03-02           6 days
 7 phenology PEG   2021-03-02 BART   2021-03-09 gcc_90  0.34 0.0004    0.347 0.00648 134.     0.341   0.339 1 days   2021-03-02           7 days
 8 phenology PEG   2021-03-02 BART   2021-03-10 gcc_90  0.35 0.0007    0.345 0.00447  17.8    0.351   0.349 1 days   2021-03-02           8 days
 9 phenology PEG   2021-03-02 BART   2021-03-11 gcc_90  0.34 0.001     0.348 0.00709  23.3    0.342   0.338 1 days   2021-03-02           9 days
10 phenology PEG   2021-03-02 BART   2021-03-12 gcc_90  0.34 0.0009    0.345 0.00467  10.5    0.342   0.338 1 days   2021-03-02          10 days
# … with 270 more rows

Still testing this out against the actual submitted forecast library; we'll see what the net impact on speed will be. If this works, though, generating the combined table should now be both fast and trivial.
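For readers unfamiliar with the score columns in the table, here is a minimal sketch of how `crps`, `logs`, `upper95` and `lower95` could be computed from `mean`, `sd` and `observed`, assuming each forecast is summarized as a normal distribution. The function names `crps_normal`, `logs_normal` and `score_normal` are hypothetical and this is not necessarily how score.R does it; it only uses the standard closed-form CRPS for a Gaussian and base R.

```r
# Closed-form CRPS for a normal forecast N(mean, sd^2) against an observation.
# (Assumption: forecasts are treated as normal; names here are illustrative.)
crps_normal <- function(observed, mean, sd) {
  z <- (observed - mean) / sd
  sd * (z * (2 * pnorm(z) - 1) + 2 * dnorm(z) - 1 / sqrt(pi))
}

# Logarithmic score: negative log density of the observation under the forecast.
logs_normal <- function(observed, mean, sd) {
  -dnorm(observed, mean = mean, sd = sd, log = TRUE)
}

# Hypothetical helper: add score and 95% interval columns to a data frame
# that already has mean, sd and observed columns.
score_normal <- function(df) {
  df$crps    <- crps_normal(df$observed, df$mean, df$sd)
  df$logs    <- logs_normal(df$observed, df$mean, df$sd)
  df$upper95 <- df$mean + qnorm(0.975) * df$sd
  df$lower95 <- df$mean - qnorm(0.975) * df$sd
  df
}

# One row using the digits displayed in the first row of the table.
example <- data.frame(mean = 0.34, sd = 0.0007, observed = 0.346)
scored <- score_normal(example)
print(scored)
```

Values computed from the displayed digits may not match the table exactly, since the printed tibble shows only a few significant digits of `mean` and `sd`.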

cboettig commented 2 years ago

@rqthomas ok, I think this is ready for review!

I've tested it against the scoring.R script for scoring all challenge entries (though not publishing anything). That run took only 30 min:

image

Because the score file is now written out as soon as a forecast is scored, rather than collecting all scores and then writing out all the score csvs, this approach is more memory-efficient too, which means it should be okay to run with increased parallelization. I ran with 2 cores, but could easily go up to 4 or 8 and cut down scoring time further.
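A rough sketch of the write-as-you-go pattern described above, not the actual code in this PR: each worker scores one forecast file, writes its csv immediately, and returns only the output path, so no worker ever holds more than one forecast's scores in memory. `score_one` and `score_and_write` are hypothetical stand-ins, and the use of `parallel::mclapply` is an assumption about how the parallel run might be set up.

```r
# Hypothetical stand-in for the real scoring step: reads one forecast file
# and returns a data frame of scores (here just an absolute error).
score_one <- function(forecast_file) {
  forecast <- read.csv(forecast_file)
  forecast$abs_error <- abs(forecast$mean - forecast$observed)
  forecast
}

# Score a single forecast and write the result to disk right away.
# Only the output path is returned, not the scores themselves.
score_and_write <- function(forecast_file, out_dir) {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  scores <- score_one(forecast_file)
  out_file <- file.path(out_dir, paste0("scores-", basename(forecast_file)))
  write.csv(scores, out_file, row.names = FALSE)
  out_file
}

# Tiny demo with made-up forecast files in a temporary directory.
in_dir  <- file.path(tempdir(), "forecasts")
out_dir <- file.path(tempdir(), "scores")
dir.create(in_dir, showWarnings = FALSE)
for (team in c("teamA", "teamB")) {
  write.csv(data.frame(mean = c(0.34, 0.35), observed = c(0.346, 0.344)),
            file.path(in_dir, paste0(team, ".csv")), row.names = FALSE)
}
forecast_files <- list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)

# mclapply forks workers on Unix-alikes; on Windows mc.cores must be 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
score_files <- unlist(parallel::mclapply(forecast_files, score_and_write,
                                         out_dir = out_dir, mc.cores = n_cores))
print(score_files)
```

Because memory use per worker stays flat, raising `n_cores` mainly trades CPU for wall-clock time.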

Maybe the best part is that generating the combined table is now simply a matter of reading in the csvs, which can leverage the readr 2.0 feature of taking a vector of file paths:

> bench::bench_time({
+   
+ scores_files <- fs::dir_ls("scores/", type="file", recurse = TRUE)
+ combined <- readr::read_csv(scores_files, progress = FALSE, lazy = FALSE, show_col_types = FALSE)
+ 
+ })
process    real 
    11s   4.37s

(haven't tried comparing these combined scores to the original ones yet though...)
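One way that comparison could be done, sketched in base R under the assumption that both tables share a set of key columns that uniquely identify a row. `compare_scores` and the choice of keys are hypothetical; the real tables may need different keys or a looser tolerance.

```r
# Hypothetical helper: compare two score tables regardless of row or column
# order. Returns TRUE, or a character vector describing the differences
# (the usual all.equal() behaviour).
compare_scores <- function(new, old, keys, tolerance = 1e-8) {
  sort_rows <- function(df) {
    row_order <- do.call(order, unname(as.list(df[keys])))
    df <- df[row_order, sort(names(df)), drop = FALSE]
    rownames(df) <- NULL
    df
  }
  all.equal(sort_rows(new), sort_rows(old), tolerance = tolerance)
}

# Made-up example: same scores, shuffled rows and columns.
old <- data.frame(team = c("A", "B"),
                  time = c("2021-03-03", "2021-03-03"),
                  crps = c(0.1, 0.2))
new <- old[c(2, 1), c("crps", "team", "time")]
same <- compare_scores(new, old, keys = c("team", "time"))
print(same)
```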