scoringutils 0.1.7.2 review

Bisaloo commented 3 years ago

Preface

This is an informal review conducted by a lab member. To ensure maximal objectivity, the rOpenSci review template is used. This template also guarantees that this package is following the most up-to-date and strictest standards available in the R community.

The template is released under CC-BY-NC-SA and this review is therefore published under the same license.

The review was finished on 2021-07-27 and concerns the version 0.1.7.2 of scoringutils (commit de45fb7d78e1a334fe9e24cf3435f648c82a3cb8).

Package Review

Documentation

The package includes all the following forms of documentation:

☒ A statement of need clearly stating problems the software is designed to solve and its target audience in README

The scoringutils package provides a collection of metrics and proper scoring rules that make it simple to score forecasts against the true observed values.

[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
- [x] Multiple plots are impossible to read in the CRAN-rendered version of the vignette. The plot dimensions should be adjusted.
- [x] I would refrain from using dispensable non-base packages in the package as it introduces extra cognitive load for users. In addition to learning how to use scoringutils, they now possibly have to understand how to use whichever non-base package you use. (PR #119)
- [x] I recommended using the fct() when talking about functions. It is then easier to make the difference between functions and other objects and it enables auto-linking to the function documentation in the pkgdown website. (PR #119)
[x] Function Documentation: for all exported functions
- Good job writing what is the default argument value for many functions!
- [ x] I recommend using the roxygen2 markdown syntax (done in #115)
- [x] More functions should inherit documentation chunks to reduce risks of errors / outdated docs when updating
- [x] It’s seem strange that eval_forecasts_quantile() doesn’t have the same treatment as, e.g., eval_forecasts_binary() and get some basic documentation.
- [x] There is some inconsistency in formatting when using mentions such as ‘only required if plot = TRUE’. Sometimes, it reads ‘only required if plot == TRUE’, but not always. I recommend switching all these statements using the phrasing ’if plot = TRUE, which seems more common in the R community.
- [x] There is a minor issue with the equation rendering in the pkgdown website (e.g., https://epiforecasts.io/scoringutils/reference/abs_error.html). The solution is probably to pass both a LaTeX/mathjax and a ASCII version of the equation to \deqn{}.
- [x] The metrics argument should point to the list_of_avail_metrics() function so users know what their choices are.
- [x] There might be a typo in the documentation of the summarised argument in eval_forecasts(). Do you mean summarise_by instead of group_by? https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/eval_forecasts.R#L109-L110
- [x] I do not really understand the documentation of the test_options argument in compare_two_models(). It reads: > list with options to pass down to compare_two_models
But we are already in the documentation of compare_two_models and I don’t see any more documentation of this argument and which values are possible.
- [x] I don’t understand the documentation of the rel_skill_metric argument in eval_forecasts(). It reads:
If equal to ‘auto’ (the default), then one of interval score, crps or brier score will be used where appropriate

What do you mean by ‘where appropriate’, how do you choose which one these indices is used.
- [ ] Aren’t some metrics missing from the @details in eval_forecasts(). For example, where is the coverage?
[x] Examples (that run successfully locally) for all exported functions

Some functions are missing examples:
- [x] compare_two_models()
- [x] pairwise_comparison_one_group()
- [x] quantile_to_long()
- [x] quantile_to_wide()
- [x] range_to_quantile()
- [x] quantile_to_range()
- [x] sample_to_range()
- [x] merge_pred_and_obs()
- [x] hist_PIT()
- [x] hist_PIT_quantile()
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
- [x] There is no CONTRIBUTING guide. Adding one could signal more explicitly that external contributions are welcome.
- [ x] The link to the pkgdown website should be added in DESCRIPTION (alongside the URL to the GitHub repo). This helps users looking for documentation but it is also necessary to enable automatic linking from other pkgdown websites (done in PR #118).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
- I did not verify the package works for R versions as old as 2.10 but upon cursory glance, I could find use of any of the “newer” base R functions (listed in https://github.com/r-lib/backports).
Performance: Any performance claims of the software been confirmed.
[ ] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
- Regarding continuous integration, the package uses the current best practices in the community: GitHub actions with tests on various OS and R versions.
- [ ] However, testing is the weakest point of this package. The coverage at the time of writing this review is at 48.9469862, when a package of this type should aim for around 90%% (excluding plotting functions that are usually more difficult to integrate in tests). Additionally, this value is somewhat inflated by ‘weak’ tests that only test if the function is producing an output but does not check this output. A good practice for a package of this type would be to compare its output to ‘known true values’ from the literature. Ideally, papers where these indices are defined.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines
- In terms of interface and what functionality is exposed to users, I think this package does exactly the right thing by providing access to both the low-level individual scores and to the higher-level eval_forecasts().
- [x] There are some inconsistencies in naming. I mostly noticed this while creating the pkgdown reference index (PR #113). With just the function/object names, it was difficult to infer what each function would return. I would recommend using the prefix example_ for all example data objects. It is present in all names currently but at a different position in many times. Similarly, it would be helpful to have the plotting functions start with plot_ or something of the sort.
- It would be useful to tag each release on GitHub as some users might rely on this to get informed of a new release. It is made easier with the usethis::use_github_release() function.
- [x] News items do not appear for some versions on https://epiforecasts.io/scoringutils/news/index.html. This is because the version number and the subsection have the same heading level. I submitted a fix in #116.
- [x] The README (Readme.md) and the CHANGELOG (NEWS.md) should be added to the package tarball so there are available with the package on CRAN. This is done by removing these files from .Rbuildignore (done in #117)
- [x] It should be easy to remove the dependency on forcats and I would be inclined to remove it as it introduces an extra potential source of breakage for a very minimal gain (as opposed to dependency on data.table or ggplot2). But since all the recursive dependencies of forcats are also dependencies of ggplot2, I would understand it the lead developer decides it’s not worth the trouble.
- [x] It is generally recommended to include a section ‘Similar projects’ in the README where you can highlight the differences and strengths of this tool against the existing alternatives. In this specific case, I would be interested in hearing about the differences with scoringRules which is one of the dependencies.

Estimated hours spent reviewing: 13h

Review Comments / Code Review

[x] I noticed some occurrences of bool == TRUE which doesn’t really make sense and makes the code more difficult to read. If the object is already a logical, you can use it as-is in if. For example:

https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/bias.R#L86-L90

could be simplified as

continuous_predictions <- !all.equal(as.vector(predictions), as.integer(predictions))

Some other occurrences:

https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/eval_forecasts_continuous_integer.R#L52

https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/pit.R#L157

Good job on specifying the type of your NA (NA_real_, NA_integer_, etc.)!
[x] I think there is some inconsistency in type / value checking before computation. For example, in brier_score(), there is a check that predictions takes values between 0 and 1 but this check is not present in, e.g., bias(). Would it make sense to have check_truth(), check_predictions() internal functions that you call each time?
[x] Would it be worth printing a warning when the user requests ‘coverage’ in eval_forecasts_sample() instead of silently dropping it?
[ ] As mentioned in another discussion, there is some inconsistency in the use of data.table and modifying in place vs copying. Beyond the stylistic issue, this is a possible source of bugs so I’d recommend sticking to one or the other.
[x] Some functions have a very large number of arguments which makes them difficult to use. Research in software engineering tends to suggests to the number of arguments should not exceed ~7. As the function is complex, it may be difficult to reduce the number of arguments but here are possible some ways:
- [x] drop the verbose argument. Either the diagnostic messages are unnecessary and should be dropped entirely or they are useful and should be printed. If users really don’t want to see messages/warnings, they can use base functions suppressMessages() / suppressWarnings(). This could also be controlled by a global switch in options() like usethis is doing.
- [x] plotting side effects could be removed. The primary goal of eval_forecasts() is to return a data.frame with the scores. Users that want the plot could call another function afterwards. This would allow the removal of the pit_plots argument.
- [x] remove the possibility of having either a single data or forecasts, truth_data and merge_by as inputs.
- [x] get rid of summarised and act as if summarised = TRUE if by != summarise_by (is this #106?)
I understand the wish to provide flexibility but since eval_forecasts() is meant to be a high-level wrapper / one-liner to compute everything, I believe it’s okay to provide a limited interface. Users that strive for more flexibility can always use the low-level individual scoring functions.
[x] I think there is a misplaced closing parenthesis here and you mean
```
if (length(cols_to_delete) > 1) {
```
instead of the current:

https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/pairwise-comparisons.R#L165

Same here:

https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/pairwise-comparisons.R#L220

If this is indeed a bug, its presence twice in the same file suggests this code portion should be refactored as a function (isn’t that the purpose of delete_columns() already?). By the way, why is it > 1 in one case and > 0 in the other?
[x] I cannot say for sure because as mentioned previously, I don’t understand the documentation of the test_options argument in compare_two_models() but this selection of the first element ([1]) does not seem super robust: https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/pairwise-comparisons.R#L373
[x] There are some occurrence where loops (vapply()) are used when you could rely on vectorized functions / linear algebra for much faster computation and a more readable code (fixed in #120):

[ ] https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/bias.R#L95-L99

[ ] https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/bias.R#L106-L110

[ ] https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/pit.R#L169-L173

[ ] https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/pit.R#L193-L197

predictions <- matrix(rnorm(100), ncol = 10, nrow = 10)
true_values <- rnorm(10)

microbenchmark::microbenchmark(
  "vapply" = { vapply(seq_along(true_values), function(i) sum(predictions[i, ] <= true_values[i]), numeric(1)) },
  "vector" = rowSums(predictions <= true_values),
  check = "identical"
)

## Unit: microseconds
##    expr    min      lq     mean  median     uq    max neval
##  vapply 15.492 16.4605 17.74523 16.8215 17.302 47.741   100
##  vector  5.161  5.7475  6.24385  5.9215  6.089 34.897   100

[x] In ggplot2 plots with the facet_wrap_or_grid argument, I would change the default value to c("facet_wrap", "facet_grid") and start the function with:
```
facet_wrap_or_grid <- match.arg(facet_wrap_or_grid)
```
Currently, a very minor and inconspicuous typo such as "facet_warp" would make it silently switch to facet_grid and it would be very difficult to notice the mistake.
[x] All functions in scoringRules_wrappers.R seem to have the same checks at the beginning. It would be less error-prone to refactor this.
[x] This is not a robust way to get the value of an argument with defaults:

https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/utils.R#L208

https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/utils.R#L220

https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/utils_data_handling.R#L422-L430

Instead, you should use:

comparison_mode <- match.arg(comparison_model)

...

if (comparison_mode == "ratio") { ... }

Deprecated functions from utils_data_handling.R should be categorised as such in the pkgdown reference index.
There is no unit testing here since this is not a testthat function:

https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/tests/testthat/test-bias.R#L48

Conclusion

This is overall a solid package that could become a widely used tool in forecast sciences. I could not see any bugs in the code and the performance looks very good on the examples I ran. The package interface is clever and can surely prove useful to a large array of users thanks to the two levels of functions (low-level scoring functions vs all-in-one eval_forecasts()).

Two points could slow down / reduce adoption and these should be fixed for this package to reach its full potential and attract as many users as possible:

the package remains complex to use. This complexity is in part inherent to the task but it could nonetheless be reduced by following best practices in software engineering such as reducing the number of parameters and adopting a consistent naming scheme.
there is no strong evidence that this package implements correctly the computed metrics. This is especially important for fields that can have a policy impact. Test coverage should be increased and comparisons to computation via other tools / methods should be added.

seabbs commented 3 years ago

This is amazing and very useful.

These points about reducing interface complexity seem spot on:

drop the verbose argument. Either the diagnostic messages are unnecessary and should be dropped entirely or they are useful and should be printed. If users really don’t want to see messages/warnings, they can use base functions > suppressMessages() / suppressWarnings(). This could also be controlled by a global switch in options() like usethis is > doing.

plotting side effects could be removed. The primary goal of eval_forecasts() is to return a data.frame with the scores. Users that want the plot could call another function afterwards. This would allow the removal of the pit_plots argument.

remove the possibility of having either a single data or forecasts, truth_data and merge_by as inputs. get rid of summarised and act as if summarised = TRUE if by != summarise_by (is this Check whether we can get rid of the summarised = TRUE argument? #106?)

nikosbosse commented 2 years ago

Bonjour!

A few questions:

do I need examples for non-exported functions? In the list above, this is compare_two_models() and pairwise_comparison_one_group()

recommended using the fct() when talking about functions. It is then easier to make the difference between functions and other objects and it enables auto-linking to the function documentation in the pkgdown website.

does that work in the vignette as well?

Also re the vignette: at some point probably the future paper should become the vignette. At the moment the vignette is essentially the same as the Readme.Rmd file (that is then displayed as Readme.md). At the moment I've just been using function() both in the readme and the vignette. Is there a better way to handle this?

there is a minor issue with the equation rendering in the pkgdown website (e.g., https://epiforecasts.io/scoringutils/reference/abs_error.html). The solution is probably to pass both a LaTeX/mathjax and a ASCII version of the equation to \deqn{}.

how can I do that?

As mentioned in another discussion, there is some inconsistency in the use of data.table and modifying in place vs copying. Beyond the stylistic issue, this is a possible source of bugs so I’d recommend sticking to one or the other.

not entirely sure what to do here

Thank you very much!

seabbs commented 2 years ago

I think you can't have examples for non-exported functions? Or at least you can't without some workaround. I would say no anyway.

Yes

Will the paper be kept updated for ever? I would probably be in favour of having the paper content spread across multiple vignettes as I imagine quite long. That way will be more fluid and easy to update. If the vignette is the same as the readme I would probably drop the vignette or focus move most of the content into the vignette and just keep a small quick start in the readme.

no idea on the equation issue (@bisaloo probably knows).

I would just use data.table or dplyr. I doon't think its a major issue though.

The preview you gave today looked great by the way.

seabbs commented 2 years ago

Is this closable or worth going back over?

nikosbosse commented 2 years ago

I think we can close it. Testing remains an issue, but that has its own issue :)

nikosbosse commented 2 years ago

Thank you again @Bisaloo, this was amazing!

seabbs commented 2 years ago

Might be best to let @Bisaloo close as part of the review process?

Bisaloo commented 2 years ago

Yep, I think the major points have been moved to separate issues. Let's continue the discussion there.

epiforecasts / scoringutils