This is amazing and very useful.
These points about reducing interface complexity seem spot on:
- drop the `verbose` argument. Either the diagnostic messages are unnecessary and should be dropped entirely or they are useful and should be printed. If users really don't want to see messages/warnings, they can use the base functions `suppressMessages()` / `suppressWarnings()`. This could also be controlled by a global switch in `options()`, as usethis does (see the sketch after this list).
- plotting side effects could be removed. The primary goal of `eval_forecasts()` is to return a data.frame with the scores. Users that want the plot could call another function afterwards. This would allow the removal of the `pit_plots` argument.
- remove the possibility of having either a single `data` or `forecasts`, `truth_data` and `merge_by` as inputs.
- get rid of `summarised` and act as if `summarised = TRUE` if `by != summarise_by` (is this #106, "Check whether we can get rid of the summarised = TRUE argument?"?)
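For illustration, a minimal sketch of such a global switch, modelled on usethis's `usethis.quiet` option. The option name `scoringutils.verbose` and the helper `say()` are assumptions for illustration, not existing scoringutils features:

```r
# Hypothetical helper: all messages go through one gatekeeper that
# consults a package-level option.
say <- function(...) {
  if (isTRUE(getOption("scoringutils.verbose", default = TRUE))) {
    message(...)
  }
}

say("Removed rows with missing predictions.")  # printed by default
options(scoringutils.verbose = FALSE)
say("Removed rows with missing predictions.")  # now silenced
```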
Hello!
A few questions:
> Some functions are missing examples: `compare_two_models()`, `pairwise_comparison_one_group()`, ...

- these are not exported; should they have examples at all?

> recommended using the `fct()` format when talking about functions. It is then easier to make the difference between functions and other objects and it enables auto-linking to the function documentation in the pkgdown website.

- does that work in the vignette as well? I currently refer to each `function()` both in the readme and the vignette. Is there a better way to handle this?

> there is a minor issue with the equation rendering in the pkgdown website (e.g., https://epiforecasts.io/scoringutils/reference/abs_error.html). The solution is probably to pass both a LaTeX/mathjax and an ASCII version of the equation to `\deqn{}`.

- how can I do that? (see the `\deqn{}` sketch after this list)

> As mentioned in another discussion, there is some inconsistency in the use of data.table and modifying in place vs copying. Beyond the stylistic issue, this is a possible source of bugs so I'd recommend sticking to one or the other.

- not entirely sure what to do here
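On the `\deqn{}` question: Rd markup accepts a second, plain-text argument that is used where LaTeX cannot be rendered, i.e. `\deqn{latex}{ascii}`. A minimal roxygen2 sketch (the formula is a paraphrase for illustration, not copied from the package):

```r
#' @details The absolute error is computed as
#'   \deqn{|y - \hat{y}|}{|y - yhat|}
```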
Thank you very much!
I think you can't have examples for non-exported functions? Or at least you can't without some workaround. I would say no anyway.
Yes
Will the paper be kept updated forever? I would probably be in favour of having the paper content spread across multiple vignettes, as I imagine it will be quite long. That way it will be more fluid and easy to update. If the vignette is the same as the readme, I would probably drop the vignette, or move most of the content into the vignette and just keep a small quick start in the readme.
no idea on the equation issue (@bisaloo probably knows).
I would just use data.table or dplyr. I don't think it's a major issue though.
The preview you gave today looked great by the way.
Is this closable or worth going back over?
I think we can close it. Testing remains an issue, but that has its own issue :)
Thank you again @Bisaloo, this was amazing!
Might be best to let @Bisaloo close as part of the review process?
Yep, I think the major points have been moved to separate issues. Let's continue the discussion there.
## Preface
This is an informal review conducted by a lab member. To ensure maximal objectivity, the rOpenSci review template is used. This template also guarantees that this package is following the most up-to-date and strictest standards available in the R community.
The template is released under CC-BY-NC-SA and this review is therefore published under the same license.
The review was finished on 2021-07-27 and concerns version 0.1.7.2 of scoringutils (commit de45fb7d78e1a334fe9e24cf3435f648c82a3cb8).
## Package Review

### Documentation
The package includes all the following forms of documentation:

- [x] Installation instructions: for the development version of the package and any non-standard dependencies in README
- [x] Vignette(s) demonstrating major functionality that runs successfully locally
  - I recommended using the `fct()` format when talking about functions. It is then easier to make the difference between functions and other objects and it enables auto-linking to the function documentation in the pkgdown website. (PR #119)
- [x] Function Documentation: for all exported functions
  - `eval_forecasts_quantile()` doesn't have the same treatment as, e.g., `eval_forecasts_binary()`; it should get some basic documentation as well.
  - Several arguments are documented as being relevant only "if `plot = TRUE`". Sometimes, it reads "only required if `plot == TRUE`", but not always. I recommend switching all these statements to the phrasing "if `plot = TRUE`", which seems more common in the R community.
  - There is a minor issue with the equation rendering in the pkgdown website (e.g., https://epiforecasts.io/scoringutils/reference/abs_error.html). The solution is probably to pass both a LaTeX/mathjax and an ASCII version of the equation to `\deqn{}`.
  - The documentation of the `metrics` argument should point to the `list_of_avail_metrics()` function so users know what their choices are.
  - I don't understand the documentation of the `summarised` argument in `eval_forecasts()`. Do you mean `summarise_by` instead of `group_by`? https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/eval_forecasts.R#L109-L110
  - I don't understand the documentation of the `test_options` argument in `compare_two_models()`. It reads:

    > list with options to pass down to `compare_two_models`

    But we are already in the documentation of `compare_two_models()` and I don't see any further documentation of this argument and which values are possible.
  - The documentation of the `rel_skill_metric` argument in `eval_forecasts()` is unclear: what do you mean by "where appropriate"? How do you choose which one of these indices is used?
  - Not all computed metrics seem to be described in the `@details` section of `eval_forecasts()`. For example, where is the coverage?
- [x] Examples (that run successfully locally) for all exported functions
  - Some functions are missing examples:
    - `compare_two_models()`
    - `pairwise_comparison_one_group()`
    - `quantile_to_long()`
    - `quantile_to_wide()`
    - `range_to_quantile()`
    - `quantile_to_range()`
    - `sample_to_range()`
    - `merge_pred_and_obs()`
    - `hist_PIT()`
    - `hist_PIT_quantile()`
- [x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with `URL`, `BugReports` and `Maintainer` (which may be autogenerated via `Authors@R`).
  - The URL of the pkgdown website should be added to `DESCRIPTION` (alongside the URL to the GitHub repo). This helps users looking for documentation but it is also necessary to enable automatic linking from other pkgdown websites (done in PR #118).

### Functionality
- [x] Installation: Installation succeeds as documented.
- [x] Functionality: Any functional claims of the software have been confirmed.
- [x] Performance: Any performance claims of the software have been confirmed.
- [ ] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
  - Regarding continuous integration, the package uses the current best practices in the community: GitHub Actions with tests on various OS and R versions.
  - [ ] However, testing is the weakest point of this package. The coverage at the time of writing this review is at 48.95%, when a package of this type should aim for around 90% (excluding plotting functions, which are usually more difficult to integrate in tests). Additionally, this value is somewhat inflated by "weak" tests that only check that the function produces an output, without checking the output itself. A good practice for a package of this type would be to compare its output to "known true values" from the literature, ideally from the papers where these indices are defined.
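For instance, such a "known value" test could look like this sketch, where the expected value is hand-computed from the Brier score definition, mean((p − y)²). The argument order of `brier_score()` is assumed here:

```r
library(testthat)
library(scoringutils)

test_that("brier_score() matches a hand-computed value", {
  truth <- c(1, 0, 1)
  predictions <- c(0.9, 0.1, 0.5)
  # ((0.9 - 1)^2 + (0.1 - 0)^2 + (0.5 - 1)^2) / 3 = 0.27 / 3 = 0.09
  expect_equal(brier_score(truth, predictions), 0.09)
})
```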
- [x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines
  - In terms of interface and what functionality is exposed to users, I think this package does exactly the right thing by providing access to both the low-level individual scores and to the higher-level `eval_forecasts()`.
- [x] There are some inconsistencies in naming. I mostly noticed this while creating the pkgdown reference index (PR #113). With just the function/object names, it was difficult to infer what each function would return. I would recommend using the prefix `example_` for all example data objects. It is present in all names currently, but at a different position in many cases. Similarly, it would be helpful to have the plotting functions start with `plot_` or something of the sort.
- It would be useful to tag each release on GitHub as some users might rely on this to get informed of a new release. This is made easier with the `usethis::use_github_release()` function.
- [x] News items do not appear for some versions on https://epiforecasts.io/scoringutils/news/index.html. This is because the version number and the subsection have the same heading level. I submitted a fix in #116.
- [x] The README (`Readme.md`) and the CHANGELOG (`NEWS.md`) should be added to the package tarball so they are available with the package on CRAN. This is done by removing these files from `.Rbuildignore` (done in #117).
- [x] It should be easy to remove the dependency on forcats and I would be inclined to remove it, as it introduces an extra potential source of breakage for a very minimal gain (as opposed to the dependencies on data.table or ggplot2). But since all the recursive dependencies of forcats are also dependencies of ggplot2, I would understand if the lead developer decides it's not worth the trouble.
- [x] It is generally recommended to include a "Similar projects" section in the README where you can highlight the differences and strengths of this tool against the existing alternatives. In this specific case, I would be interested in hearing about the differences with scoringRules, which is one of the dependencies.

Estimated hours spent reviewing: 13h
## Review Comments / Code Review
- [x] There are some occurrences of `bool == TRUE`, which doesn't really make sense and makes the code more difficult to read. If the object is already a logical, you can use it as-is in `if`. For example:
  https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/bias.R#L86-L90
  could be simplified, for example (an illustration of the pattern, not the original snippet):
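```r
# Illustration of the pattern only, not the code from bias.R:
continuous_predictions <- TRUE

# before: redundant comparison of a logical against TRUE
if (continuous_predictions == TRUE) {
  message("treating predictions as continuous")
}

# after: use the logical directly (or isTRUE() if it might be NA)
if (continuous_predictions) {
  message("treating predictions as continuous")
}
```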
  Some other occurrences:
  - https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/eval_forecasts_continuous_integer.R#L52
  - https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/pit.R#L157
- Good job on specifying the type of your `NA`s (`NA_real_`, `NA_integer_`, etc.)!
- [x] I think there is some inconsistency in type / value checking before computation. For example, in `brier_score()`, there is a check that `predictions` takes values between 0 and 1, but this check is not present in, e.g., `bias()`. Would it make sense to have `check_truth()` and `check_predictions()` internal functions that you call each time?
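A minimal sketch of what such a shared validator might look like (the name, arguments and messages here are hypothetical, not existing scoringutils code):

```r
# Hypothetical internal helper: one place to validate inputs before scoring.
check_predictions <- function(predictions, true_values, range = NULL) {
  if (!is.numeric(predictions)) {
    stop("`predictions` must be numeric.")
  }
  if (NROW(predictions) != length(true_values)) {
    stop("`predictions` and `true_values` must match in length.")
  }
  if (!is.null(range) && any(predictions < range[1] | predictions > range[2])) {
    stop("`predictions` must lie between ", range[1], " and ", range[2], ".")
  }
  invisible(predictions)
}

# e.g. inside brier_score(): check_predictions(predictions, true_values, range = c(0, 1))
# e.g. inside bias():        check_predictions(predictions, true_values)
```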
- [x] Would it be worth printing a warning when the user requests "coverage" in `eval_forecasts_sample()` instead of silently dropping it?
- [ ] As mentioned in another discussion, there is some inconsistency in the use of data.table and modifying in place vs copying. Beyond the stylistic issue, this is a possible source of bugs so I'd recommend sticking to one or the other.
- [x] Some functions have a very large number of arguments, which makes them difficult to use. Research in software engineering tends to suggest that the number of arguments should not exceed ~7. As the function is complex, it may be difficult to reduce the number of arguments, but here are some possible ways:
  - drop the `verbose` argument. Either the diagnostic messages are unnecessary and should be dropped entirely or they are useful and should be printed. If users really don't want to see messages/warnings, they can use the base functions `suppressMessages()` / `suppressWarnings()`. This could also be controlled by a global switch in `options()`, as usethis does.
  - plotting side effects could be removed. The primary goal of `eval_forecasts()` is to return a data.frame with the scores. Users that want the plot could call another function afterwards. This would allow the removal of the `pit_plots` argument.
  - remove the possibility of having either a single `data` or `forecasts`, `truth_data` and `merge_by` as inputs.
  - get rid of `summarised` and act as if `summarised = TRUE` if `by != summarise_by` (is this #106?)

  I understand the wish to provide flexibility but since `eval_forecasts()` is meant to be a high-level wrapper / one-liner to compute everything, I believe it's okay to provide a limited interface. Users that strive for more flexibility can always use the low-level individual scoring functions.
- [x] I think there is a misplaced closing parenthesis here:
  https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/pairwise-comparisons.R#L165
Same here:
https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/pairwise-comparisons.R#L220
  If this is indeed a bug, its presence twice in the same file suggests this code portion should be refactored into a function (isn't that the purpose of `delete_columns()` already?). By the way, why is it `> 1` in one case and `> 0` in the other?
- [x] I cannot say for sure because, as mentioned previously, I don't understand the documentation of the `test_options` argument in `compare_two_models()`, but this selection of the first element (`[1]`) does not seem super robust: https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/pairwise-comparisons.R#L373
- [x] There are some occurrences where loops (`vapply()`) are used when you could rely on vectorized functions / linear algebra for much faster computation and more readable code (fixed in #120; an illustrative sketch follows the links below):
  - [ ] https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/bias.R#L95-L99
  - [ ] https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/bias.R#L106-L110
  - [ ] https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/pit.R#L169-L173
  - [ ] https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/pit.R#L193-L197
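As an illustration of this kind of rewrite, a toy pattern (not the package code) computing, for each observation, the share of predictive samples below the observed value:

```r
# Toy data: 5 observations, 100 predictive samples each
set.seed(1)
predictions <- matrix(rnorm(500), nrow = 5)
true_values <- rnorm(5)

# loop version
p_loop <- vapply(
  seq_along(true_values),
  function(i) mean(predictions[i, ] <= true_values[i]),
  numeric(1)
)

# vectorized version: column-major recycling compares row i with true_values[i]
p_vec <- rowMeans(predictions <= true_values)

all.equal(p_loop, p_vec)  # TRUE
```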
- [x] In ggplot2 plots with the `facet_wrap_or_grid` argument, I would change the default value to `c("facet_wrap", "facet_grid")` and start the function by validating the argument; the original snippet was lost here, but the standard idiom this points to is `match.arg()`:
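```r
# Assumed reconstruction: match.arg() errors on values outside the allowed
# set and picks the first one as the default. `plot_fn` is a stand-in for
# the actual plotting functions.
plot_fn <- function(facet_wrap_or_grid = c("facet_wrap", "facet_grid")) {
  facet_wrap_or_grid <- match.arg(facet_wrap_or_grid)
  facet_wrap_or_grid
}

plot_fn()               # "facet_wrap" (first value is the default)
# plot_fn("facet_warp")  # errors instead of silently using facet_grid
```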
"facet_warp"
would make it silently switch tofacet_grid
and it would be very difficult to notice the mistake.[x] All functions in
scoringRules_wrappers.R
seem to have the same checks at the beginning. It would be less error-prone to refactor this.[x] This is not a robust way to get the value of an argument with defaults:
  - https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/utils.R#L208
  - https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/utils.R#L220
  - https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/R/utils_data_handling.R#L422-L430
  Instead, you should use something more robust; the original snippet was lost here, but reading the default via `formals()` (assumed below) is one such approach:
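```r
# Assumption, not the reviewer's original snippet: read the default value
# straight from the function definition instead of repeating it.
f <- function(x, by = c("model", "location")) NULL  # stand-in function
default_by <- eval(formals(f)$by)
default_by
#> [1] "model"    "location"
```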
- Deprecated functions from `utils_data_handling.R` should be categorised as such in the pkgdown reference index.
- There is no unit testing here since this is not a testthat function: https://github.com/epiforecasts/scoringutils/blob/de45fb7d78e1a334fe9e24cf3435f648c82a3cb8/tests/testthat/test-bias.R#L48
## Conclusion
This is overall a solid package that could become a widely used tool in forecast sciences. I could not see any bugs in the code and the performance looks very good on the examples I ran. The package interface is clever and can surely prove useful to a large array of users thanks to the two levels of functions (low-level scoring functions vs the all-in-one `eval_forecasts()`).

Two points could slow down / reduce adoption, and these should be fixed for this package to reach its full potential and attract as many users as possible:

- the package remains complex to use. This complexity is in part inherent to the task, but it could nonetheless be reduced by following best practices in software engineering, such as reducing the number of parameters and adopting a consistent naming scheme.
- there is no strong evidence that this package implements the computed metrics correctly. This is especially important for fields that can have a policy impact. Test coverage should be increased, and comparisons to computations via other tools / methods should be added.