Policy around dependencies

Bisaloo commented 1 year ago

This issue is the result of a conversation we started with @joshwlambert and @pratikunterwegs.

It would be useful to brainstorm about our views on dependencies in epiverse packages. We will not reinvent the wheel and some dependencies are a no-brainer. Other dependencies are not so clear-cut:

Are we okay with using data.table and/or the tidyverse in our packages or do we want to stick with base?
Do we use utility packages to check our function arguments? I see that @thibautjombart is using checkmate in https://github.com/epiverse-trace/linelist.
Do we decide we don't really care and it's the choice of the first person working on the package? It is important to remember that we will work on many already existing packages, for which it would not be convenient (and might be rude to the original authors) to change the whole dependency stack.

As @joshwlambert said, beyond subjective and stylistic performances, trying to uniformise & standardise our stack can make us more efficient since we won't have to continuously switch between tools. @pratikunterwegs also mentioned that it would be problematic to introduce a dependency to a tool that only one person in the team really masters.

Relevant resources on this topic:

It depends - A dialog about dependencies, by Jim Hester
Checking the inputs of your R functions, by Hugo Gruson, Sam Abbott, Carl Pearson (Tim Taylor also contributed to the initial discussion even though he's not included in the authors)

pratikunterwegs commented 1 year ago

To continue with this, I have checkmate and assertthat among dependencies in finalsize now as well. I usually don't sweat having these utilities included (although having both checkmate and assertthat is perhaps to be discouraged).

I think our hands might be more tied than we think with needing to plug in to existing work, so it might be good to check what the dependency graph as it were looks like. Other than that, I would go with keeping dependencies down. So for example, I would come down on the data.table side of things rather than tidyverse.

Finally, re: the issue of dependencies that require specialist knowledge, this is actually fairly manageable as long as we stick to R. Otoh, P3 uses (or will use) a good deal of Rcpp(Eigen). Might be good to think how we could get some more Rcpp capacity?

thibautjombart commented 1 year ago

To continue with this, I have checkmate and assertthat among dependencies in finalsize now as well. I usually don't sweat having these utilities included (although having both checkmate and assertthat is perhaps to be discouraged).

I would settle for a single tool if possible for checking inputs. I liked that checkmate covers most of my use-cases so far and is pretty easy to use, but happy to hear thoughts.

thibautjombart commented 1 year ago

Thanks for this thread. I agree it is a very important topic and behind consistent across the board is very important.

Do we use utility packages to check our function arguments? I see that @thibautjombart is using checkmate in https://github.com/epiverse-trace/linelist.

I think input-checking with meaningful error messages should be a must in all user-facing functions; my experience both as dev but also teaching is most problems with code not running come from mis-specified inputs, and catching these as early as possible can be a huge time-saver; it would make sense to settle for a common standard, either checkmate or anything else we are happy with

Are we okay with using data.table and/or the tidyverse in our packages or do we want to stick with base?

I think a possible answer is: it depends on where in the software stack the package in question sits. For low level packages, I like keeping deps to a minimum. For high-level packages which are going to be typically used by data analysts who start their scripts by library(tydiverse) we can be more liberal. I guess it depends on which part of tidyverse we pull in. In a nutshell:

I use tidyverse extensively when analysing data, but I try to use base R as much as possible when coding packages
I would advise to avoid %>% at all cost in packages as it makes debugging much more complicated
I would avoid any non-standard evaluation: the added value for the user is often marginal, but it makes the code a lot more complicated, and rlangisms have undergone substantial changes over time; not sure if it is more stable now; curious to hear thoughts? tagging @TimTaylor with whom we discussed this for incidence2
I like the quantitative approach outlined in the linked post; it may be a nice thing to use as a standard, or do you find it would be overkill?

There is also the possibility of providing optional features : a package can provide some extra functionalities if a dep is present, but does not need it to be installed. An example can be found in the trending package which provides a common interface for different modelling tools. In this package, an interface to brms is provided but optional: https://github.com/reconverse/trending/blob/master/R/models.R#L89-L98

And bmrs itself is only listed as 'suggest' in DESCRIPTION.

Do we decide we don't really care and it's the choice of the first person working on the package? It is important to remember that we will work on many already existing packages, for which it would not be convenient (and might be rude to the original authors) to change the whole dependency stack.

I think we do care, and we can set a standard for the new packages created in Epiverse-TRACE; for other packages we contribute to, I would be more cautious, e.g. discussing any new deps with the authors/maintainers before sending a PR.

joshwlambert commented 1 year ago

Thanks for the issue summarising the discussion. I agree with the points laid out by @thibautjombart. On the topic of argument checking, in the past I have used testit::assert() for checking argument types or attributes of variables, but have found myself more recently using base R and a fair amount of if statements and stop(). I have no preference on the choice of input checking we use. One thing I have found quite useful when developing with users in mind in a translation function that standardises user's input, as it gives the user a bit more flexibility. I've done this using by changing upper to lower case (also case-insensitive regex is an option) and replacing separators with underscores.

I do not have a preference on tidyverse vs data.table, but agree that the benefit of non-standard evaluation is outweighed by the added difficulty in development. I also think that the user-interface should be consistent across epiverse packages, so if we use it for one, we should use it for all user-facing packages. However, I've only used rlang's NSE in package development once so it is likely I am unaware of all of the benefits of using it.

TimTaylor commented 1 year ago

I would avoid any non-standard evaluation: the added value for the user is often marginal, but it makes the code a lot more complicated, and rlangisms have undergone substantial changes over time; not sure if it is more stable now; curious to hear thoughts? tagging @TimTaylor with whom we discussed this for incidence2

My views have definitely evolved over the last couple of years. Similar to @joshwlambert and @thibautjombart I now think the development overhead of supporting NSE is not worth the benefit and will be removing it from future versions of incidence2 (although that won't be happening for a while). If used it also means that should your users wish to wrap one of your functions they also need need a solid awareness of NSE.

Are we okay with using data.table and/or the tidyverse in our packages or do we want to stick with base?

Do we use utility packages to check our function arguments? I see that @thibautjombart is using checkmate in https://github.com/epiverse-trace/linelist.

On dependencies:

I prefer to minimise them where possible.

Miscellaneous comments on utility packages:

Areas such as input checking and error messaging are relatively low cost in development time, in that they are easy to switch out should an upstream package not meet your needs, be inefficient or prove flaky. For user facing functions, I actually prefer the if stop pattern as tends to be very efficient and readable. That said the some combination vctrs/rlang/cli can be quite quite nice when you want to bubble up errors from internal functions.
If you expect a user may call a function many times (e.g. in a bootstrap scenario) the overhead of input checking does add up so the choice of package could be important. Although, in this situation, you are likely better to provide an additional "fast" function that doesn't check inputs and force the user to check them where appropriate.
If any of your packages do anything beyond the simplest of aggregation you will want to utilise either {dplyr} or {data.table}. I have a preference to {data.table} but note that it doesn't support non-atomic columns like vctrs record types. This may or may not be a problem depending on use case but can be annoying (see https://github.com/nhs-r-community/NHSRepisodes/blob/main/README.md for an explanation of when this was recently a problem for me and forced me to the lower level stuff).
I tend to use the base implementations for regular expressions with the PCRE2 backend (with perl=TRUE). {stringi} is also very nice and prefer it's use in a package to {stringr} as for development that just adds unnecessary overhead.

Bisaloo commented 1 year ago

I've tried to stay quite neutral in the original comment but I've stated my position on dependencies in the 'Design principles' document of the fundiversity package (written in collaboration with Matthias Grenié):

We aim at having as few as dependencies as possible, unless they remain relatively lightweight and provide a large speed boost.

As CRAN packages are now automatically archived at the same time of any of their strong dependencies, dependencies for fundiversity should be well established, have a good track record at remaining on CRAN, and ideally already have a large number of reverse-dependencies.

Based on these guidelines, some acceptable dependencies are:

Rccp

geometry

Additionally, special care is taken with packages that rely on external libraries, as their installation might be an issue on shared computing platforms where users don’t have super-user privileges.

Bisaloo commented 1 year ago

However, this discussion is bringing some topics I did not consider and has generated some new ideas:

I agree it's a mess to deal with non standard evaluation as a package developer but I think it's still possible to make a good case for it if the package is expected to be used right in the middle of a tidyverse pipeline. I agree it's not worth it in the general case though.
I'm usually not a huge fan of input checking packages because base R already has good functionality. As mentioned in the blog post, it has also been improved by the addition of named arguments to stopifnot(). However, I've been thinking about multilingual packages yesterday and I think that having a multilingual package is made much easier if we use a package like checkmate because we would just have to contribute translations there once (vs having to provide translations each time if using custom / hand-crafted error messages)
I'll investigate if we can have a quantitative approach, like described in the itdepends blog post.

thibautjombart commented 1 year ago

Very good point re translations.

On a related note: while I am globally in favour of using checkmate (I have moved from base R to checkmate when writing linelist) as I find it is often more concise for checking several things on an argument (e.g. type, length/dimensions, accept NULL or not) than base R, we would have a better handle on error messages when using base R.

Bisaloo commented 1 year ago

I'll detail the specific case of a dependency on data.table because it is a good example, and a topic that is likely to surface often.

data.table is a lightweight package that introduces very few recursive dependencies;
data.table is widely used, which means it is likely to already be installed on the user machines, and that data.table devs have very strong incentives to keep the package on CRAN;
data.table has a proven track-record of its ability to stay on CRAN, and it is following current best practices in R package development;
data.table is already used in several epiverse-adjacent packages, such as EpiNow2 and socialmixr, which makes it even more likely that users already have it, and reinforces the potential for collaboration with these external packages.

Now, I don't think it means that we should use data.table in all the epiverse-trace packages. We should keep it for packages where it is likely to bring a significant boost in maintainability & performance. As such, I don't think it would be appropriate in finalsize, which uses mostly vectors in matrices, nor in epiparameter, which has limited data wrangling.

Conversely, it would be appropriate in packages with extensive data wrangling, and even more so if performance is crucial. This makes it a good candidate to be a dependency of the scoringutils package for example.

epiverse-trace / blueprints

Policy around dependencies #7