Closed Bisaloo closed 1 year ago
To continue with this, I have checkmate
and assertthat
among dependencies in finalsize
now as well. I usually don't sweat having these utilities included (although having both checkmate
and assertthat
is perhaps to be discouraged).
I think our hands might be more tied than we think with needing to plug in to existing work, so it might be good to check what the dependency graph as it were looks like. Other than that, I would go with keeping dependencies down. So for example, I would come down on the data.table
side of things rather than tidyverse
.
Finally, re: the issue of dependencies that require specialist knowledge, this is actually fairly manageable as long as we stick to R. Otoh, P3 uses (or will use) a good deal of Rcpp(Eigen)
. Might be good to think how we could get some more Rcpp
capacity?
To continue with this, I have checkmate and assertthat among dependencies in finalsize now as well. I usually don't sweat having these utilities included (although having both checkmate and assertthat is perhaps to be discouraged).
I would settle for a single tool if possible for checking inputs. I liked that checkmate
covers most of my use-cases so far and is pretty easy to use, but happy to hear thoughts.
Thanks for this thread. I agree it is a very important topic and behind consistent across the board is very important.
Do we use utility packages to check our function arguments? I see that @thibautjombart is using checkmate in https://github.com/epiverse-trace/linelist.
checkmate
or anything else we are happy withAre we okay with using data.table and/or the tidyverse in our packages or do we want to stick with base?
I think a possible answer is: it depends on where in the software stack the package in question sits. For low level packages, I like keeping deps to a minimum. For high-level packages which are going to be typically used by data analysts who start their scripts by library(tydiverse)
we can be more liberal. I guess it depends on which part of tidyverse we pull in. In a nutshell:
%>%
at all cost in packages as it makes debugging much more complicatedrlang
isms have undergone substantial changes over time; not sure if it is more stable now; curious to hear thoughts? tagging @TimTaylor with whom we discussed this for incidence2There is also the possibility of providing optional features : a package can provide some extra functionalities if a dep is present, but does not need it to be installed. An example can be found in the trending package which provides a common interface for different modelling tools. In this package, an interface to brms is provided but optional: https://github.com/reconverse/trending/blob/master/R/models.R#L89-L98
And bmrs itself is only listed as 'suggest' in DESCRIPTION.
Do we decide we don't really care and it's the choice of the first person working on the package? It is important to remember that we will work on many already existing packages, for which it would not be convenient (and might be rude to the original authors) to change the whole dependency stack.
I think we do care, and we can set a standard for the new packages created in Epiverse-TRACE; for other packages we contribute to, I would be more cautious, e.g. discussing any new deps with the authors/maintainers before sending a PR.
Thanks for the issue summarising the discussion. I agree with the points laid out by @thibautjombart. On the topic of argument checking, in the past I have used testit::assert()
for checking argument types or attributes of variables, but have found myself more recently using base R and a fair amount of if
statements and stop()
. I have no preference on the choice of input checking we use. One thing I have found quite useful when developing with users in mind in a translation function that standardises user's input, as it gives the user a bit more flexibility. I've done this using by changing upper to lower case (also case-insensitive regex is an option) and replacing separators with underscores.
I do not have a preference on tidyverse vs data.table, but agree that the benefit of non-standard evaluation is outweighed by the added difficulty in development. I also think that the user-interface should be consistent across epiverse packages, so if we use it for one, we should use it for all user-facing packages. However, I've only used rlang
's NSE in package development once so it is likely I am unaware of all of the benefits of using it.
I would avoid any non-standard evaluation: the added value for the user is often marginal, but it makes the code a lot more complicated, and rlangisms have undergone substantial changes over time; not sure if it is more stable now; curious to hear thoughts? tagging @TimTaylor with whom we discussed this for incidence2
My views have definitely evolved over the last couple of years. Similar to @joshwlambert and @thibautjombart I now think the development overhead of supporting NSE is not worth the benefit and will be removing it from future versions of incidence2
(although that won't be happening for a while). If used it also means that should your users wish to wrap one of your functions they also need need a solid awareness of NSE.
Are we okay with using data.table and/or the tidyverse in our packages or do we want to stick with base?
Do we use utility packages to check our function arguments? I see that @thibautjombart is using checkmate in https://github.com/epiverse-trace/linelist.
On dependencies:
Miscellaneous comments on utility packages:
if
stop
pattern as tends to be very efficient and readable. That said the some combination vctrs/rlang/cli can be quite quite nice when you want to bubble up errors from internal functions.perl=TRUE
). {stringi} is also very nice and prefer it's use in a package to {stringr} as for development that just adds unnecessary overhead.I've tried to stay quite neutral in the original comment but I've stated my position on dependencies in the 'Design principles' document of the fundiversity package (written in collaboration with Matthias Grenié):
We aim at having as few as dependencies as possible, unless they remain relatively lightweight and provide a large speed boost.
As CRAN packages are now automatically archived at the same time of any of their strong dependencies, dependencies for fundiversity should be well established, have a good track record at remaining on CRAN, and ideally already have a large number of reverse-dependencies.
Based on these guidelines, some acceptable dependencies are:
Additionally, special care is taken with packages that rely on external libraries, as their installation might be an issue on shared computing platforms where users don’t have super-user privileges.
However, this discussion is bringing some topics I did not consider and has generated some new ideas:
I agree it's a mess to deal with non standard evaluation as a package developer but I think it's still possible to make a good case for it if the package is expected to be used right in the middle of a tidyverse pipeline. I agree it's not worth it in the general case though.
I'm usually not a huge fan of input checking packages because base R already has good functionality. As mentioned in the blog post, it has also been improved by the addition of named arguments to stopifnot()
. However, I've been thinking about multilingual packages yesterday and I think that having a multilingual package is made much easier if we use a package like checkmate because we would just have to contribute translations there once (vs having to provide translations each time if using custom / hand-crafted error messages)
I'll investigate if we can have a quantitative approach, like described in the itdepends blog post.
Very good point re translations.
On a related note: while I am globally in favour of using checkmate
(I have moved from base R to checkmate
when writing linelist) as I find it is often more concise for checking several things on an argument (e.g. type, length/dimensions, accept NULL
or not) than base R, we would have a better handle on error messages when using base R.
I'll detail the specific case of a dependency on data.table
because it is a good example, and a topic that is likely to surface often.
data.table
is a lightweight package that introduces very few recursive dependencies;data.table
is widely used, which means it is likely to already be installed on the user machines, and that data.table
devs have very strong incentives to keep the package on CRAN;data.table
has a proven track-record of its ability to stay on CRAN, and it is following current best practices in R package development;data.table
is already used in several epiverse-adjacent packages, such as EpiNow2 and socialmixr, which makes it even more likely that users already have it, and reinforces the potential for collaboration with these external packages.Now, I don't think it means that we should use data.table
in all the epiverse-trace packages. We should keep it for packages where it is likely to bring a significant boost in maintainability & performance.
As such, I don't think it would be appropriate in finalsize
, which uses mostly vectors in matrices, nor in epiparameter
, which has limited data wrangling.
Conversely, it would be appropriate in packages with extensive data wrangling, and even more so if performance is crucial. This makes it a good candidate to be a dependency of the scoringutils
package for example.
This issue is the result of a conversation we started with @joshwlambert and @pratikunterwegs.
It would be useful to brainstorm about our views on dependencies in epiverse packages. We will not reinvent the wheel and some dependencies are a no-brainer. Other dependencies are not so clear-cut:
As @joshwlambert said, beyond subjective and stylistic performances, trying to uniformise & standardise our stack can make us more efficient since we won't have to continuously switch between tools. @pratikunterwegs also mentioned that it would be problematic to introduce a dependency to a tool that only one person in the team really masters.
Relevant resources on this topic: