epiverse-trace / blueprints

Software development blueprints for epiverse-trace
https://epiverse-trace.github.io/blueprints
Other
3 stars 3 forks source link

Policy around dependencies #7

Closed Bisaloo closed 1 year ago

Bisaloo commented 1 year ago

This issue is the result of a conversation we started with @joshwlambert and @pratikunterwegs.

It would be useful to brainstorm about our views on dependencies in epiverse packages. We will not reinvent the wheel and some dependencies are a no-brainer. Other dependencies are not so clear-cut:

As @joshwlambert said, beyond subjective and stylistic performances, trying to uniformise & standardise our stack can make us more efficient since we won't have to continuously switch between tools. @pratikunterwegs also mentioned that it would be problematic to introduce a dependency to a tool that only one person in the team really masters.

Relevant resources on this topic:

pratikunterwegs commented 1 year ago

To continue with this, I have checkmate and assertthat among dependencies in finalsize now as well. I usually don't sweat having these utilities included (although having both checkmate and assertthat is perhaps to be discouraged).

I think our hands might be more tied than we think with needing to plug in to existing work, so it might be good to check what the dependency graph as it were looks like. Other than that, I would go with keeping dependencies down. So for example, I would come down on the data.table side of things rather than tidyverse.

Finally, re: the issue of dependencies that require specialist knowledge, this is actually fairly manageable as long as we stick to R. Otoh, P3 uses (or will use) a good deal of Rcpp(Eigen). Might be good to think how we could get some more Rcpp capacity?

thibautjombart commented 1 year ago

To continue with this, I have checkmate and assertthat among dependencies in finalsize now as well. I usually don't sweat having these utilities included (although having both checkmate and assertthat is perhaps to be discouraged).

I would settle for a single tool if possible for checking inputs. I liked that checkmate covers most of my use-cases so far and is pretty easy to use, but happy to hear thoughts.

thibautjombart commented 1 year ago

Thanks for this thread. I agree it is a very important topic and behind consistent across the board is very important.

Do we use utility packages to check our function arguments? I see that @thibautjombart is using checkmate in https://github.com/epiverse-trace/linelist.

Are we okay with using data.table and/or the tidyverse in our packages or do we want to stick with base?

I think a possible answer is: it depends on where in the software stack the package in question sits. For low level packages, I like keeping deps to a minimum. For high-level packages which are going to be typically used by data analysts who start their scripts by library(tydiverse) we can be more liberal. I guess it depends on which part of tidyverse we pull in. In a nutshell:

There is also the possibility of providing optional features : a package can provide some extra functionalities if a dep is present, but does not need it to be installed. An example can be found in the trending package which provides a common interface for different modelling tools. In this package, an interface to brms is provided but optional: https://github.com/reconverse/trending/blob/master/R/models.R#L89-L98

And bmrs itself is only listed as 'suggest' in DESCRIPTION.

Do we decide we don't really care and it's the choice of the first person working on the package? It is important to remember that we will work on many already existing packages, for which it would not be convenient (and might be rude to the original authors) to change the whole dependency stack.

I think we do care, and we can set a standard for the new packages created in Epiverse-TRACE; for other packages we contribute to, I would be more cautious, e.g. discussing any new deps with the authors/maintainers before sending a PR.

joshwlambert commented 1 year ago

Thanks for the issue summarising the discussion. I agree with the points laid out by @thibautjombart. On the topic of argument checking, in the past I have used testit::assert() for checking argument types or attributes of variables, but have found myself more recently using base R and a fair amount of if statements and stop(). I have no preference on the choice of input checking we use. One thing I have found quite useful when developing with users in mind in a translation function that standardises user's input, as it gives the user a bit more flexibility. I've done this using by changing upper to lower case (also case-insensitive regex is an option) and replacing separators with underscores.

I do not have a preference on tidyverse vs data.table, but agree that the benefit of non-standard evaluation is outweighed by the added difficulty in development. I also think that the user-interface should be consistent across epiverse packages, so if we use it for one, we should use it for all user-facing packages. However, I've only used rlang's NSE in package development once so it is likely I am unaware of all of the benefits of using it.

TimTaylor commented 1 year ago

I would avoid any non-standard evaluation: the added value for the user is often marginal, but it makes the code a lot more complicated, and rlangisms have undergone substantial changes over time; not sure if it is more stable now; curious to hear thoughts? tagging @TimTaylor with whom we discussed this for incidence2

My views have definitely evolved over the last couple of years. Similar to @joshwlambert and @thibautjombart I now think the development overhead of supporting NSE is not worth the benefit and will be removing it from future versions of incidence2 (although that won't be happening for a while). If used it also means that should your users wish to wrap one of your functions they also need need a solid awareness of NSE.

Are we okay with using data.table and/or the tidyverse in our packages or do we want to stick with base?

Do we use utility packages to check our function arguments? I see that @thibautjombart is using checkmate in https://github.com/epiverse-trace/linelist.

On dependencies:

Miscellaneous comments on utility packages:

Bisaloo commented 1 year ago

I've tried to stay quite neutral in the original comment but I've stated my position on dependencies in the 'Design principles' document of the fundiversity package (written in collaboration with Matthias Grenié):

We aim at having as few as dependencies as possible, unless they remain relatively lightweight and provide a large speed boost.

As CRAN packages are now automatically archived at the same time of any of their strong dependencies, dependencies for fundiversity should be well established, have a good track record at remaining on CRAN, and ideally already have a large number of reverse-dependencies.

Based on these guidelines, some acceptable dependencies are:

Additionally, special care is taken with packages that rely on external libraries, as their installation might be an issue on shared computing platforms where users don’t have super-user privileges.

Bisaloo commented 1 year ago

However, this discussion is bringing some topics I did not consider and has generated some new ideas:

thibautjombart commented 1 year ago

Very good point re translations.

On a related note: while I am globally in favour of using checkmate (I have moved from base R to checkmate when writing linelist) as I find it is often more concise for checking several things on an argument (e.g. type, length/dimensions, accept NULL or not) than base R, we would have a better handle on error messages when using base R.

Bisaloo commented 1 year ago

I'll detail the specific case of a dependency on data.table because it is a good example, and a topic that is likely to surface often.

Now, I don't think it means that we should use data.table in all the epiverse-trace packages. We should keep it for packages where it is likely to bring a significant boost in maintainability & performance. As such, I don't think it would be appropriate in finalsize, which uses mostly vectors in matrices, nor in epiparameter, which has limited data wrangling.

Conversely, it would be appropriate in packages with extensive data wrangling, and even more so if performance is crucial. This makes it a good candidate to be a dependency of the scoringutils package for example.