epinowcast / epidist

An R package for estimating epidemiological delay distributions
http://epidist.epinowcast.org/
Other
9 stars 4 forks source link

Replace `dplyr` use with `data.table` (or vice versa) #246

Open athowes opened 4 weeks ago

athowes commented 4 weeks ago

I am not a data.table native and so when under pressure default to using dplyr. I think this is fine to get things done for the first release of the package, but after that I think I should spend some time more properly learning data.table and going back over the places I've used dplyr in this package and replacing them. Reasons to do this? It's faster I think. And perhaps is one less dependency. Arguably it's not a good use of my time to do this, but learning data.table seems useful anyway.

pearsonca commented 4 weeks ago

I'm DT native, so happy to tagged on discussions of "how do I ..." / review PRs aimed in this direction / etc.

seabbs commented 4 weeks ago

I just checked out dependencies and whilst brms doesn't depend on dplyr bayesplot which brms depends on does so adding/removing dplyr doesn't impact the dependency graph.

That being said I strongly think we want to only have one design language when it comes to data munging and as everywhere else that is data.table it feels like it should remain so.

We could discuss porting the whole package to dplyr potentially. I am not a massive func as I have struggled with maintaining dplyr based packages in the past but. Ithink things are now more stable

athowes commented 3 weeks ago

As well as in https://github.com/epinowcast/epidist/pull/242#discussion_r1718224312, i.e.

prior <- dplyr::full_join(...)

I've also been using dplyr in the vignettes.

athowes commented 1 week ago

I am happy to either try to learn data.table and move things to data.table or to try moving everything to dplyr which I already know well.

I don't have a great sense of the other aspects of this choice. I've not maintained packages using dplyr or data.table before.

If there aren't objections I am personally happy enough with dplyras it's what I already know.

I just checked out dependencies and whilst brms doesn't depend on dplyr bayesplot which brms depends on does so adding/removing dplyr doesn't impact the dependency graph.

This pushes more towards dplyr OK as as far as I understood one main benefit of data.table was the low dependencies.

seabbs commented 1 week ago

Yes, that and stability. Having said that I think dplyr is much more stable now. As this package is very much moving towards a cog in a wider tidy ecosystem I think its a good call to switch over to dplyr as much as possible and drop data.table.

@pearsonca thoughts?

FYI I see this as needing to be solved ahead of 0.1.0 so as to reduce ongoing technical debt

pearsonca commented 1 week ago

I'm generally anti-tidyverse. Too tight knit - pretend modular, not actually modular - though I'll agree they are doing better at coupling (and that as a whole system, does fine - I just prefer less vendor lock in).

For a moment considering false dichotomy problem: how fancy is the munging anyway? Would it just be fine in base R? Cursory glance suggests that the only in code use is join, filter, and select - that could be done approximately with merge, subset, and some sort of manual select. Those are all manageable in way that works portably for data.frame, data.table, and tibble.

Leaving it in the vignette is fine - probably desirable for the audience, and just becomes a "suggests" which is better posture wise (even if there is a transient dependency due to other packages - maybe they'll get better at some future time).

seabbs commented 1 week ago

I think those are reasonable points @pearsonca but I thing the missing piece is what system is going to be easiest to maintain for the active contributors. Historically that has been data.table but as @athowes is now the maintainer (and seems more comfortable with dplyr since it keeps slipping in) and other new contributors this is also likely to be the case.

The counter point that is strongest to me is that most of the code is data.tablee, its performant easy to maintain and fairly readable. To me the base arg doesn't really have that

pearsonca commented 1 week ago

I generally use the merge syntax in base over the full join data.table syntax - too fiddly. subset(dt, ...filter...) is definitely as easy as dt[...filter...]. Both of those operations are provided S3 generics by data.table and tibble, so they are performant and type preserving. Making the select portable is about the only thing that can be complicated last I checked, but dt might have revised this to less data.frame-paradigm breaking.

I think there's a bit of balance to be struck between easiest-for-active maintainers vs other engineering objectives (e.g. easiest for new maintainers [which might also cut towards tidyverse], reducing direct vs encapsulated dependencies).