NIEHS / beethoven

BEETHOVEN is: Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality
https://niehs.github.io/beethoven/

Package deployment or pipeline discussion #191

Closed kyle-messier closed 6 months ago

kyle-messier commented 9 months ago

targets seems to be the latest and greatest R package for pipeline development.

sigmafelix commented 9 months ago

Perhaps this issue overlaps substantially with #176? I will look at the targets package.

kyle-messier commented 9 months ago

Yes, sorry, I forgot about that issue. At first glance, targets appears to be the package of choice, but we'll need to look into it more.

kyle-messier commented 9 months ago

targets definitely seems like the way to go. There is a tutorial written by the authors here. In short, it keeps track of dependencies in a pipeline, tracks what needs to be re-run when something changes, and produces a nice visualization of the pipeline.


kyle-messier commented 8 months ago

@sigmafelix @dzilber @eva0marques @dawranadeep @Sanisha003 @mitchellmanware @MAKassien

I played around with implementing targets on a dev branch. Implementation seems straightforward, although I feel our code base is not ready for it yet. We need to have functions (i.e., targets) ready to connect in a pipeline. With that said, I think the pipeline can motivate us to further organize our code. Along the lines of our discussions to move the download_* functions in Rinput/ to R/, I think the approach is to create a suite of functions in R/ that describe the high-level steps in the analysis pipeline, similar to the pipeline in the readme:

  1. Download.R
  2. Preprocess.R
  3. Model_Fit.R
  4. etc.
  5. etc.

Then each of those has a sub-suite of functions that actually do the work. I think that will make defining the targets for a reproducible pipeline more straightforward.
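As a rough illustration of how the high-level steps above could map onto a targets configuration file, here is a minimal sketch. The function names `download_all()`, `preprocess()`, and `fit_model()` are hypothetical placeholders assumed to be defined in the files under R/; they are not existing functions in this repository.

```r
# _targets.R: hypothetical sketch; all step function names are placeholders.
library(targets)

# Source every script in R/ (assumed to hold Download.R, Preprocess.R,
# Model_Fit.R, each defining one high-level step function).
tar_source("R")

# Each target names one pipeline step; targets infers the dependency
# order from the symbols used in each command.
list(
  tar_target(raw_data, download_all()),          # hypothetical, from Download.R
  tar_target(covariates, preprocess(raw_data)),  # hypothetical, from Preprocess.R
  tar_target(model_fit, fit_model(covariates))   # hypothetical, from Model_Fit.R
)
```

With a file like this in the project root, `tar_make()` would run only the steps that are out of date, and `tar_visnetwork()` would draw the dependency graph.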

If there are any thoughts, please let us know!

@dzilber For more "complicated" targets, the package does describe how to implement so-called dynamic targets using factories. I'm wondering if your familiarity with factory functions could help interpret this. Better yet, perhaps you could use this as an example for demonstrating factory functions in our group meetings.

kyle-messier commented 8 months ago

@sigmafelix @mitchellmanware

Is the zzz.R file like the ones described here? It looks like it is used to set up a working directory, etc. Seems good for now; we'll see if it becomes obsolete down the line with the targets implementation. Thanks!

sigmafelix commented 7 months ago

@Spatiotemporal-Exposures-and-Toxicology Yes, I was thinking of making a vignette for initial settings but ended up adding a .onLoad call in zzz.R for guidance. I agree with removing this file as soon as the pipeline is completed and documented.

sigmafelix commented 7 months ago

A very rudimentary example of a target configuration file:

https://github.com/sigmafelix/workbenches/blob/8237a4bae8732b36f81f13c80e4a617e099709a6/target-tidy/_targets.R#L1-L54

Most of the file and function names are examples.

sigmafelix commented 7 months ago

The pipeline will need an imputation function/step. For various reasons, MODIS data have some missing days (from a few days up to a month during 2018-2022), so the covariates calculated from them are missing those days as well. The days without raw data are not contained in the covariate data.frames at all. Although we already developed a function to check whether any NAs exist in outputs, the imputation function should be run before that NA check.

The imputation method (simple [e.g., median/mean for the week/month/etc.]/linear/ML-based) needs to be discussed.
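To make the two parts of this step concrete, here is a minimal base R sketch, assuming a single-site covariate table with a `date` column. The function names, the `ndvi` column, and the example values are hypothetical, and linear interpolation is shown only as one of the candidate methods, not as the agreed choice.

```r
# Hypothetical covariate table: 2018-01-03 and 2018-01-04 are absent entirely.
covar <- data.frame(
  date = as.Date(c("2018-01-01", "2018-01-02", "2018-01-05")),
  ndvi = c(0.2, 0.3, 0.6)
)

# Step 1: add a row for every calendar day so that gaps become explicit NAs.
complete_days <- function(df, date_col = "date") {
  all_days <- data.frame(
    seq(min(df[[date_col]]), max(df[[date_col]]), by = "day")
  )
  names(all_days) <- date_col
  merge(all_days, df, by = date_col, all.x = TRUE)
}

# Step 2: fill the NAs by linear interpolation in time.
# Requires at least two observed values in value_col.
impute_linear <- function(df, value_col, date_col = "date") {
  x <- as.numeric(df[[date_col]])
  y <- df[[value_col]]
  ok <- !is.na(y)
  df[[value_col]] <- stats::approx(x[ok], y[ok], xout = x, rule = 2)$y
  df
}

filled <- impute_linear(complete_days(covar), "ndvi")
```

Completing the date sequence first matters here because the existing NA check cannot flag days that have no row at all.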

kyle-messier commented 7 months ago

sigmafelix commented 7 months ago

After #255 is completed, additional changes to DESCRIPTION need to follow.