PecanProject / pecan

The Predictive Ecosystem Analyzer (PEcAn) is an integrated ecological bioinformatics toolbox.
www.pecanproject.org

Add Rmd template for basic PEcAn workflow #1866

Open tonygardella opened 6 years ago

tonygardella commented 6 years ago

Opening this issue to discuss the closed PR #1733.

Description

@dlebauer had the idea to create a template for a basic PEcAn workflow.

Context

To carry over some of the points made in #1733:

infotroph commented 6 years ago

To the last point: PEcAn.workflow was originally created because too many of the workflow-ish functions were causing circular dependencies in the other packages where they lived, but now that it exists I think it gives an opportunity to define friendlier and more flexible workflows.

Rather than trying to autogenerate one template for all purposes, I think a little thought and care in defining the workflow package, and documenting its functions as the interfaces to the rest of PEcAn that they are, could go a long way toward making the workflow(s) more user-friendly.

ashiklom commented 6 years ago

From a conversation @mdietze and I had a while ago, perhaps the thing to strive for is to have the workflow be determined entirely by the settings object -- i.e. define run_workflow <- function(settings) {...} that executes the modules in the order they are presented in settings. Then, the entire workflow script would be:

run_workflow(read.settings("/path/to/pecan.xml"))

We could even wrap this in an executable R script that could be called directly from the command line:

#!/usr/bin/env Rscript
# Run one workflow per settings file passed on the command line,
# defaulting to pecan.xml when no arguments are given.
args <- commandArgs(trailingOnly = TRUE)
settings_files <- if (length(args) == 0) "pecan.xml" else args
for (s in settings_files) {
  run_workflow(read.settings(s))
}
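
For illustration, here is a minimal sketch of what a settings-driven run_workflow itself could look like. The settings$workflow$modules field and the name-based function lookup are assumptions made for the sketch, not PEcAn's actual pecan.xml schema:

# Sketch only: assumes each workflow step is an R function that takes and
# returns a settings object, and that the step names are listed, in order,
# under a hypothetical <workflow><modules> section of pecan.xml.
run_workflow <- function(settings) {
  for (module_name in settings$workflow$modules) {
    module_fn <- match.fun(module_name)  # look up the step function by name
    settings <- module_fn(settings)      # each step returns updated settings
  }
  invisible(settings)
}
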
mdietze commented 6 years ago

I thought for sure that there was an issue open about this already, as I likewise feel like this is something that has been discussed previously (I think @rykelly was on that thread), and was definitely part of my long-term planning.

There is also another issue that discusses doing this within do.conversions as well, perhaps as a starting point, because the order in which specific inputs need to be processed does seem to vary a bit from model to model based on what pieces of information they put in which files. I think @ankurdesai may have been on that thread, since part of the discussion centered around the ability to download and process two met products and then fuse them together (e.g. using a regional product to gap-fill a local product, which requires first training the regional downscaling against the local data).

dlebauer commented 6 years ago

run_workflow sounds like a good idea, but I think that should be in addition to, rather than instead of, modular and composable workflows that use R functions.

rykelly commented 6 years ago

We definitely discussed this back then, though it's a hazy memory now. I think my view was and would still be that ideally the various PEcAn functions would be defined with explicit arguments and return values, as opposed to taking and returning a settings object that holds all kinds of extraneous stuff. There would be lots of benefits to that from a design standpoint, I think.

Then, the PEcAn.workflow package (that's new, right?) could consolidate all the requisite knowledge about how to move information around among the modules. For one thing, it could handle moving the right info in and out of the Settings object to call each module (I think that's what I had in mind with the runModule functions I remember I'd started writing for at least some of the major modules). It could also contain functions that wrap up all the other miscellany in workflow.R (I remember various status flags and some other logic).

And then yes, I agree with @dlebauer that the workflow should eventually become basically my_settings %>% runModule1() %>% runModule2() .... That could easily be wrapped into a function like the one @ashiklom suggests:

run_workflow <- function(settings, modules = settings$workflow_modules) {
  for (module in modules) {
    settings <- module(settings)
  }
  return(settings)
}

but this still leaves everything transparent and flexible, like David recommends.
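
To make the pipe-style composition concrete, a purely illustrative example follows; runModule.meta.analysis and runModule.write.configs are placeholder step names, not confirmed PEcAn functions:

# Illustrative only: the runModule.* names stand in for whatever
# settings-in / settings-out step functions PEcAn.workflow exports.
library(magrittr)
settings <- read.settings("pecan.xml") %>%
  runModule.meta.analysis() %>%
  runModule.write.configs()
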

I'd also think about introducing a parent object to pass through the workflow, rather than the Settings object. For instance, you could have a class Analysis that contains the settings, obviously, but can also hold other stuff. Then you could move away from having the workflow alter its own settings, which sounds like a bad idea on its face and has definitely caused bugs and confusion in the past. You could also have Analysis$results or whatever contain non-settings data that needs to be moved between modules (DB connections? MCMC samples?). I'm just thinking this could save the poor PEcAn settings from doing the quadruple (roughly) duty it's always done, and maybe avoid some of the need to keep track of data saved to and read from disk throughout the workflow (which is asking for trouble).
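
A rough sketch of that parent-object idea; Analysis is not an existing PEcAn class, and the constructor and field names here are made up:

# Sketch only: a simple list-backed container so modules can pass results,
# DB connections, etc. alongside the (unmodified) settings.
new_analysis <- function(settings) {
  structure(
    list(
      settings = settings,  # the original settings, left untouched
      results  = list(),    # module outputs (e.g. MCMC samples)
      dbcon    = NULL       # shared resources such as a database connection
    ),
    class = "Analysis"
  )
}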

Just a few cents worth from someone who's out of the game but, apparently, still likes to think (and pontificate) about PEcAn architecture...

github-actions[bot] commented 4 years ago

This issue is stale because it has been open 365 days with no activity.