ghost opened this issue 8 years ago
Brainstorming a few of the use cases:
I have a rough version of multi-site PEcAn working here. A couple of notes are below. Comments welcome, but this is really just a brain dump / progress report; feel free to wait until I get things cleaned up a bit.

Added `parse.global.settings` to the `settings` package. Currently it handles just the multi-site case, turning a generic settings list (parsed from XML) with a `<runs><run>...` block into a SettingsList with one individual Settings object per `<run>`. The idea is that future use cases would only need to add a little code to this function to create an appropriate SettingsList, and the rest of the workflow could be unchanged.
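To make the idea concrete, here is a hedged sketch (not the actual PEcAn implementation) of how a `parse.global.settings`-style function might split a parsed settings list into one Settings object per `<run>`. All names and the input structure are illustrative, loosely mimicking `XML::xmlToList()` output:

```r
# Illustrative only: split a multi-run settings list into a list of
# per-run Settings objects, each inheriting the non-run-specific settings.
parse.global.settings <- function(settings) {
  # Settings shared by every run (database, meta-analysis, etc.)
  global <- settings[names(settings) != "runs"]
  # One Settings object per <run> block, each combined with the globals
  lapply(settings$runs, function(run) {
    s <- c(global, list(run = run))
    class(s) <- "Settings"
    s
  })
}

# Hypothetical two-site settings list, as parsing the XML might produce it
settings <- list(
  database = list(host = "localhost"),
  runs = list(
    run = list(site = list(id = 772)),
    run = list(site = list(id = 676))
  )
)
settings.list <- parse.global.settings(settings)
length(settings.list)  # 2: one Settings object per run
```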
For the work I'm doing, we need to automate doing similar PEcAn runs at many different sites. As I understand, there's a general interest in building this capability into PEcAn—essentially, having a new master workflow that triggers multiple related workflows in an intelligent way. @mdietze and I have discussed this offline a bit, and I wanted to elicit additional feedback here. I left out a lot of details and this still grew into a pretty big post, so apologies in advance for that...
Settings

I'm planning that we'd still have a single settings XML file for a multi-run workflow. First we'll need to separate out some non-run-specific settings from the `run` block, per GH-212. Then we can encapsulate multiple `run` blocks under a new `runs` (or `runlist`?) tag.

Changes to the workflow
There are some parts of the existing `workflow.R` that only need to be run once, even when initiating a multi-run workflow, including the `library()` and `options()` calls. I think the rest (CONFIG, MODEL, OUTPUT, ENSEMBLE, SENSITIVITY, PDA) is conceptually run-specific.
So, the idea is that the master workflow would get set up, read in the settings file, and do the meta-analysis in about the same way it currently does. Then it would prepare `settings` objects for each run, and loop over these to perform the run-specific steps.

I'm not currently thinking about any grand meta-results-collection/analysis scheme; I just want the workflow to trigger all the runs. But obviously there are cool options for analyzing/displaying results from e.g. multiple sites run in a single workflow.
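The shape of that master workflow might look like the sketch below. The step functions here are placeholder stubs standing in for the real PEcAn modules, not actual PEcAn code:

```r
# Illustrative only: shared steps run once, run-specific steps looped.
run.meta.analysis   <- function(settings) "meta"          # shared step (stub)
run.single.workflow <- function(settings) {               # CONFIG/MODEL/... (stub)
  paste("ran site", settings$run$site$id)
}

master.workflow <- function(settings.list, global.settings) {
  run.meta.analysis(global.settings)          # done once, up front
  lapply(settings.list, run.single.workflow)  # run-specific steps per run
}

# Hypothetical per-run settings for two sites
settings.list <- list(
  list(run = list(site = list(id = 772))),
  list(run = list(site = list(id = 676)))
)
results <- master.workflow(settings.list, global.settings = list())
results[[1]]  # "ran site 772"
```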
Directory structure
I'm assuming the master settings object will specify a main output directory, and then individual workflows will send all results to subdirectories. So I would propose replacing the "run" and "out" directories specified under `host` with a single "workflow.directory" entry. Then we can add a `run.name` tag under `run` to be used for naming a run-specific subdirectory within the workflow dir (e.g. `<workflow.directory>/<run.name>/`); the "run" and "out" directories can go in there.

An alternative is to have master run/out directories (still specified under `host`), and put run-specific subdirs in each. This strikes me as somewhat less future-proof, but it would preserve the ability to keep "out" and "run" on separate drives (which I assume is why there are two directories under `host` rather than just one?).

Turn workflow.R into `workflow(...)`
I was thinking this and other tasks would be made easier by converting the current workflow script into a function. Initially it could simply take a settings object or path to an xml file as an argument, basically like the script does now. But we could also do things like add boolean arguments to turn on/off modules (overriding what's in the settings object—I thought this might be handy for testing). In offline discussions Mike and I had some of the same ideas about ultimately wanting the workflow to be very modular. I thought functionalizing it was a good first step (and had some specific details in mind that would be useful to me for the current work), though he wasn't so sure. Perhaps this belongs in a separate issue, but thought I'd mention it here.
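As a rough sketch of what functionalizing could look like, here is a toy `workflow()` with boolean module toggles that override the settings object. The stage names and bodies are stubs for illustration, not the real workflow.R logic:

```r
# Illustrative only: workflow.R as a function with on/off module arguments.
workflow <- function(settings, do.ensemble = TRUE, do.sensitivity = TRUE) {
  stages <- c("config", "model", "output")   # always-run stages (illustrative)
  if (do.ensemble)    stages <- c(stages, "ensemble")
  if (do.sensitivity) stages <- c(stages, "sensitivity")
  stages  # in the real thing, each stage would actually execute here
}

# Handy for testing: skip the expensive modules regardless of settings
workflow(list(), do.ensemble = FALSE, do.sensitivity = FALSE)
# c("config", "model", "output")
```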
Job management
Finally, since the meta-workflow is going to call potentially many individual workflows and each of those could require potentially many model runs, some thought needs to go into how to manage the jobs. Obviously we don't want to just run the individual workflows sequentially. On the other hand, running them completely in parallel is probably a bad idea too—even if all the model runs are getting farmed out to a cluster where they're handled by a queue, you'd still have a process for each workflow running on the main machine setting up jobs, waiting for them, etc.
Again, maybe a separate issue (apparently related to some tricks of @robkooper's for batching SA/EA runs on geo?), but getting the ball rolling here...
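One simple middle ground between fully sequential and fully parallel is a worker pool capped at a fixed size, which base R's `parallel` package already provides. Below is a sketch under that assumption; `run.single.workflow` is again a stand-in stub, and `max.jobs` is a hypothetical knob:

```r
# Illustrative only: run individual workflows concurrently, but never more
# than max.jobs at once, using a PSOCK cluster from the parallel package.
library(parallel)

run.single.workflow <- function(settings) settings$run$site$id * 10  # stub

# Hypothetical per-run settings for six sites
settings.list <- lapply(1:6, function(i) list(run = list(site = list(id = i))))

max.jobs <- 2
cl <- makeCluster(max.jobs)   # pool of 2 workers caps concurrency
results <- parLapply(cl, settings.list, run.single.workflow)
stopCluster(cl)
unlist(results)  # 10 20 30 40 50 60
```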