ghost opened this issue 8 years ago
Brainstorming a few of the use cases:
I have a rough version of multi-site PEcAn working here. A couple of notes are below. Comments welcome, but this is really just a brain dump / progress report; feel free to wait until I get things cleaned up a bit.

Added `parse.global.settings` to the `settings` package. Currently it handles just the multi-site case, turning a generic settings list (parsed from XML) with a `<runs><run>...` block into a SettingsList with one individual Settings object per `<run>`. The idea is that future use cases would only need to add a little code to this function to create an appropriate SettingsList, and the rest of the workflow could be unchanged.
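To make the idea concrete, here is a hedged sketch (not the actual PEcAn implementation) of how a `parse.global.settings`-style function might split a parsed settings list into one Settings object per `<run>`. All names and the input structure are illustrative, loosely mimicking `XML::xmlToList()` output:

```r
# Illustrative only: split a multi-run settings list into a list of
# per-run Settings objects, each inheriting the non-run-specific settings.
parse.global.settings <- function(settings) {
  # Settings shared by every run (database, meta-analysis, etc.)
  global <- settings[names(settings) != "runs"]
  # One Settings object per <run> block, each combined with the globals
  lapply(settings$runs, function(run) {
    s <- c(global, list(run = run))
    class(s) <- "Settings"
    s
  })
}

# Hypothetical two-site settings list, as parsing the XML might produce it
settings <- list(
  database = list(host = "localhost"),
  runs = list(
    run = list(site = list(id = 772)),
    run = list(site = list(id = 676))
  )
)
settings.list <- parse.global.settings(settings)
length(settings.list)  # 2: one Settings object per run
```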
For the work I'm doing, we need to automate doing similar PEcAn runs at many different sites. As I understand, there's a general interest in building this capability into PEcAn—essentially, having a new master workflow that triggers multiple related workflows in an intelligent way. @mdietze and I have discussed this offline a bit, and I wanted to elicit additional feedback here. I left out a lot of details and this still grew into a pretty big post, so apologies in advance for that...
Settings

I'm planning that we'd still have a single settings XML file for a multi-run workflow. First we'll need to separate out some non-run-specific settings from the `run` block, per GH-212. Then we can encapsulate multiple `run` blocks under a new `runs` (or `runlist`?) tag.

Changes to the workflow
There are some parts of the existing `workflow.R` that only need to be run once, even when initiating a multi-run workflow, including the `library()` and `options()` calls. I think the rest (CONFIG, MODEL, OUTPUT, ENSEMBLE, SENSITIVITY, PDA) is conceptually run-specific.
So, the idea is that the master workflow would get set up, read in the settings file, and do the meta-analysis in about the same way it currently does. Then it would prepare `settings` objects for each run, and loop over these to perform the run-specific steps.

I'm not currently thinking about any grand meta-results-collection/analysis scheme; I just want the workflow to trigger all the runs. But obviously there are cool options for analyzing/displaying results from e.g. multiple sites run in a single workflow.
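The shape of that master workflow might look like the sketch below. The step functions here are placeholder stubs standing in for the real PEcAn modules, not actual PEcAn code:

```r
# Illustrative only: shared steps run once, run-specific steps looped.
run.meta.analysis   <- function(settings) "meta"          # shared step (stub)
run.single.workflow <- function(settings) {               # CONFIG/MODEL/... (stub)
  paste("ran site", settings$run$site$id)
}

master.workflow <- function(settings.list, global.settings) {
  run.meta.analysis(global.settings)          # done once, up front
  lapply(settings.list, run.single.workflow)  # run-specific steps per run
}

# Hypothetical per-run settings for two sites
settings.list <- list(
  list(run = list(site = list(id = 772))),
  list(run = list(site = list(id = 676)))
)
results <- master.workflow(settings.list, global.settings = list())
results[[1]]  # "ran site 772"
```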
Directory structure
I'm assuming the master settings object will specify a main output directory, and then individual workflows will send all results to subdirectories. So I would propose replacing the "run" and "out" directories specified under `host` with a single "workflow.directory" entry. Then we can add a `run.name` tag under `run` to be used for naming a run-specific subdirectory within the workflow dir (e.g. `<workflow.directory>/<run.name>/`); the "run" and "out" directories can go in there.

An alternative is to have master run/out directories (still specified under `host`), and put run-specific subdirs in each. This strikes me as somewhat less future-proof, but it would preserve the ability to keep "out" and "run" on separate drives (which I assume is why there are two directories under `host` rather than just one?).

Turn workflow.R into `workflow(...)`
I was thinking this and other tasks would be made easier by converting the current workflow script into a function. Initially it could simply take a settings object or path to an xml file as an argument, basically like the script does now. But we could also do things like add boolean arguments to turn on/off modules (overriding what's in the settings object—I thought this might be handy for testing). In offline discussions Mike and I had some of the same ideas about ultimately wanting the workflow to be very modular. I thought functionalizing it was a good first step (and had some specific details in mind that would be useful to me for the current work), though he wasn't so sure. Perhaps this belongs in a separate issue, but thought I'd mention it here.
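As a rough sketch of what functionalizing could look like, here is a toy `workflow()` with boolean module toggles that override the settings object. The stage names and bodies are stubs for illustration, not the real workflow.R logic:

```r
# Illustrative only: workflow.R as a function with on/off module arguments.
workflow <- function(settings, do.ensemble = TRUE, do.sensitivity = TRUE) {
  stages <- c("config", "model", "output")   # always-run stages (illustrative)
  if (do.ensemble)    stages <- c(stages, "ensemble")
  if (do.sensitivity) stages <- c(stages, "sensitivity")
  stages  # in the real thing, each stage would actually execute here
}

# Handy for testing: skip the expensive modules regardless of settings
workflow(list(), do.ensemble = FALSE, do.sensitivity = FALSE)
# c("config", "model", "output")
```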
Job management
Finally, since the meta-workflow is going to call potentially many individual workflows and each of those could require potentially many model runs, some thought needs to go into how to manage the jobs. Obviously we don't want to just run the individual workflows sequentially. On the other hand, running them completely in parallel is probably a bad idea too—even if all the model runs are getting farmed out to a cluster where they're handled by a queue, you'd still have a process for each workflow running on the main machine setting up jobs, waiting for them, etc.
Again, maybe a separate issue (apparently related to some tricks of @robkooper's for batching SA/EA runs on geo?), but getting the ball rolling here...
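One simple middle ground between fully sequential and fully parallel is a worker pool capped at a fixed size, which base R's `parallel` package already provides. Below is a sketch under that assumption; `run.single.workflow` is again a stand-in stub, and `max.jobs` is a hypothetical knob:

```r
# Illustrative only: run individual workflows concurrently, but never more
# than max.jobs at once, using a PSOCK cluster from the parallel package.
library(parallel)

run.single.workflow <- function(settings) settings$run$site$id * 10  # stub

# Hypothetical per-run settings for six sites
settings.list <- lapply(1:6, function(i) list(run = list(site = list(id = i))))

max.jobs <- 2
cl <- makeCluster(max.jobs)   # pool of 2 workers caps concurrency
results <- parLapply(cl, settings.list, run.single.workflow)
stopCluster(cl)
unlist(results)  # 10 20 30 40 50 60
```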