CODARcode / cheetah

CODAR Experiment Harness

refactor savanna, cheetah #95

Open bd4 opened 6 years ago

bd4 commented 6 years ago

Goals

Discussion

The current cheetah code is doing many different tasks in a way that is not cleanly separated:

Savanna-ish:

  1. Define a spec for Codes and the parameters they take.
  2. Define a spec for how Codes are combined into multi-code 'workflows' (is there a better word for this? We aren't trying to support complex workflows here; that should use an existing workflow system). Note that it's common for there to be dependencies between parameters of different codes in a workflow, so this is not cleanly separable from (1).
  3. Take a workflow spec and bind parameter values to it.
  4. Take a bound workflow and add machine-specific environment and configuration.
  5. Generate a script to execute a bound and machine-configured workflow.

Cheetah:

  1. Define a spec for parameter sweeps (this should be based on savanna, rather than include its own Code/Workflow specs).
  2. Generate the experiment hierarchy and run scripts (should be built on top of savanna's single-workflow setup scripts).
  3. (workflows.py) Provide a script for executing sweeps efficiently on a fixed allocation.
  4. Provide post-processing scripts.

A minimal sketch of this split is shown below.
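This is purely illustrative: the class and field names below (`Code`, `Workflow`, `Sweep`) are assumptions about what the separation could look like, not the current codar.savanna or cheetah API.

```python
# Hypothetical illustration of the savanna/cheetah split; these names are
# assumptions, not existing classes in codar.savanna or cheetah.
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Code:
    """Savanna: one application component and the parameters it accepts."""
    name: str
    exe: str
    parameters: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Workflow:
    """Savanna: how Codes are combined; cross-code parameter dependencies
    live at this level rather than inside any single Code."""
    codes: List[Code]


@dataclass
class Sweep:
    """Cheetah: a parameter sweep defined on top of a Workflow, mapping
    '<code>.<param>' names to the list of values to try."""
    workflow: Workflow
    values: Dict[str, List[Any]]  # e.g. {"heat.nx": [64, 128, 256]}
```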

I've started brainstorming some code organization ideas in a branch originally designed to just separate parameter specification from sweep specification:

https://github.com/CODARcode/cheetah/tree/codes-param-def

In particular, see the docstrings in the new codar.savanna package. One of the parts I'm struggling with is when/how to bring in the machine-specific runtime information. It's easy to come up with nice-looking object models that start to break down when you try to execute an actual workflow - lots of messiness between the layers. I think if we get the data structures and concepts right, and allow the layers to modify common data structures, it may work out better.

Ideally we should also support integration with a more complex workflow system, e.g. Parsl, and share machine-specific execution script generation, e.g. with Parsl's libsubmit.

bd4 commented 6 years ago

A more operational description of savanna 'compilation'. As a starting point, assume the user has defined a Workflow instance containing several Code instances that describe the application components and their parameters. The goal of savanna compilation is to produce an object that contains all the information needed to run the workflow on a specific machine: the command line to execute for each code, the environment for each code, and the directory locations for each code/config file/param (a rough sketch of such an object and the passes over it follows the list below). Perhaps 'command line' is even too specific - an alternate compiler component could generate a set of parsl functions instead.

  1. 'augment' user defined Workflow spec with additional 'glue' codes (e.g. dataspaces, stage_write, sosflow)
  2. 'bind' user supplied parameter values for user defined Codes and 'glue' codes
  3. 'machine binding': set additional parameters and environment variables needed to run on a specific supported machine configuration. It may be necessary to load modules and source bash scripts to set up the execution environment, and some of this may be application specific (e.g. need module X on titan, module Y on cori).
  4. 'layout binding': set additional parameters related to node layout and nprocs
  5. 'generate executable': create pbs or slurm script, or parsl/swift script, that can be used to run the workflow.
  6. (optional) Execute directly; it might be useful to support this in addition to separate generate and execute steps.
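To make this concrete, here is a hand-written sketch of how the passes above could all operate on one shared object. Every class and function name here is hypothetical, not existing savanna code.

```python
# Hypothetical compilation pipeline: every pass reads and writes the same
# mutable RunSpec object. None of these functions exist in savanna today.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CodeRun:
    name: str
    argv: List[str] = field(default_factory=list)
    env: Dict[str, str] = field(default_factory=dict)
    working_dir: str = "."


@dataclass
class RunSpec:
    codes: List[CodeRun] = field(default_factory=list)
    machine: str = ""


def augment(spec: RunSpec) -> RunSpec:
    # step 1: add 'glue' codes such as dataspaces, stage_write, sosflow
    return spec


def bind_parameters(spec: RunSpec, values: Dict[str, Dict[str, str]]) -> RunSpec:
    # step 2: translate user-supplied parameter values into argv/env entries
    return spec


def bind_machine(spec: RunSpec, machine: str) -> RunSpec:
    # step 3: modules to load, scripts to source, machine-specific env vars
    spec.machine = machine
    return spec


def bind_layout(spec: RunSpec, nprocs: Dict[str, int]) -> RunSpec:
    # step 4: node layout and process counts per code
    return spec


def generate_script(spec: RunSpec) -> str:
    # step 5: emit a PBS/Slurm (or parsl) launcher for the fully bound spec
    return "\n".join(" ".join(c.argv) for c in spec.codes)


# Example: the passes can run in any order, or multiple times, as long as
# they all share the same RunSpec.
spec = augment(RunSpec(codes=[CodeRun(name="heat", argv=["heat_transfer"])]))
spec = bind_parameters(spec, {"heat": {"nx": "64"}})
spec = bind_machine(spec, "titan")
spec = bind_layout(spec, {"heat": 4})
script = generate_script(spec)
```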

I'm not sure if this is the correct ordering or not - it doesn't matter too much, in the sense that if all the passes use a common object model, there can be multiple passes at different points in the compilation process for each category of binding. The problem is that there can be fairly complex dependencies between parameters, especially with things like SOSflow.

I think for cheetah it's useful to have a concept of uniquely named abstract parameters, but as far as the savanna object model goes, distinguishing between broad categories is useful: command line args, command line options, config files, environment variables, etc.
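For example, the categories could sit under a common, uniquely named abstract parameter along these lines (illustrative only; these are not the current cheetah parameter classes):

```python
# Illustrative only: uniquely named abstract parameters (for cheetah sweeps)
# with savanna-side categories as subclasses. Not the existing parameter API.
from dataclasses import dataclass


@dataclass
class Param:
    name: str      # unique abstract name referenced by sweeps
    code: str      # which code the parameter belongs to


@dataclass
class CmdLineArg(Param):
    position: int  # positional argument index


@dataclass
class CmdLineOption(Param):
    option: str    # e.g. "--nx"


@dataclass
class EnvVar(Param):
    var: str       # environment variable name


@dataclass
class ConfigFileEntry(Param):
    filename: str  # config file the value is written into
    key: str       # key within that file
```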

kshitij-v-mehta commented 6 years ago

This is similar to how Pegasus sets up workflows. Maybe we can look at their design to get some more ideas.

  1. Create an abstract workflow description that describes the codes and how they are run. No real input or machine information yet.
  2. Create Site Catalog (machine description), Transformation Catalog (code exe description for different machines), and Replica Catalog (input data description for different machines) as XML files.
  3. Generate the workflow on a machine using the above files and submit it.

They use HTCondor for interfacing with different schedulers.

kshitij-v-mehta commented 6 years ago

Sections 2, 4, 5 at https://pegasus.isi.edu/documentation/

kshitij-v-mehta commented 6 years ago

In addition to describing codes, Savanna must be able to describe a set of actions on data. For example, it should be able to say 'apply zlib compression to variable T written by the heat application'. Users should be able to add additional operations easily, e.g. 'apply sz compression to variable T written by heat, and then apply zlib compression to the output of sz'.

So, some features of Savanna will have to be designed in a dataflow-centric manner.
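One hedged sketch of what such chained actions could look like (the `DataAction` name and fields are my own, not an existing savanna feature):

```python
# Dataflow-style sketch of chained data actions; assumed names, not an
# existing savanna feature.
from dataclasses import dataclass
from typing import List


@dataclass
class DataAction:
    operation: str  # e.g. "sz" or "zlib"
    variable: str   # e.g. "T"
    producer: str   # code (or prior action) that writes the variable


# "apply sz compression to T written by heat, then zlib to sz's output"
pipeline: List[DataAction] = [
    DataAction(operation="sz", variable="T", producer="heat"),
    DataAction(operation="zlib", variable="T", producer="sz"),
]
```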

mw70 commented 6 years ago

I have been looking through parsl, and it seems to have some of the same issues with dealing with workflows of streaming components. I started poking around, and I'm currently taking a look at some work out of IU on workflow extensions to support both traditional and stream-based data movement in the workflow. The flow-centric nature of SPL was the reason I brought it up earlier -- I agree that it would be good to get something in the registration of the workflow that allows us to be intelligent there.

bd4 commented 6 years ago

Currently ADIOS data actions are modeled as a special parameter type. Is there a problem with that approach? We can still refine the parameter object model without having to change the fundamental approach.

Maybe the 'Code' definition should include a list of ADIOS variables? Then the ADIOS transform parameters could be auto-generated.
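Something along these lines, perhaps (purely illustrative; none of these names exist in the code base today):

```python
# Hypothetical: a Code that declares its ADIOS variables, so transform
# parameters can be derived instead of listed by hand.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AdiosVariable:
    name: str   # e.g. "T"
    group: str  # ADIOS group the variable is written to


@dataclass
class Code:
    name: str
    exe: str
    adios_variables: List[AdiosVariable] = field(default_factory=list)

    def transform_params(self) -> Dict[str, str]:
        # auto-generate one abstract transform parameter per variable,
        # e.g. "heat.T.transform" -> "zlib" / "sz" / "none"
        return {f"{self.name}.{v.name}.transform": "none"
                for v in self.adios_variables}
```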

bd4 commented 6 years ago

Parsl libsubmit has a nice abstraction over scheduler submit/status/cancel: https://github.com/Parsl/libsubmit/blob/master/libsubmit/providers/provider_base.py

Perhaps worth using this instead of bash script templates. One downside is that it adds transitive dependencies on paramiko and on cloud libraries for AWS and Azure (see http://libsubmit.readthedocs.io/en/latest/quick/quickstart.html#requirements). We could potentially create a fork that removes everything we don't need, but that might be more work to maintain.
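For reference, the kind of interface we'd be buying is roughly the following. This is a hand-written sketch only loosely modeled on that provider base class; the actual libsubmit method names and signatures may differ.

```python
# Sketch of a scheduler abstraction in the spirit of libsubmit's provider
# base class; the real interface may differ, treat this as an assumption.
from abc import ABC, abstractmethod
from typing import List


class SchedulerProvider(ABC):
    """What savanna would need from a scheduler backend (Slurm, PBS, ...)."""

    @abstractmethod
    def submit(self, command: str, blocksize: int) -> str:
        """Submit a job and return a job id."""

    @abstractmethod
    def status(self, job_ids: List[str]) -> List[str]:
        """Return a state string for each job id."""

    @abstractmethod
    def cancel(self, job_ids: List[str]) -> List[bool]:
        """Cancel each job; return per-job success flags."""
```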