Rewrite job negotiation / module command line handling

ewels commented 9 years ago

Currently, each module is queried three times for each file group:

- Each file group
  - Each module
    - Number of cores?
    - Amount of memory?
    - Environment modules?

I want to add support to allow the module to predict the time that the job will take (see #45), which will make a fourth. This is starting to take a noticeable amount of time and is very inefficient.

Whilst I think we still need to query each module for each file group (the number of files, the input file sizes and other variables can vary across file groups), we certainly don't need to call the module four times for four variables (maybe more in the future).

Instead, I'd like to rebuild this system using a more robust, scalable approach. Aims and requirements:

- One call to module per file group
- As many request parameters as we like (scalable)
- Return as many key: value pairs as the module likes
  - If we miss a requested variable, use a sensible default
- Can't be language specific (modules can be written in any language)

My suggestion is that we combine the current calls into one, e.g.:

module.cfmod.pl --mem $TOTAL_MEM --cores $TOTAL_CORES --modules --runfn $runfn

A helper function then parses these and, if any of the 'request' parameters are found, the module returns a hash in some standard format (JSON, YAML, ini, XML etc.) on STDOUT, e.g.:

{
  "cores": 16,
  "memory": "64G",
  "modules": ["bowtie", "samtools"],
  "time": "6:00:00"
}

This can then be interpreted by Cluster Flow for job submission. If none of these command line flags are present, the module will run in execution mode (unless --help was given).
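To make the two modes concrete, here is a minimal sketch of the module side in Python (modules can be written in any language, so this is just one possibility). The flag names are taken from the example command above; the function names, the cap of 16 cores and the fixed "64G" are made up for illustration, and treating --runfn as context rather than a request is my assumption.

```python
import argparse
import json


def build_parser():
    # Flag names follow the example command; everything else is hypothetical.
    parser = argparse.ArgumentParser(description="Example module (sketch)")
    parser.add_argument("--mem", help="total memory available, e.g. 128G")
    parser.add_argument("--cores", type=int, help="total cores available")
    parser.add_argument("--modules", action="store_true",
                        help="report required environment modules")
    parser.add_argument("--runfn", help="run file (assumed to be context only)")
    return parser


def negotiate(args):
    """Return the requirements dict if any request flag was given, else None."""
    response = {}
    if args.cores is not None:
        response["cores"] = min(args.cores, 16)  # illustrative cap
    if args.mem is not None:
        response["memory"] = "64G"  # illustrative fixed request
    if args.modules:
        response["modules"] = ["bowtie", "samtools"]
    return response or None


def main(argv=None):
    args = build_parser().parse_args(argv)
    response = negotiate(args)
    if response is None:
        return "execute"  # no request flags: the real work would start here
    print(json.dumps(response))  # one reply on STDOUT for all requests
    return "negotiate"
```

Calling `main(["--cores", "64", "--mem", "128G", "--modules"])` prints a single JSON object, whereas `main([])` falls through to execution mode. argparse handles --help by itself.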

My preferred response would be JSON as in the example above. I think it's fairly clear and simple and widely supported. However, Perl doesn't seem to have a JSON parsing module as part of the core distribution and I don't like introducing new dependencies.
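For the calling side, a rough sketch of how one query per file group could be read, with defaults filling in anything the module leaves out. This is Python rather than the Perl that Cluster Flow itself uses, and the function names and default values are hypothetical.

```python
import json
import subprocess
import sys


def default_requirements():
    # Hypothetical fallbacks for anything the module does not report.
    return {"cores": 1, "memory": "1G", "modules": [], "time": None}


def query_module(command):
    """Run the module once and merge its JSON reply over the defaults."""
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    merged = default_requirements()
    merged.update(json.loads(completed.stdout))
    return merged


if __name__ == "__main__":
    # Stand-in for a real module: a one-liner that prints a partial reply.
    fake_module = [sys.executable, "-c",
                   "print('{\"cores\": 16, \"time\": \"6:00:00\"}')"]
    print(query_module(fake_module))
```

The stand-in module only reports cores and time, so memory and modules come from the defaults.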

@s-andrews, @FelixKrueger, @stu2: Does anyone have any thoughts or suggestions on the above?

Whilst doing this rewrite I'd also make a load of new command line flags for execution time (--runfile, --job_id, --prev_job_id, --cores, --mem, multiple --param) instead of the current positional @ARGV mess (see #29).
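As a sketch of what named execution-time flags could look like from a module's point of view (Python here, though any language works), using the flag names listed above. The `key=value` convention for --param is an assumption of mine, as is the function name.

```python
import argparse


def parse_run_args(argv):
    parser = argparse.ArgumentParser(description="Execution-time flags (sketch)")
    parser.add_argument("--runfile")
    parser.add_argument("--job_id")
    parser.add_argument("--prev_job_id")
    parser.add_argument("--cores", type=int)
    parser.add_argument("--mem")
    # action="append" collects every --param occurrence into a list
    parser.add_argument("--param", action="append")
    args = parser.parse_args(argv)

    # Assumed convention: "--param key=value", or a bare "--param key" as a switch
    params = {}
    for item in args.param or []:
        key, _, value = item.partition("=")
        params[key] = value if value else True
    return args, params
```

For example, `parse_run_args(["--runfile", "sample.run", "--cores", "4", "--param", "min_len=20", "--param", "paired"])` gives `args.cores == 4` and `params == {"min_len": "20", "paired": True}`, with no dependence on argument order.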

ewels commented 9 years ago

Apparently a JSON module (JSON::PP) comes bundled with Perl from version 5.13.9 onwards..

I guess an alternative is that we use a custom format, much like we do for pipelines.

cores: 16
memory: 64G
modules: bowtie,samtools
time: 6:00:00

Not as nice, but could be easier..
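Parsing this needs very little code. A minimal sketch in Python, assuming one `key: value` pair per line, a comma-separated modules list, and made-up defaults for anything missing (the function names are hypothetical):

```python
def default_requirements():
    # Hypothetical fallbacks for keys the module does not return.
    return {"cores": 1, "memory": "1G", "modules": [], "time": None}


def parse_response(text):
    """Parse 'key: value' lines into a dict laid over the defaults."""
    result = default_requirements()
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        # Split on the first colon only, so a time like 6:00:00 stays intact
        key, value = line.split(":", 1)
        key, value = key.strip(), value.strip()
        if key == "cores":
            result[key] = int(value)
        elif key == "modules":
            result[key] = [m.strip() for m in value.split(",") if m.strip()]
        else:
            result[key] = value
    return result
```

The one subtlety is the time value, which itself contains colons, hence splitting on the first colon only.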

ewels commented 9 years ago

I made a start with the custom format above on the param_passing branch. Core support is now mostly written I think, just need to re-write all of the modules to work with it... :cold_sweat:

ewels commented 9 years ago

Pull request submitted.

2,669 lines of code added, 2,277 lines deleted.. That escalated quickly.

ewels commented 9 years ago

Now to find the bugs I've introduced..

ewels / clusterflow

Rewrite job negotiation / module command line handling #47