ewels / clusterflow

A pipelining tool to automate and standardise bioinformatics analyses on cluster environments.
https://ewels.github.io/clusterflow/
GNU General Public License v3.0

Rewrite job negotiation / module command line handling #47

Closed ewels closed 9 years ago

ewels commented 9 years ago

Currently, each module is queried for each file group three times:

I want to add support to allow the module to predict the time that the job will take (see #45), which will make a fourth. This is starting to take a noticeable amount of time and is very inefficient.

Whilst I think we still need to query each module for each file group (the number of files, the input file sizes and other variables can differ between file groups), we certainly don't need to call the module four times for four variables (maybe more in the future).

Instead, I'd like to rebuild this system with a more robust, scalable design. Aims and requirements:

My suggestion is that we combine the current calls into one, eg:

module.cfmod.pl --mem $TOTAL_MEM --cores $TOTAL_CORES --modules --runfn $runfn

A helper function then parses these, and if any of the 'request' parameters are found, the module returns a hash in some standard format (JSON, YAML, ini, XML etc.) on STDOUT, e.g.:

{
  "cores": 16,
  "memory": "64G",
  "modules": ["bowtie", "samtools"],
  "time": "6:00:00"
}

This can then be interpreted by Cluster Flow for job submission. If none of these command line flags are present, the module will be run in executing mode (unless --help was there).
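A minimal sketch of how such a helper might tell request mode from execution mode, using Getopt::Long from the Perl core. The function name `parse_request`, the `$needs` hash and the file name `example.run` are hypothetical. Treating `--mem`, `--cores` and `--modules` as the 'request' flags is an assumption based on the example call. Serialising the returned hash is left out, since the output format is still open, and so is the predicted time, since no flag for it is given above.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: returns a hashref of requirements when any
# 'request' flag is present, or undef when the module should execute.
# Which flags count as request flags is an assumption (see lead-in).
sub parse_request {
    my ($argv, $needs) = @_;
    my @args = @$argv;    # work on a copy so the caller's list is untouched
    my ($mem, $cores, $modules, $runfn);
    GetOptionsFromArray(
        \@args,
        'mem=s'   => \$mem,        # total memory on offer, e.g. 128G
        'cores=i' => \$cores,      # total cores on offer
        'modules' => \$modules,    # ask which environment modules to load
        'runfn=s' => \$runfn,      # run file name
    ) or die "Could not parse module arguments\n";

    # No request flags at all: the caller should run in executing mode.
    return unless defined($mem) or defined($cores) or $modules;

    my %reply;
    if (defined $cores) {
        # Illustrative policy: never ask for more cores than are on offer.
        $reply{cores} = $cores < $needs->{cores} ? $cores : $needs->{cores};
    }
    $reply{memory}  = $needs->{memory}  if defined $mem;
    $reply{modules} = $needs->{modules} if $modules;
    return \%reply;
}

# What this (hypothetical) module would like to have.
my $needs = { cores => 16, memory => '64G', modules => ['bowtie', 'samtools'] };

my $reply = parse_request(
    [qw(--mem 128G --cores 8 --modules --runfn example.run)], $needs);
print "Answering request for: ", join(', ', sort keys %$reply), "\n";
```

Called with only `--runfn example.run`, the helper returns undef and the module would carry on to its normal execution code.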

My preferred response would be JSON as in the example above. I think it's fairly clear and simple and widely supported. However, Perl doesn't seem to have a JSON parsing module as part of the core distribution and I don't like introducing new dependencies.

@s-andrews, @FelixKrueger, @stu2: Does anyone have any thoughts or suggestions on the above?

Whilst doing this rewrite I'd also make a load of new command line flags for execution time (--runfile, --job_id, --prev_job_id, --cores, --mem, multiple --param) instead of the current positional @ARGV mess (see #29).
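As a rough sketch of what named execution-time flags could look like in a module, again with Getopt::Long. The flag names are the ones listed above. The function name `parse_exec_args` and the example values (`sample.run`, `cf_123`, the `--param` strings) are made up for illustration, and how execution is told apart from a request when `--cores` and `--mem` appear in both is not settled here.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical parser for the proposed execution-time flags.
# Returns a hashref of options, or dies if the arguments cannot be parsed.
sub parse_exec_args {
    my @args = @_;
    my %opt = (param => []);
    GetOptionsFromArray(
        \@args,
        'runfile=s'     => \$opt{runfile},
        'job_id=s'      => \$opt{job_id},
        'prev_job_id=s' => \$opt{prev_job_id},
        'cores=i'       => \$opt{cores},
        'mem=s'         => \$opt{mem},
        'param=s@'      => $opt{param},    # may be given several times
        'help'          => \$opt{help},
    ) or die "Could not parse module arguments\n";
    return \%opt;
}

my $opt = parse_exec_args(
    qw(--runfile sample.run --job_id cf_123 --cores 4 --mem 8G
       --param min_length=20 --param quiet)
);
print "Run file: $opt->{runfile}, params: @{ $opt->{param} }\n";
```

Compared with positional `@ARGV`, the order of the flags no longer matters and optional values such as `--prev_job_id` can simply be left out.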

ewels commented 9 years ago

Apparently a JSON module (JSON::PP) has been bundled with Perl since version 5.13.9.
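For reference, a small sketch of how Cluster Flow could read a module's JSON reply with the bundled JSON::PP. The function name `parse_module_reply` and the `$stdout` string are stand-ins for illustration, not existing Cluster Flow code.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical helper: turn the text a module printed on STDOUT into a hashref.
sub parse_module_reply {
    my ($text) = @_;
    # decode_json dies on malformed input, so trap that and report it.
    my $req = eval { decode_json($text) }
        or die "Module returned invalid JSON: $@";
    return $req;
}

# Stand-in for what a module might print in request mode.
my $stdout = '{"cores": 16, "memory": "64G", '
           . '"modules": ["bowtie", "samtools"], "time": "6:00:00"}';

my $req = parse_module_reply($stdout);
printf "cores=%d memory=%s time=%s modules=%s\n",
    $req->{cores}, $req->{memory}, $req->{time},
    join(',', @{ $req->{modules} });
```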

I guess an alternative is that we use a custom format, much like we do for pipelines.

cores: 16
memory: 64G
modules: bowtie,samtools
time: 6:00:00

Not as nice, but could be easier..
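Parsing this format needs nothing beyond core Perl. A quick sketch, assuming one `key: value` pair per line and a comma-separated `modules` list as in the example. The function name `parse_requirements` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical parser for the "key: value" reply format sketched above.
sub parse_requirements {
    my ($text) = @_;
    my %req;
    for my $line (split /\n/, $text) {
        next unless $line =~ /\S/;    # skip blank lines
        # Split on the first colon only, so a time like 6:00:00 stays intact.
        my ($key, $value) = $line =~ /^\s*(\w+)\s*:\s*(.*?)\s*$/
            or die "Cannot parse line: $line\n";
        # Assumption: only 'modules' holds a comma-separated list.
        $req{$key} = $key eq 'modules' ? [ split /\s*,\s*/, $value ] : $value;
    }
    return \%req;
}

my $req = parse_requirements(<<'END');
cores: 16
memory: 64G
modules: bowtie,samtools
time: 6:00:00
END

print "$_ => ", (ref $req->{$_} ? "@{ $req->{$_} }" : $req->{$_}), "\n"
    for sort keys %$req;
```

The one thing to watch is the colon inside the time value, which is why the pattern anchors on the first colon after the key.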

ewels commented 9 years ago

I made a start with the custom format above on the param_passing branch. Core support is now mostly written I think, just need to re-write all of the modules to work with it... :cold_sweat:

ewels commented 9 years ago

Pull request submitted.

2,669 lines of code added, 2,277 lines deleted.. That escalated quickly.

ewels commented 9 years ago

Now to find the bugs I've introduced..