toil with CWL on LSF status

mr-c commented 7 years ago

[x] ~~Must keep --workdir on a non shared filesystem like /tmp until https://github.com/BD2KGenomics/toil/pull/1573 is merged (Might be better from a performance perspective anyway)~~
[x] ~~Make sure to specify --retries 1 or higher so that killed jobs get retried with at least the default memory (from --defaultMemory 10Gi or similar) automatically~~ Nope, specify minimum memory by hand and update the values as jobs fail.
[x] Speaking of memory, add ResourceRequirements with fixed ~~or dynamic~~ ramMin to all tools.
[x] test specifying ResourceRequirements at the Workflow and WorkflowStep levels
[x] toil is experiencing a serialization bug, so don't use format with multiple options for inputs (for now) https://github.com/BD2KGenomics/toil/issues/1692
[x] ~~--preserve-environment takes a space separated list of environment variables to preserve, not a comma separated list as the docs previously reported https://github.com/BD2KGenomics/toil/pull/1689~~
[x] Use the TOIL_LSF_ARGS to specify the queue in your runscript: export TOIL_LSF_ARGS="-q production-rh7" ~~https://github.com/BD2KGenomics/toil/pull/1640~~
[x] ~~There's an error in enumerating jobs in Toil 3.7.0, fix is at https://github.com/BD2KGenomics/toil/pull/1690~~
[x] Toil doesn't have an override for cwltool's strict filename check (https://github.com/BD2KGenomics/toil/issues/1782), so be sure to strip out offending characters such as ":"; example at https://github.com/ProteinsWebTeam/ebi-metagenomics-cwl/commit/767cc8f54cb26ad2b53c544c2e2054d8e7116a26
[x] like most cwltool based CWL executors, you'll be happier if you set a dedicated output directory via --outdir
[x] the CWL output object is written to stdout, so redirect that to a file for posterity (example: cwltoil […] | tee output)
[x] --restart is handy for resuming a previous run, but (due to the lack of cache support while using the LSF batch system) any changes to the CWL descriptions will require a clean start
[x] apparently Toil will "make up" resource requirements on its own (randomly?) for tools without those annotations, so better be safe lest cat gets assigned 16 cores and 100GiB of memory :-)
[x] Toil runs testing using many batch systems (SLURM, Yarn, parasol, mesos, spark, GridEngine), but not LSF -- need to add setup instructions for spinning up an LSF cluster to https://github.com/BD2KGenomics/cgcloud/blob/master/jenkins/src/cgcloud/jenkins/toil_jenkins_slave.py
[ ] Review Globus toolkit's LSF code for inspiration: https://github.com/globus/globus-toolkit/blob/globus_6_branch/gram/jobmanager/lrms/lsf/source/lsf.pm
[ ] how to capture timestamps? they are output to stderr, but not in the on disk log
[ ] how to capture output from LSF?
[x] Is it possible to run the housekeeping jobs on the launcher node and not via cluster submission? (CWLWorkflow, ResolveIndirect, CWLGather, CWLScatter, etc.. ) https://github.com/BD2KGenomics/toil/issues/1783
[x] Restore usage of InitialWorkDirRequirement and confirm
[ ] write up the above lessons learned
[ ] we don't request space in /tmp even though Toil does write there
[x] Migrate Toil's LSF code to use AbstractGridEngineBatchSystem https://github.com/BD2KGenomics/toil/pull/2043
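
Since tools without a ramMin annotation may end up with resource requests nobody chose, it can help to list them before submitting anything. A rough sketch, assuming the tool descriptions are *.cwl files under one directory; the function name list_unannotated is made up for this example. It is a plain text search, so it will not notice a requirement inherited from the Workflow or WorkflowStep level, and it treats any mention of ramMin (even in a comment) as an annotation.

```shell
# Print every *.cwl file under directory $1 that never mentions ramMin.
# grep -L lists the files in which no line matched the pattern.
# list_unannotated is a hypothetical helper name, not part of Toil or cwltool.
list_unannotated() {
  find "$1" -name '*.cwl' -type f -exec grep -L 'ramMin' {} +
}
```

Calling list_unannotated tools/ (tools/ being a placeholder path) prints one file per line; an empty result means every tool mentions ramMin somewhere.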

~~Current working branch with the bulk of the above fixes merged: https://github.com/mr-c/toil/tree/issues/1666-fail-not-on-unsubmitted-jobs~~ Latest Toil release has all of the above mentioned fixes merged
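
Several of the checklist items (queue via TOIL_LSF_ARGS, a dedicated --outdir, keeping the output object with tee) fit into one small runscript. This is only a sketch: workflow.cwl, job.yml, the results directory and the DRY_RUN switch are placeholders invented for this example, and the script defaults to printing the command rather than submitting it.

```shell
#!/bin/sh
# Sketch of a runscript; file names and the DRY_RUN switch are assumptions.
set -eu

# Select the LSF queue, as described in the checklist.
export TOIL_LSF_ARGS="-q production-rh7"

# Dedicated output directory and a file that keeps the CWL output object.
OUTDIR="${OUTDIR:-$PWD/results}"
OUTPUT_FILE="${OUTPUT_FILE:-output}"

# Print the command when DRY_RUN=1 (the default here), run it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# The CWL output object goes to stdout, so tee keeps a copy on disk.
# Note: the exit status of this pipeline is tee's, not cwltoil's.
run cwltoil --batchSystem LSF --outdir "$OUTDIR" workflow.cwl job.yml | tee "$OUTPUT_FILE"
```

Run it once as is to check the printed command line, then with DRY_RUN=0 to actually submit.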

mr-c commented 7 years ago

Note: In cwltoil, sub-workflows must fully complete before any of their outputs are available for use by any other step/job. For example, the go_summary in the functional analysis (IPS) workflow isn't subject to further processing, but its production holds up the availability of the functional_annotations for further processing by the parent workflow.

mr-c commented 7 years ago

To run the CWL conformance tests using cwltoil on LSF:

virtualenv env
source env/bin/activate
pip install -U pip
pip install -U setuptools wheel
pip install .[cwl]
git clone https://github.com/common-workflow-language/common-workflow-language.git
cd common-workflow-language
pip install cwltest
TMP=$PWD ./run_test.sh RUNNER=toil-cwl-runner EXTRA="--batchSystem LSF --logDebug --logFile ${PWD}/log --disableCaching --user-space-docker-cmd=udocker" -j8

(edited to use " double quote instead of single with EXTRA) (edited to set TMP to a path on the shared filesystem, needed for cwltest)

hmenager commented 7 years ago

Note for @mr-c: here's what I got from toil[cwl] running a workflow on a single machine. At some point it complains about disk usage, although I never specified any disk requirements:

ripley 2017-06-08 18:50:26,180 Thread-82 WARNING toil.statsAndLogging: Got message from job at time 06-08-2017 18:50:26: Job used more disk than requested. Please reconsider modifying the user script to avoid the chance  of failure due to incorrectly requested resources. Job 'file:///home/hmenager/ReproHackathon/reprohackathon1/cwl/tools/fastq-dump.cwl' fastq-dump 8/A/job6HSIGE used 128.93% (2.6 GB [2768723968B] used, 2.0 GB [2147483648B] requested) at the end of its run.

The tool itself is defined there: https://github.com/IFB-ElixirFr/ReproHackathon/blob/cwl/reprohackathon1/cwl/tools/fastq-dump.cwl
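
The warning carries exact byte counts, so it can be turned into a number to put in the tool description. A minimal sketch, assuming the message format shown above stays stable; bytes_before and to_mib are helper names made up here. CWL's ResourceRequirement expresses tmpdirMin and outdirMin in mebibytes, hence the conversion; whether Toil derives its disk request from those fields is an assumption worth checking.

```shell
# Extract the byte count that precedes keyword $1 ("used" or "requested")
# in the warning text $2. Both helper names are hypothetical.
bytes_before() {
  printf '%s\n' "$2" | sed -n "s/.*\[\([0-9][0-9]*\)B\] $1.*/\1/p"
}

# Round a byte count up to whole mebibytes (2**20 bytes).
to_mib() {
  echo $(( ($1 + 1048575) / 1048576 ))
}

# Tail of the warning quoted above.
msg='fastq-dump 8/A/job6HSIGE used 128.93% (2.6 GB [2768723968B] used, 2.0 GB [2147483648B] requested) at the end of its run.'

used=$(bytes_before used "$msg")
requested=$(bytes_before requested "$msg")
echo "used: $(to_mib "$used") MiB, requested: $(to_mib "$requested") MiB"
# prints: used: 2641 MiB, requested: 2048 MiB
```

A request somewhat above the 2641 MiB actually used would leave headroom for the next run.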

mr-c commented 7 years ago

Priorities:

[x] correct unit detection https://github.com/BD2KGenomics/toil/issues/1691, see https://github.com/BD2KGenomics/toil/pull/1762
[x] move LSF to leverage the AbstractGridEngineBatchSystem
[x] fix dynamic resource requirements: https://github.com/BD2KGenomics/toil/issues/1647

EBI-Metagenomics / ebi-metagenomics-cwl

toil with CWL on LSF status #57