ComparativeGenomicsToolkit / cactus

Official home of a genome aligner based upon the notion of Cactus graphs

Singularity / LSF / Shared TMP #118

Open fbemm opened 4 years ago

fbemm commented 4 years ago

I am trying to run Cactus with Singularity on LSF. It works, but because we have very little /tmp storage, Cactus eventually crashes, mostly during the merging or quality-rescoring steps, due to insufficient disk space.

I am now trying to use a shared tmp. Each node now has a /tmp/cactus symlink that points to a node-specific folder on fast shared storage. The errors I get now look like the following:

kind-logAssemblyStats/instanceoQRLW7    INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
kind-logAssemblyStats/instanceoQRLW7    INFO:toil:Running Toil version 3.22.0a1-91eab2b3a8c29e10db3a35a2ad5053fb462849c3.
kind-logAssemblyStats/instanceoQRLW7    WARNING:toil.resource:'JTRES_a70dbd2ee1bb358d2bc673a5dc0fe069' may exist, but is not yet referenced by the worker (KeyError from os.environ[]).
kind-logAssemblyStats/instanceoQRLW7    WARNING:toil.resource:'JTRES_a70dbd2ee1bb358d2bc673a5dc0fe069' may exist, but is not yet referenced by the worker (KeyError from os.environ[]).
kind-logAssemblyStats/instanceoQRLW7    INFO:cactus.shared.common:Work dirs: set([u'/tmp/cactus/toil-5813a342-8dbd-453d-b834-4bc44b04b8ab-5966802246d6bf3f6656afc558de6230/tmpnayN4c/baa8c9d6-4354-4532-aad7-fe92db1b0885'])
kind-logAssemblyStats/instanceoQRLW7    INFO:cactus.shared.common:Docker work dir: /tmp/cactus/toil-5813a342-8dbd-453d-b834-4bc44b04b8ab-5966802246d6bf3f6656afc558de6230/tmpnayN4c/baa8c9d6-4354-4532-aad7-fe92db1b0885
kind-logAssemblyStats/instanceoQRLW7    INFO:cactus.shared.common:Running the command ['singularity', '--silent', 'run', u'/mnt/scratch/mock/development/test-cactus/v1.jobstore/cactus.img', u'cactus_analyseAssembly', u'tmpmBhYSb.tmp']
kind-logAssemblyStats/instanceoQRLW7    bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
kind-logAssemblyStats/instanceoQRLW7    Running command catchsegv 'cactus_analyseAssembly' 'tmpmBhYSb.tmp'
kind-logAssemblyStats/instanceoQRLW7    cactus_analyseAssembly[0x40e820]
kind-logAssemblyStats/instanceoQRLW7    cactus_analyseAssembly[0x40e764]
kind-logAssemblyStats/instanceoQRLW7    cactus_analyseAssembly[0x401a5a]
kind-logAssemblyStats/instanceoQRLW7    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f5296fe2830]
kind-logAssemblyStats/instanceoQRLW7    cactus_analyseAssembly[0x401a99]
kind-logAssemblyStats/instanceoQRLW7    Could not open input file tmpmBhYSb.tmp: No such file or directory
kind-logAssemblyStats/instanceoQRLW7    Traceback (most recent call last):
kind-logAssemblyStats/instanceoQRLW7      File "/mnt/scratch/mock/envs/cactus/lib/python2.7/site-packages/toil-3.22.0a1-py2.7.egg/toil/worker.py", line 366, in workerScript
kind-logAssemblyStats/instanceoQRLW7        job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore, defer=defer)
kind-logAssemblyStats/instanceoQRLW7      File "/mnt/scratch/mock/envs/cactus/lib/python2.7/site-packages/toil-3.22.0a1-py2.7.egg/toil/job.py", line 1392, in _runner
kind-logAssemblyStats/instanceoQRLW7        returnValues = self._run(jobGraph, fileStore)
kind-logAssemblyStats/instanceoQRLW7      File "/mnt/scratch/mock/envs/cactus/lib/python2.7/site-packages/toil-3.22.0a1-py2.7.egg/toil/job.py", line 1329, in _run
kind-logAssemblyStats/instanceoQRLW7        return self.run(fileStore)
kind-logAssemblyStats/instanceoQRLW7      File "/mnt/scratch/mock/envs/cactus/lib/python2.7/site-packages/toil-3.22.0a1-py2.7.egg/toil/job.py", line 1533, in run
kind-logAssemblyStats/instanceoQRLW7        rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
kind-logAssemblyStats/instanceoQRLW7      File "/mnt/scratch/mock/envs/cactus/lib/python2.7/site-packages/progressiveCactus-1.0-py2.7.egg/cactus/progressive/cactus_progressive.py", line 201, in logAssemblyStats
kind-logAssemblyStats/instanceoQRLW7        analysisString = cactus_call(parameters=["cactus_analyseAssembly", sequenceFile], check_output=True)
kind-logAssemblyStats/instanceoQRLW7      File "/mnt/scratch/mock/envs/cactus/lib/python2.7/site-packages/progressiveCactus-1.0-py2.7.egg/cactus/shared/common.py", line 1182, in cactus_call
kind-logAssemblyStats/instanceoQRLW7        raise RuntimeError("Command %s failed with output: %s" % (call, output))
kind-logAssemblyStats/instanceoQRLW7    RuntimeError: Command ['singularity', '--silent', 'run', u'/mnt/scratch/mock/development/test-cactus/v1.jobstore/cactus.img', u'cactus_analyseAssembly', u'tmpmBhYSb.tmp'] failed with output: 
kind-logAssemblyStats/instanceoQRLW7    ERROR:toil.worker:Exiting the worker because of a failed job on host deei-bio87
kind-logAssemblyStats/instanceoQRLW7    WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'logAssemblyStats' kind-logAssemblyStats/instanceoQRLW7 with ID kind-logAssemblyStats/instanceoQRLW7 to 5

I call Cactus like this:

PFX=$1                   # output prefix
SEQ=$2                   # seqFile passed to cactus
JOB=$PWD/$PFX.jobstore
LOG=$PWD/$PFX.logstore
mkdir -p $LOG
# point the usual temp-dir variables at the node-local /tmp/cactus symlink
export TMP="/tmp/cactus/"
export TMPDIR="/tmp/cactus/"
export TEMP="/tmp/cactus/"

cactus --setEnv TMP --setEnv TMPDIR --setEnv TEMP --writeLogs $LOG --disableCaching --disableChaining --defaultDisk 16G --defaultMemory 16G --maxCores 1024 --batchSystem LSF --latest --binariesMode singularity file:$JOB $SEQ $PFX.hal

The files are all created properly under /tmp/cactus, but Singularity seems to have no access to them. Executing the Singularity command standalone gives the same result.

This is likely caused by Singularity not seeing /tmp/cactus. Would "--bind" fix that problem?

fbemm commented 4 years ago

I just checked. Something like this works:

cd /tmp/cactus/toil-889d8897-ce9d-4a9f-a182-3f3197e96d05-5966802246d6bf3f6656afc558de6230/tmpFZmgW0/9075e680-e7e3-433e-a2a9-aa4f8213afc5
singularity --silent run -B ./:/test cactus.img cactus_analyseAssembly /test/tmpBNzr03.tmp

fbemm commented 4 years ago

Wouldn't it be best to bind the Docker work dir to something other than $HOME?

fbemm commented 4 years ago

The same happens if I use a fully shared /tmp. The Docker work dir is not passed to the container.

fbemm commented 4 years ago

singularity run -B /shared/tmp/toil-036a771b-ea26-41d2-88e6-60c840797b69-40b99ca6183ff7196bb6253b57066c40/tmp7yoPZf/308164b4-3f95-48be-b449-094abc3c4028:/data cactus.img cactus_analyseAssembly /data/tmpLUOfsU.tmp

fbemm commented 4 years ago

The last one works again.

fbemm commented 4 years ago

Singularity is 2.4.2; Cactus and Toil are at their latest commits.

fbemm commented 4 years ago

In shared/common.py, I changed:

base_singularity_call = ["singularity", "--silent", "run", os.environ["CACTUS_SINGULARITY_IMG"]]

To:

if work_dir is None:
    work_dir = os.getcwd()
base_singularity_call = ["singularity", "--silent", "run", "-H", format(os.path.abspath(work_dir)), os.environ["CACTUS_SINGULARITY_IMG"]]

This seems to work now, but I am not sure it is a proper fix.
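
For comparison, a variant that bind-mounts the Toil work dir into the container instead of overriding $HOME (as suggested above) might look roughly like the sketch below. The helper name and structure are illustrative only, not the actual shared/common.py code:

import os

def make_base_singularity_call(work_dir=None):
    # Sketch: build the base singularity invocation, bind-mounting the
    # work dir at the same path inside the container (illustrative only).
    if work_dir is None:
        work_dir = os.getcwd()
    work_dir = os.path.abspath(work_dir)
    return ["singularity", "--silent", "run",
            "-B", "%s:%s" % (work_dir, work_dir),  # -B src:dest, as in the standalone tests above
            os.environ["CACTUS_SINGULARITY_IMG"]]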

fbemm commented 4 years ago

Getting a new error now:

INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
INFO:toil:Running Toil version 3.22.0a1-91eab2b3a8c29e10db3a35a2ad5053fb462849c3.
WARNING:toil.resource:'JTRES_a70dbd2ee1bb358d2bc673a5dc0fe069' may exist, but is not yet referenced by the worker (KeyError from os.environ[]).
Traceback (most recent call last):
  File "/mnt/scratch/mock/envs/cactus/lib/python2.7/site-packages/toil-3.22.0a1-py2.7.egg/toil/worker.py", line 362, in workerScript
    with fileStore.open(job):
  File "/mnt/scratch/mock/envs/cactus/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/mnt/scratch/mock/envs/cactus/lib/python2.7/site-packages/toil-3.22.0a1-py2.7.egg/toil/fileStores/nonCachingFileStore.py", line 69, in open
    self._removeDeadJobs(self.workFlowDir)
  File "/mnt/scratch/mock/envs/cactus/lib/python2.7/site-packages/toil-3.22.0a1-py2.7.egg/toil/fileStores/nonCachingFileStore.py", line 196, in _removeDeadJobs
    if not cls._pidExists(jobState['jobPID']):
  File "/mnt/scratch/mock/envs/cactus/lib/python2.7/site-packages/toil-3.22.0a1-py2.7.egg/toil/fileStores/abstractFileStore.py", line 463, in _pidExists
    os.kill(pid, 0)
OSError: [Errno 1] Operation not permitted
ERROR:toil.worker:Exiting the worker because of a failed job on host deei-bio86
WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'LastzRepeatMaskJob' kind-LastzRepeatMaskJob/instanceZzZphc with ID kind-LastzRepeatMaskJob/instanceZzZphc to 5
WARNING:toil.jobGraph:We have increased the default memory of the failed job 'LastzRepeatMaskJob' kind-LastzRepeatMaskJob/instanceZzZphc to 17179869184 bytes

fbemm commented 4 years ago

Related to --> https://github.com/DataBiosphere/toil/issues/1462

But I am not using a shared tmp currently.

adamnovak commented 4 years ago

@glennhickey did the Singularity support for Cactus, I think; maybe he can comment on why the mounts aren't working with your symlink setup? It could be you're ending up mounting a directory containing a symlink to something outside what is getting mounted.

The PID issue is a separate problem; it isn't #1462, but it could be somewhat related. We use kill(pid, 0) to poll whether a process is still alive, but it turns out this fails if we don't have permission to actually kill the process (for example, because the process did die and its PID was reused by someone else's process). Toil needs a try/except around that polling code; @fbemm, can you report that as a Toil bug?
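
For reference, the kind of guard described above might look roughly like this; it is only a sketch of the idea, not the actual Toil patch, and the function name is illustrative:

import errno
import os

def pid_exists(pid):
    # os.kill(pid, 0) sends no signal; it only checks whether the PID is
    # live and whether we would be permitted to signal it.
    try:
        os.kill(pid, 0)
    except OSError as err:
        if err.errno == errno.ESRCH:   # no such process
            return False
        if err.errno == errno.EPERM:   # PID exists but belongs to another user
            return True
        raise
    return True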