Open fbemm opened 4 years ago
I just checked. Something like this works:
cd /tmp/cactus/toil-889d8897-ce9d-4a9f-a182-3f3197e96d05-5966802246d6bf3f6656afc558de6230/tmpFZmgW0/9075e680-e7e3-433e-a2a9-aa4f8213afc5
singularity --silent run -B ./:/test cactus.img cactus_analyseAssembly /test/tmpBNzr03.tmp
Wouldn't it be best to bind the docker work dir to something different from $HOME?
The same happens if I use a totally shared /tmp. The docker working dir is not passed to the container.
singularity run -B /shared/tmp/toil-036a771b-ea26-41d2-88e6-60c840797b69-40b99ca6183ff7196bb6253b57066c40/tmp7yoPZf/308164b4-3f95-48be-b449-094abc3c4028:/data cactus.img cactus_analyseAssembly /data/tmpLUOfsU.tmp
The last one works again.
Singularity is 2.4.2. Cactus and Toil latest commits.
In:
shared/common.py
I changed:
base_singularity_call = ["singularity", "--silent", "run", os.environ["CACTUS_SINGULARITY_IMG"]]
To:
if work_dir is None:
    work_dir = os.getcwd()
# -H sets the home directory inside the container to the job's work dir
base_singularity_call = ["singularity", "--silent", "run", "-H", os.path.abspath(work_dir), os.environ["CACTUS_SINGULARITY_IMG"]]
Seems to work now but I am not sure that this is a proper fix.
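For comparison, here is a minimal sketch of the alternative raised earlier in the thread: binding the work dir to a fixed path inside the container with -B instead of overriding $HOME with -H. The helper name build_singularity_call and the default container path /data are assumptions made for illustration; this is not how Cactus builds the call.

```python
import os


def build_singularity_call(image, work_dir=None, container_dir="/data"):
    """Build an argv list that binds work_dir to container_dir.

    Hypothetical helper, not part of Cactus. It mirrors the manual
    `singularity run -B <host dir>:/data ...` command from the thread.
    """
    if work_dir is None:
        work_dir = os.getcwd()
    host_dir = os.path.abspath(work_dir)
    return [
        "singularity", "--silent", "run",
        "-B", "{}:{}".format(host_dir, container_dir),
        image,
    ]
```

With this approach, any file argument passed to the tool has to be rewritten relative to the container path (for example /data/tmpLUOfsU.tmp), as in the manual command above.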
Getting a new error now:
INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
INFO:toil:Running Toil version 3.22.0a1-91eab2b3a8c29e10db3a35a2ad5053fb462849c3.
WARNING:toil.resource:'JTRES_a70dbd2ee1bb358d2bc673a5dc0fe069' may exist, but is not yet referenced by the worker (KeyError from os.environ[]).
Traceback (most recent call last):
File "/mnt/scratch/mock/envs/cactus/lib/python2.7/site-packages/toil-3.22.0a1-py2.7.egg/toil/worker.py", line 362, in workerScript
with fileStore.open(job):
File "/mnt/scratch/mock/envs/cactus/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/mnt/scratch/mock/envs/cactus/lib/python2.7/site-packages/toil-3.22.0a1-py2.7.egg/toil/fileStores/nonCachingFileStore.py", line 69, in open
self._removeDeadJobs(self.workFlowDir)
File "/mnt/scratch/mock/envs/cactus/lib/python2.7/site-packages/toil-3.22.0a1-py2.7.egg/toil/fileStores/nonCachingFileStore.py", line 196, in _removeDeadJobs
if not cls._pidExists(jobState['jobPID']):
File "/mnt/scratch/mock/envs/cactus/lib/python2.7/site-packages/toil-3.22.0a1-py2.7.egg/toil/fileStores/abstractFileStore.py", line 463, in _pidExists
os.kill(pid, 0)
OSError: [Errno 1] Operation not permitted
ERROR:toil.worker:Exiting the worker because of a failed job on host deei-bio86
WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'LastzRepeatMaskJob' kind-LastzRepeatMaskJob/instanceZzZphc with ID kind-LastzRepeatMaskJob/instanceZzZphc to 5
WARNING:toil.jobGraph:We have increased the default memory of the failed job 'LastzRepeatMaskJob' kind-LastzRepeatMaskJob/instanceZzZphc to 17179869184 bytes
Related to --> https://github.com/DataBiosphere/toil/issues/1462
But I am not using a shared tmp currently.
@glennhickey did the Singularity support for Cactus, I think; maybe he can comment on why the mounts aren't working with your symlink setup? It could be you're ending up mounting a directory containing a symlink to something outside what is getting mounted.
The PID issue is a separate issue, and it isn't #1462, but it could be a little bit related. We're trying to use kill(pid, 0) to poll whether a process is still alive, but it turns out that this fails if we don't have permission to actually kill the process (for example, because the process did die and its PID was re-used by somebody else's process). Toil needs a try/except around that polling code; @fbemm can you report that as a Toil bug?
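A minimal sketch of what such a try/except could look like, assuming the standard errno meanings for kill(): ESRCH means no process has that PID, and EPERM means a process exists but belongs to another user. The function name pid_exists and the eperm_means_alive parameter are hypothetical; this is not Toil's actual fix.

```python
import errno
import os


def pid_exists(pid, eperm_means_alive=True):
    """Poll for a process with signal 0 without raising on EPERM.

    Hypothetical sketch. Signal 0 performs the existence and
    permission checks but delivers no signal.
    """
    try:
        os.kill(pid, 0)
    except OSError as err:
        if err.errno == errno.ESRCH:
            # No process with this PID exists.
            return False
        if err.errno == errno.EPERM:
            # A process exists, but we may not signal it.
            return eperm_means_alive
        raise
    return True
```

If the PID was re-used by another user's process, the original job is gone, so a caller that only tracks its own jobs might prefer to pass eperm_means_alive=False.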
I am trying to run Cactus using Singularity and LSF. It works, but since we have barely any /tmp storage, Cactus crashes at some point, mostly during the merging or quality rescoring steps, due to insufficient disk space.
I am now trying to use a shared tmp. Each node now has a /tmp/cactus symlink that points to a node-specific folder on a fast shared storage. The errors that I get now look like the following:
I call Cactus like this:
Files are all created properly on '/tmp/cactus', but Singularity seems to have no access to them. I can execute the Singularity command standalone and get the same result.
This is likely caused by Singularity not seeing /tmp/cactus. Would "--bind" fix that problem?
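One way to explore the "--bind" idea is to resolve the symlink on the host and bind its real target at the path the job expects. The sketch below assumes that /tmp/cactus is a symlink and that Singularity can mount at that destination inside the image; bind_spec_for is a hypothetical helper, and whether this resolves the problem was not confirmed in this thread.

```python
import os


def bind_spec_for(path):
    """Return a 'source:dest' string for singularity's -B option.

    Hypothetical helper. The source is the symlink-resolved host
    path; the dest is the original path as the job will see it.
    """
    real = os.path.realpath(path)
    return "{}:{}".format(real, path)
```

The result would be passed as ["-B", bind_spec_for("/tmp/cactus")] in the singularity call, so that the container mounts the shared-storage folder itself rather than a directory containing a dangling link.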