ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

bottleneck in parallelization? #580

Closed: fabrimafe closed this issue 2 years ago

fabrimafe commented 2 years ago

Hi, I am running the cactus CPU docker v2.0.3 for 12 ~3 Gbp genomes on AWS, on a really big instance with a lot of memory and 90 cores. Cactus seems to parallelize well at the beginning, but before long it gets stuck using very little CPU. Is this an inevitable computational bottleneck in cactus, or can I avoid it?

Here is the command and a bit of the logs:

cactus JobStoreworkloaF_v4 --retryCount 3 input.txt sol_v3.hal --workDir /data/ --maxCores 90 --maxDisk 450G --defaultMemory 350G --logFile sol_v4.log.txt
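
Not part of the original report, but worth noting: 350G is roughly 326 GiB, which matches the 326.0 Gi reservations in the log below. Toil's single-machine scheduler reserves that much for every job that does not declare its own memory, and it will not over-commit the instance's RAM, so only one such job runs at a time regardless of free cores. A hedged sketch of a rebalanced invocation (same inputs as above; the 230G/3.5G values mirror settings tried later in this thread and must be adapted to the instance):

# Hedged sketch, not the poster's exact fix: same job store and inputs as above,
# but with a small per-job default so Toil can co-schedule many jobs, plus an
# explicit cap on total memory below the instance's physical RAM.
cactus JobStoreworkloaF_v4 input.txt sol_v3.hal \
    --workDir /data/ \
    --retryCount 3 \
    --maxCores 90 \
    --maxDisk 450G \
    --maxMemory 230G \
    --defaultMemory 3.5G \
    --logFile sol_v4.log.txt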

....
Issued job 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/instance-8ne5rxpf with job batch system ID: 6191 and cores: 1, disk: 2.0 Gi, and memory: 232.8 Gi
Issued job 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/instance-914pr22c with job batch system ID: 6192 and cores: 1, disk: 2.0 Gi, and memory: 232.8 Gi
1 jobs are running, 4 jobs are issued and waiting to run
1 jobs are running, 4 jobs are issued and waiting to run
1 jobs are running, 4 jobs are issued and waiting to run
1 jobs are running, 4 jobs are issued and waiting to run
......

......
[2021-10-15T12:57:22+0000] [MainThread] [I] [toil.leader] Issued job 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/instance-vgip7c31 with job batch system ID: 6190 and cores: 1, disk: 2.0 Gi, and memory: 326.0 Gi
[2021-10-15T12:57:22+0000] [MainThread] [I] [toil.leader] Issued job 'CactusPafAlign' kind-CactusPafAlign/instance-hbsetb3v with job batch system ID: 6191 and cores: 1, disk: 2.0 Gi, and memory: 3.3 Gi
[2021-10-15T12:57:23+0000] [MainThread] [I] [toil.worker] Redirecting logging to /data/69107f4cb27750ea820cad38d3b6d6a0/f29c/worker_log.txt
[2021-10-15T12:57:23+0000] [MainThread] [I] [toil.worker] Redirecting logging to /data/69107f4cb27750ea820cad38d3b6d6a0/b2f2/worker_log.txt
[2021-10-15T12:57:24+0000] [MainThread] [I] [toil.leader] Issued job 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/instance-5o5o5jov with job batch system ID: 6192 and cores: 1, disk: 2.0 Gi, and memory: 326.0 Gi
[2021-10-15T13:03:56+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 4 jobs are issued and waiting to run
[2021-10-15T14:03:56+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 4 jobs are issued and waiting to run
[2021-10-15T15:03:57+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 4 jobs are issued and waiting to run
[2021-10-15T16:03:57+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 4 jobs are issued and waiting to run
[2021-10-15T17:03:57+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 4 jobs are issued and waiting to run
[2021-10-15T18:03:57+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 4 jobs are issued and waiting to run
[2021-10-15T19:03:58+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 4 jobs are issued and waiting to run
[2021-10-15T20:03:58+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 4 jobs are issued and waiting to run
[2021-10-15T21:03:58+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 4 jobs are issued and waiting to run
[2021-10-15T22:03:58+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 4 jobs are issued and waiting to run
[2021-10-15T23:03:59+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 4 jobs are issued and waiting to run
[2021-10-16T00:03:59+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 4 jobs are issued and waiting to run
[2021-10-16T01:03:59+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 4 jobs are issued and waiting to run
[2021-10-16T02:03:59+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 4 jobs are issued and waiting to run
[2021-10-16T03:03:59+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 4 jobs are issued and waiting to run
[2021-10-16T04:04:00+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 4 jobs are issued and waiting to run
[2021-10-16T05:04:00+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 4 jobs are issued and waiting to run
[2021-10-16T06:04:00+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 4 jobs are issued and waiting to run
[2021-10-16T07:04:00+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 4 jobs are issued and waiting to run
[2021-10-16T08:04:01+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 4 jobs are issued and waiting to run
.......

glennhickey commented 2 years ago

Are you sure you are running from the 2.0.3 Docker? I'm pretty sure that version does not have the "CactusPafAlign" job that's in your log. That's off in a development branch that has never been merged into master.

It's hard to judge parallelism from the log. The cactus-consolidated binary is multithreaded, so one job can still be using all your cores. This one looks to have been going for 20 hours, which is on the slow side, but it should finish eventually.
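
A hedged way to verify this from the shell (the "consolidated" process-name pattern is an assumption about how the binary appears in the process table):

# Find the running cactus-consolidated process, if any (name pattern assumed).
pgrep -af consolidated
# Show per-thread CPU usage for the first matching PID; many busy threads means
# the single Toil job really is using multiple cores.
top -H -p "$(pgrep -f consolidated | head -n1)"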

fabrimafe commented 2 years ago

You are right, this attempt was with an earlier version, thanks. But no, it is not using all my cores; it is literally using just one, I checked. In fact, it has been stalled for days. Well, it finished one of the jobs; three more are still waiting in line.

Am I correct in thinking that this is caused by the fact that it allocates an enormous amount of memory to single jobs (326 Gi)? I have now started a new instance with v2.0.3, specifying a more reasonable --defaultMemory:

cactus /efs/data/JobStoreworkloaF_v203 --retryCount 3 /efs/chromosome_assemblies.solanaceae_genomeA.txt /efs/logs/sol_v3.hal --workDir /efs/data/ --restart --maxCores 85 --maxDisk 250G --maxMemory 230G --defaultMemory 3.5G --logFile /efs/logs/sol_v3.log.txt

Could this fix it?

When I run it, I end up getting an error when cactus arrives at a similar point. The logs are below. My suspicion is that it has to do with some faulty recovery when restarting cactus; in fact, I have had problems restarting jobs on AWS (though restarting works fine when I run cactus locally). Do you have any ideas?

[2021-10-20T17:50:12+0000] [MainThread] [W] [toil.leader] Job failed with exit value 1: 'ProgressiveUp' kind-ProgressiveUp/instance-fv9tk9yx Exit reason: None
[2021-10-20T17:50:12+0000] [MainThread] [W] [toil.leader] The job seems to have left a log file, indicating failure: 'ProgressiveUp' kind-ProgressiveUp/instance-fv9tk9yx
[2021-10-20T17:50:12+0000] [MainThread] [W] [toil.leader] Log from job "kind-ProgressiveUp/instance-fv9tk9yx" follows:
=========>
[2021-10-20T17:50:11+0000] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
[2021-10-20T17:50:11+0000] [MainThread] [I] [toil] Running Toil version 5.4.0-87293d63fa6c76f03bed3adf93414ffee67bf9a7 on host ip-10-216-9-104.eu-west-1.compute.internal.
[2021-10-20T17:50:11+0000] [MainThread] [I] [toil.worker] Working on job 'ProgressiveUp' kind-ProgressiveUp/instance-fv9tk9yx
[2021-10-20T17:50:11+0000] [MainThread] [I] [toil.worker] Loaded body Job('ProgressiveUp' kind-ProgressiveUp/instance-fv9tk9yx) from description 'ProgressiveUp' kind-ProgressiveUp/instance-fv9tk9yx
[2021-10-20T17:50:12+0000] [MainThread] [I] [cactus.progressive.cactus_progressive] Progressive Up: Anc09
[2021-10-20T17:50:12+0000] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
[2021-10-20T17:50:12+0000] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-79238e32f3c541028fbeb8c78974ac87/cactus_progressive_config.xml' to path '/efs/data/46a1d395327053d8bf74822e249e9bf3/4d84/4a41/tmp6vqzjzvs.tmp'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/toil/worker.py", line 393, in workerScript
    job._runner(jobGraph=None, jobStore=jobStore, fileStore=fileStore, defer=defer)
  File "/usr/local/lib/python3.6/dist-packages/cactus/shared/common.py", line 933, in _runner
    fileStore=fileStore, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/toil/job.py", line 2360, in _runner
    returnValues = self._run(jobGraph=None, fileStore=fileStore)
  File "/usr/local/lib/python3.6/dist-packages/toil/job.py", line 2281, in _run
    return self.run(fileStore)
  File "/usr/local/lib/python3.6/dist-packages/cactus/progressive/cactus_progressive.py", line 156, in run
    experimentFile = fileStore.readGlobalFile(self.project.expIDMap[self.event])
  File "/usr/local/lib/python3.6/dist-packages/toil/fileStores/nonCachingFileStore.py", line 91, in readGlobalFile
    self.jobStore.readFile(fileStoreID, localFilePath, symlink=symlink)
  File "/usr/local/lib/python3.6/dist-packages/toil/jobStores/fileJobStore.py", line 440, in readFile
    self._checkJobStoreFileID(jobStoreFileID)
  File "/usr/local/lib/python3.6/dist-packages/toil/jobStores/fileJobStore.py", line 724, in _checkJobStoreFileID
    raise NoSuchFileException(jobStoreFileID)
toil.jobStores.abstractJobStore.NoSuchFileException: File 'files/no-job/file-867280e8c95c44ce8c833ca6c733def7/Anc09_experiment.xml' does not exist.
[2021-10-20T17:50:12+0000] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host ip-10-216-9-104.eu-west-1.compute.internal
.....
Command exited with non-zero status 1
Command being timed: "cactus /efs/data/JobStoreworkloaF_v203 --retryCount 3 /efs/input.txt /efs/logs/sol_v3.hal --workDir /efs/data/ --restart --maxCores 85 --maxDisk 350G --maxMemory 230G --defaultMemory 20G --defaultDisk 20G --logFile /efs/logs/sol_v3.log.txt"
User time (seconds): 24.77
System time (seconds): 40.14
Percent of CPU this job got: 169%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:38.28
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 145100
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 42
Minor (reclaiming a frame) page faults: 802050
Voluntary context switches: 30557
Involuntary context switches: 15631107
Swaps: 0
File system inputs: 10248
File system outputs: 2320
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 1
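
Not part of the original exchange: before retrying --restart, one quick sanity check (paths assumed from the traceback above) is whether the file Toil reports as missing is actually present under the job store, and whether it is a dangling symlink:

# Paths taken from the error above; adjust if the job store lives elsewhere.
find /efs/data/JobStoreworkloaF_v203/files -name 'Anc09_experiment.xml'
# If a path comes back, -L dereferences it, so ls will complain if the file is
# a symlink whose target has gone away.
ls -lL /efs/data/JobStoreworkloaF_v203/files/no-job/file-867280e8c95c44ce8c833ca6c733def7/Anc09_experiment.xml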

glennhickey commented 2 years ago

A few things

fabrimafe commented 2 years ago

Thank you. Actually, I have this problem with restarting both on AWS and on LSF. One question: what do you mean by "to find that file"? For instance, on LSF, if I get "toil.jobStores.abstractJobStore.NoSuchFileException: File 'files/no-job/file-6d2092ff56a949889958bd13eb4ada52/Anc10_experiment.xml' does not exist.", what should I do? The file, in fact, is there:

find . -name 'Anc10_experiment.xml'
./jobStore.v41/files/no-job/file-6d2092ff56a949889958bd13eb4ada52/Anc10_experiment.xml

What should I do with it?

I attach the log ex.v41-735060.txt

I also attach a screenshot of another missing file. The soft links to the files are there, pointing into the tmp directory, but the target files in tmp are not.

[screenshot: corruptedfile]

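A hedged way to confirm the dangling-link observation: GNU find can list every symlink under the job store whose target no longer exists (the job store path is taken from the find output above):

# List dangling symlinks under the job store; -xtype l matches symlinks whose
# targets are missing (GNU find).
find ./jobStore.v41 -xtype l -ls
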
fabrimafe commented 2 years ago

Are you sure you are running from the 2.0.3 Docker? I'm pretty sure that version does not have the "CactusPafAlign" job that's in your log. That's off in a development branch that has never been merged into master.

It's hard to judge parallelism from the log. The cactus-consolidated binary is multithreaded, so one job can still be using all your cores. This one looks to have been going for 20 hours, which is on the slow side, but it should finish eventually.

So, I started everything again from scratch (without --restart), and now it seems to be going well on AWS (still not managing on LSF). However, I have one question. You said that CactusPafAlign should not come up in the docker v2.0.3 image.

I see this, however:

[2021-10-15T12:57:20+0000] [MainThread] [I] [toil.leader] Issued job 'CactusPafAlign' kind-CactusPafAlign/instance-4qeib3j9 with job batch system ID: 6185 and cores: 1, disk: 2.0 Gi, and memory: 3.3 Gi

And if I do docker inspect -s on the same image I am running I see:

"Id": "sha256:48e5f6ac66d6176259467b308f128c501eb07b3edd45fd8ecf324f2478d8143e", "RepoTags": [ "781164228509.dkr.ecr.eu-west-1.amazonaws.com/sc-781164228509-pp-xp5cz7h65l2tq-hpcimages-ofntqkcfdysk:v2.0.3", "quay.io/comparative-genomics-toolkit/cactus:v2.0.3" ], "RepoDigests": [ "781164228509.dkr.ecr.eu-west-1.amazonaws.com/sc-781164228509-pp-xp5cz7h65l2tq-hpcimages-ofntqkcfdysk@sha256:50a59eb051ae1a61710b8de510f9a20f9dfc1ae70e9e5acfdc8f29b972514e90", "quay.io/comparative-genomics-toolkit/cactus@sha256:e3b9e19d611bd06ef04fe72b2b49f558f884605a4869bc83168c387366c7af05" ], "Parent": "", "Comment": "", "Created": "2021-07-22T22:16:38.542249274Z"

glennhickey commented 2 years ago

Very strange. But I do have to insist: If you're seeing "CactusPafAlign" in your log, you are not using v2.0.3. To wit:

docker run -ti --rm  quay.io/comparative-genomics-toolkit/cactus:v2.0.3 bash -c "find / -name "*.py" | xargs grep -i pafalign"
find: '/proc/11/task/11/net': Invalid argument
find: '/proc/11/net': Invalid argument
docker run -ti --rm  quay.io/comparative-genomics-toolkit/cactus:latest bash -c "find / -name "*.py" | xargs grep -i pafalign"
/usr/local/lib/python3.6/dist-packages/cactus/blast/cactus_blast.py:from cactus.pipeline.cactus_workflow import CactusPafAlign #CactusBlastPhase
/usr/local/lib/python3.6/dist-packages/cactus/blast/cactus_blast.py:            outWorkFlowArgs = toil.start(CactusPafAlign(standAlone=True, cactusWorkflowArguments=workFlowArgs, phaseName="blast"))
/usr/local/lib/python3.6/dist-packages/cactus/progressive/cactus_progressive.py:from cactus.pipeline.cactus_workflow import CactusPafAlign
/usr/local/lib/python3.6/dist-packages/cactus/progressive/cactus_progressive.py:        finalExpWrapper = self.addChild(CactusPafAlign(cactusWorkflowArguments=workFlowArgs, phaseName="blast")).rv()
/usr/local/lib/python3.6/dist-packages/cactus/pipeline/cactus_workflow.py:class CactusPafAlign(CactusPhasesJob):
/usr/local/lib/python3.6/dist-packages/cactus/pipeline/cactus_workflow.py:        super(CactusPafAlign, self).__init__(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/cactus/pipeline/cactus_workflow.py:        self.addFollowOn(CactusPafAlign(cactusWorkflowArguments=self.cactusWorkflowArguments, phaseName="blast"))
find: '/proc/11/task/11/net': Invalid argument
find: '/proc/11/net': Invalid argument

Somehow, you must be using cactus:latest or another recent image. This is untested code which you really want to avoid using.
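
A hedged follow-up check on the user's side: run the same grep inside the exact image ID reported by docker inspect earlier in the thread, to see which code that local image actually contains (restricting find to /usr/local/lib simply avoids the /proc noise seen above):

# Image ID taken from the docker inspect output earlier in the thread.
docker run --rm sha256:48e5f6ac66d6176259467b308f128c501eb07b3edd45fd8ecf324f2478d8143e \
    bash -c 'find /usr/local/lib -name "*.py" | xargs grep -il pafalign'
# Any output here means the image contains the unreleased CactusPafAlign code,
# i.e. it is not the published v2.0.3, whatever its tag says.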

On our side, I think we should avoid setting the :latest tag to the most recent commit to go through CI (and perhaps reserve it for the master branch?).

fabrimafe commented 2 years ago

What about the other question, the inability to restart? You say to "find those tmp files" — what do you mean? Since those files are symlinks pointing to locations in tmp that no longer exist, what can one do? Would it help to allow deletion of job files only when a job is successful?

fabrimafe commented 2 years ago

P.S.: maybe you should correct your --help to state explicitly which of linkImports/noLinkImports (and so on) is the default: you currently give the same explanation for both (see below). I assume that noLinkImports and noMoveExports have the correct explanation, while linkImports and moveExports are the defaults, but I might be wrong. Could these options help with the "restart/file not found" issue, since those are broken symlinks?

Also, does "8Ei" in maxDisk stand for Gi?

--linkImports     When using a filesystem based job store, CWL input files are by default symlinked in. Specifying this option instead copies the files into the job store, which may protect them from being modified externally. When not specified and as long as caching is enabled, Toil will protect the file automatically by changing the permissions to read-only.
--noLinkImports   When using a filesystem based job store, CWL input files are by default symlinked in. Specifying this option instead copies the files into the job store, which may protect them from being modified externally. When not specified and as long as caching is enabled, Toil will protect the file automatically by changing the permissions to read-only.
--moveExports     When using a filesystem based job store, output files are by default moved to the output directory, and a symlink to the moved exported file is created at the initial location. Specifying this option instead copies the files into the output directory. Applies to filesystem-based job stores only.
--noMoveExports   When using a filesystem based job store, output files are by default moved to the output directory, and a symlink to the moved exported file is created at the initial location. Specifying this option instead copies the files into the output directory. Applies to filesystem-based job stores only.

fabrimafe commented 2 years ago

I solved both problems in the end. For the bottleneck, it was sufficient to adjust the memory settings. It was made harder by these bugs that prevent restarting on some platforms, but I worked around that using the flags noLinkImports, noMoveExports and clearWorkdir. Thanks, I'm closing the issue.
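
For readers landing here, a hedged reconstruction of what a working invocation might look like, combining the memory settings and Toil flags mentioned above. The work-dir cleanup option is assumed to correspond to Toil's --cleanWorkDir (the comment above says "clearWorkdir"), and the value shown is only illustrative; check cactus --help for the exact spelling and choices in your version.

# Hedged reconstruction, not the poster's verbatim command. Paths and memory
# values are taken from earlier in the thread; --noLinkImports/--noMoveExports
# are the Toil copy-instead-of-symlink options quoted above, and --cleanWorkDir
# (with an illustrative value) is assumed to be the "clearWorkdir" flag mentioned.
cactus /efs/data/JobStoreworkloaF_v203 /efs/input.txt /efs/logs/sol_v3.hal \
    --workDir /efs/data/ \
    --retryCount 3 \
    --maxCores 85 \
    --maxDisk 350G \
    --maxMemory 230G \
    --defaultMemory 3.5G \
    --noLinkImports \
    --noMoveExports \
    --cleanWorkDir onSuccess \
    --logFile /efs/logs/sol_v3.log.txt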