Open HFzzzzzzz opened 1 year ago
Can you share your whole log? The reason for the crash doesn't seem to be in the bit you posted.
> Can you share your whole log? The reason for the crash doesn't seem to be in the bit you posted.

The complete log is quite large (about 3.4 GB), so I can't attach it here; I have uploaded it to Google Drive: [whole log](https://drive.google.com/file/d/1TKtKnIbKg4uqo2hDJIbYAR---cY5ayUl/view?usp=sharing). The run ended with:

Command exited with non-zero status 1
14784716.00user 226222.44system 281:00:55elapsed 1483%CPU (0avgtext+0avgdata 128830548maxresident)k
121749231008inputs+122305153672outputs (6150964major+22215051697minor)pagefaults 0swaps

Is the failure caused by running out of memory? My machine has 128 GB of memory and 36 cores, and I am aligning 26 maize genomes. If memory is the problem, how many threads should I use? Does cactus support resuming after a failure, so I can continue from where it stopped? Can I specify the number of cores with the --consCores parameter?
> Can you share your whole log? The reason for the crash doesn't seem to be in the bit you posted.

Each genome is about 2.2 GB, and there are 26 in total. With 128 GB of memory, how many cores should I set so that it does not crash? If I switch to a cluster, do you have any suggestions?
I reran with

/usr/bin/time cactus ./js ./examples2/evolverMammals.txt /media/name/disk2/evolverMammals.hal --workDir /media/name/disk2/workDir2 --restart > cactus_err2.log 2>&1

and the following appeared:
For alignment of 1524 query bases, 1460 target bases and 1379 aligned bases trimming 13 bases from each paf end
For alignment of 609 query bases, 659 target bases and 593 aligned bases trimming 5 bases from each paf end
For alignment of 106 query bases, 100 target bases and 100 aligned bases trimming 1 bases from each paf end
Paf chain is done!, 314 seconds have elapsed
[2023-01-04T20:48:06+0800] [MainThread] [W] [root] Deprecated toil method. Please call "logging.getLevelName" directly.
[2023-01-04T20:48:06+0800] [MainThread] [I] [toil-rt] 2023-01-04 20:48:06.604244: Running the command: "paf_tile -i /media/zyh/disk2/workDir2/6cd9b813a70f5b6394558d9dda047e04/2d4b/24c0/tmpvbltutsx.tmp --logLevel DEBUG"
[2023-01-04T20:59:47+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
[2023-01-04T20:59:47+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-combine_chunks/instance-igjfycut/file-5980444c48ef4b0a9819b2ae224e3cf5/tmpjqax1ayr.tmp' to path '/media/zyh/disk2/workDir2/6cd9b813a70f5b6394558d9dda047e04/2d4b/24c0/tmpb6rqse_6.tmp'
[2023-01-04T20:59:47+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-make_ingroup_to_outgroup_alignments_3/instance-haq4pzhw/file-5dfbc0fe3596440bbe7d30e7347912d7/tmpqn43g3b8.tmp' to path '/media/zyh/disk2/workDir2/6cd9b813a70f5b6394558d9dda047e04/2d4b/24c0/tmpxj0qtayv.tmp'
[2023-01-04T20:59:47+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-make_ingroup_to_outgroup_alignments_3/instance-yl6un7tv/file-a439ec11fd43420b81bcb813a2ce765d/tmpmhuuypt6.tmp' to path '/media/zyh/disk2/workDir2/6cd9b813a70f5b6394558d9dda047e04/2d4b/24c0/tmp_4twd_f2.tmp'
[2023-01-04T20:59:48+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] LOG-TO-MASTER: Job used more disk than requested. For CWL, consider increasing the outdirMin requirement, otherwise, consider increasing the disk requirement. Job 'chain_alignments' kind-chain_alignments/instance-k7ggcokj v9 used 171.36% disk (27.2 GiB [29230661632B] used, 15.9 GiB [17057600768B] requested).
[2023-01-04T21:00:07+0800] [MainThread] [D] [toil.deferred] Running own deferred functions
[2023-01-04T21:00:07+0800] [MainThread] [D] [toil.deferred] Out of deferred functions!
[2023-01-04T21:00:07+0800] [MainThread] [D] [toil.deferred] Running orphaned deferred functions
[2023-01-04T21:00:07+0800] [MainThread] [D] [toil.deferred] Ran orphaned deferred functions from 0 abandoned state files
Traceback (most recent call last):
File "/media/zyh/disk2/cactus/cactus_env/lib/python3.10/site-packages/toil/worker.py", line 407, in workerScript
job._runner(jobGraph=None, jobStore=jobStore, fileStore=fileStore, defer=defer)
File "/media/zyh/disk2/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 2406, in _runner
returnValues = self._run(jobGraph=None, fileStore=fileStore)
File "/media/zyh/disk2/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 2324, in _run
return self.run(fileStore)
File "/media/zyh/disk2/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 2547, in run
rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
File "/media/zyh/disk2/cactus/cactus_env/lib/python3.10/site-packages/cactus/paf/local_alignment.py", line 245, in chain_alignments
messages = cactus_call(parameters=['paf_tile', "-i", chained_alignment_file, "--logLevel", getLogLevelString()],
File "/media/zyh/disk2/cactus/cactus_env/lib/python3.10/site-packages/cactus/shared/common.py", line 816, in cactus_call
raise RuntimeError("{}Command {} signaled {}: {}".format(sigill_msg, call, signal.Signals(-process.returncode).name, out))
RuntimeError: Command ['paf_tile', '-i', '/media/zyh/disk2/workDir2/6cd9b813a70f5b6394558d9dda047e04/2d4b/24c0/tmpvbltutsx.tmp', '--logLevel', 'DEBUG'] signaled SIGKILL: stdout=None, stderr=Input file string : /media/zyh/disk2/workDir2/6cd9b813a70f5b6394558d9dda047e04/2d4b/24c0/tmpvbltutsx.tmp
Output file string : (null)
[2023-01-04T21:00:07+0800] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host gene2
<=========
[2023-01-04T21:00:12+0800] [MainThread] [W] [toil.job] Due to failure we are reducing the remaining try count of job 'chain_alignments' kind-chain_alignments/instance-k7ggcokj v10 with ID kind-chain_alignments/instance-k7ggcokj to 0
[2023-01-04T21:00:12+0800] [MainThread] [W] [toil.leader] Job 'chain_alignments' kind-chain_alignments/instance-k7ggcokj v11 is completely failed
[2023-01-04T21:03:15+0800] [MainThread] [I] [toil-rt] 2023-01-04 21:03:15.142117: Successfully ran: "paf_chain -i /media/zyh/disk2/workDir2/6cd9b813a70f5b6394558d9dda047e04/3290/59ec/tmpezaoz_1b.tmp --maxGapLength 1000000 --chainGapOpen 5000 --chainGapExtend 1 --trimFraction 0.02 --logLevel DEBUG" in 442.6981 seconds
[2023-01-04T21:03:21+0800] [MainThread] [I] [toil-rt] 2023-01-04 21:03:21.092439: Running the command: "cat /media/zyh/disk2/workDir2/6cd9b813a70f5b6394558d9dda047e04/3290/59ec/tmpauirzuj0.tmp"
[2023-01-04T21:03:41+0800] [MainThread] [I] [toil-rt] 2023-01-04 21:03:41.663339: Successfully ran: "cat /media/zyh/disk2/workDir2/6cd9b813a70f5b6394558d9dda047e04/3290/59ec/tmpauirzuj0.tmp" in 20.5693 seconds
[2023-01-04T21:03:41+0800] [MainThread] [I] [toil-rt] 2023-01-04 21:03:41.663925: Running the command: "paf_invert -i /media/zyh/disk2/workDir2/6cd9b813a70f5b6394558d9dda047e04/3290/59ec/tmpauirzuj0.tmp"
[2023-01-04T21:04:38+0800] [MainThread] [I] [toil-rt] 2023-01-04 21:04:38.996683: Successfully ran: "paf_invert -i /media/zyh/disk2/workDir2/6cd9b813a70f5b6394558d9dda047e04/3290/59ec/tmpauirzuj0.tmp" in 57.3203 seconds
[2023-01-04T21:04:38+0800] [MainThread] [I] [toil-rt] 2023-01-04 21:04:38.997902: Running the command: "paf_chain -i /media/zyh/disk2/workDir2/6cd9b813a70f5b6394558d9dda047e04/3290/59ec/tmpzyb8ej87.tmp --maxGapLength 1000000 --chainGapOpen 5000 --chainGapExtend 1 --trimFraction 0.02 --logLevel DEBUG"
At this point cactus has not stopped, but the log is already reporting errors. What is causing this?
Hi, if paf_tile is failing, then it is most likely running out of memory as you say. I suspect a situation similar to #845 where insufficiently masked genomes are leading to too many pairwise lastz alignments which is in turn swamping the paf handling code.
How did you softmask your input genomes?
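For reference, a quick way to estimate how much of a FASTA is already softmasked is to count the lowercase bases (a rough sketch, not part of cactus; `genome.fa` is a placeholder path):

```sh
# Rough sketch: estimate the softmasked fraction of a FASTA by counting
# lowercase a/c/g/t bases. Assumes a plain, uncompressed FASTA; genome.fa
# is a placeholder file name.
grep -v '^>' genome.fa \
  | awk '{ total += length($0); gsub(/[^acgt]/, ""); masked += length($0) }
         END { printf "softmasked: %.1f%% (%.0f of %.0f bases)\n", 100*masked/total, masked, total }'
```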
> Hi, if paf_tile is failing, then it is most likely running out of memory as you say. I suspect a situation similar to #845 where insufficiently masked genomes are leading to too many pairwise lastz alignments which is in turn swamping the paf handling code. How did you softmask your input genomes?

I did not softmask the genomes; I assumed cactus would do that step itself. I only extracted the non-scaffold DNA sequences. After reading your reply and #413, I tried --maxServiceJobs 1 and that run is going now; I wonder whether it will solve the problem. Do I need to softmask the genomes again before using cactus? My original command was

/usr/bin/time cactus ./js ./examples2/evolverMammals.txt /media/name/disk2/evolverMammals.hal --workDir /media/name/disk2/workDir2 > cactus_err2.log 2>&1

and I am now trying

/usr/bin/time cactus ./js ./examples2/evolverMammals.txt /media/name/disk2/evolverMammals.hal --workDir /media/name/disk2/workDir2 --maxServiceJobs 1 --restart > cactus_err2.log 2>&1

What should I do to solve this problem?
> Hi, if paf_tile is failing, then it is most likely running out of memory as you say. I suspect a situation similar to #845 where insufficiently masked genomes are leading to too many pairwise lastz alignments which is in turn swamping the paf handling code. How did you softmask your input genomes?

[2023-01-05T00:09:49+0800] [MainThread] [D] [toil.deferred] Ran orphaned deferred functions from 0 abandoned state files
Traceback (most recent call last):
File "/media/zyh/disk2/cactus/cactus_env/lib/python3.10/site-packages/toil/worker.py", line 407, in workerScript
job._runner(jobGraph=None, jobStore=jobStore, fileStore=fileStore, defer=defer)
File "/media/zyh/disk2/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 2406, in _runner
returnValues = self._run(jobGraph=None, fileStore=fileStore)
File "/media/zyh/disk2/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 2324, in _run
return self.run(fileStore)
File "/media/zyh/disk2/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 2547, in run
rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
File "/media/zyh/disk2/cactus/cactus_env/lib/python3.10/site-packages/cactus/paf/local_alignment.py", line 245, in chain_alignments
messages = cactus_call(parameters=['paf_tile', "-i", chained_alignment_file, "--logLevel", getLogLevelString()],
File "/media/zyh/disk2/cactus/cactus_env/lib/python3.10/site-packages/cactus/shared/common.py", line 816, in cactus_call
raise RuntimeError("{}Command {} signaled {}: {}".format(sigill_msg, call, signal.Signals(-process.returncode).name, out))
RuntimeError: Command ['paf_tile', '-i', '/media/zyh/disk2/workDir2/6cd9b813a70f5b6394558d9dda047e04/6346/5e59/tmp5j32hxwv.tmp', '--logLevel', 'DEBUG'] signaled SIGKILL: stdout=None, stderr=Input file string : /media/zyh/disk2/workDir2/6cd9b813a70f5b6394558d9dda047e04/6346/5e59/tmp5j32hxwv.tmp
Output file string : (null)

Using --maxServiceJobs 1, it still fails.
Yes, you need to softmask your genome before running cactus, even though cactus does its own additional softmasking in the preprocessor.
This is a requirement I'd like to get rid of, but that won't happen for at least a few months.
The best way (for many species) is to use RepeatMasker. Genomes downloaded from UCSC are usually masked with RepeatMasker.
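For example, a typical RepeatMasker softmasking run might look like this (a sketch only; the species, thread count, and paths are placeholders, and for maize a custom RepeatModeler library passed with -lib is often preferred over -species):

```sh
# Sketch only: softmask a genome with RepeatMasker.
#  -xsmall   write repeats as lowercase (softmask) instead of replacing them with N
#  -pa       number of parallel batch jobs (placeholder value)
#  -species  repeat library selection (placeholder; a custom -lib built with
#            RepeatModeler is common for plant genomes)
RepeatMasker -pa 8 -xsmall -species maize -dir masked_out genome.fa
```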
> Yes, you need to softmask your genome before running cactus, even though cactus does its own additional softmasking in the preprocessor. This is a requirement I'd like to get rid of, but that won't happen for at least a few months. The best way (for many species) is to use RepeatMasker. Genomes downloaded from UCSC are usually masked with RepeatMasker.

Thank you very much for the guidance. I will softmask the genomes first and then run cactus. After softmasking, do I still need to specify --maxServiceJobs 1 or other options to reduce memory use, or set --defaultMemory?
> Yes, you need to softmask your genome before running cactus, even though cactus does its own additional softmasking in the preprocessor. This is a requirement I'd like to get rid of, but that won't happen for at least a few months. The best way (for many species) is to use RepeatMasker. Genomes downloaded from UCSC are usually masked with RepeatMasker.

Can you give me some advice on running cactus with less memory? Is --maxMemory effective? On a machine with 128 GB of memory, would specifying --maxMemory 90G, together with softmasking before the run, be feasible? I used --maxServiceJobs 1, but judging by the threads it spawns, it has not reduced the concurrency so far. I hope you can give me an answer.
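For what it's worth, the run described above would look roughly like this (a sketch only, reusing the paths from the earlier commands; --maxMemory and --maxCores are Toil options that cap what the single-machine batch system schedules, and the values shown are examples rather than recommendations):

```sh
# Sketch only: restart the earlier run while capping Toil's resource usage.
# Paths reuse the commands shown above; 90G and 16 are example values.
/usr/bin/time cactus ./js ./examples2/evolverMammals.txt \
    /media/name/disk2/evolverMammals.hal \
    --workDir /media/name/disk2/workDir2 \
    --maxMemory 90G --maxCores 16 \
    --restart > cactus_err2.log 2>&1
```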
<=========
[2022-12-25T01:17:03+0800] [MainThread] [W] [toil.job] Due to failure we are reducing the remaining try count of job 'chain_alignments' kind-chain_alignments/instance-etrn2p3 v2 with ID kind-chain_alignments/instance-etrn2p3 to 1
[2022-12-25T01:17:03+0800] [MainThread] [D] [toil.job] New job version: 'chain_alignments' kind-chain_alignments/instance-etrn2p3 v3
[2022-12-25T01:17:03+0800] [MainThread] [D] [toil.bus] processFinishedJob sent: JobUpdatedMessage(job_id='kind-chain_alignments/instance-etrn2p3', result_status=1)
[2022-12-25T01:17:03+0800] [MainThread] [D] [toil.leader] Added job: 'chain_alignments' kind-chain_alignments/instance-etrn2p3 v3 to updated jobs
[2022-12-25T01:17:03+0800] [MainThread] [D] [toil.leader] Built the jobs list, currently have 1 jobs to update and 11 jobs issued
[2022-12-25T01:17:03+0800] [MainThread] [D] [toil.leader] Updating status of job 'chain_alignments' kind-chain_alignments/instance-etrn2p3 v3 with result status: 1
[2022-12-25T01:17:03+0800] [MainThread] [D] [toil.batchSystems.singleMachine] Issuing the command: _toil_worker chain_alignments file:/media/zyh/disk2/cactus/js kind-chain_alignments/instance-etrn2p3 with memory: 2147483648, cores: 1.0, disk: 17047442240
[2022-12-25T01:17:03+0800] [MainThread] [I] [toil.leader] Issued job 'chain_alignments' kind-chain_alignments/instance-etrn2p3 v3 with job batch system ID: 217103 and cores: 1, disk: 15.9 Gi, and memory: 2.0 Gi
[2022-12-25T01:17:03+0800] [Thread-1 (daddy)] [D] [toil.batchSystems.singleMachine] Launched job 217103 as child 3027969
[2022-12-25T01:17:06+0800] [MainThread] [D] [toil.statsAndLogging] Suppressing the following loggers: {'setuptools', 'websocket', 'urllib3', 'boto3', 'charset_normalizer', 'docker', 'botocore', 'pkg_resources', 'dill', 'requests', 'bcdocs', 'boto'}
[2022-12-25T01:17:07+0800] [MainThread] [D] [toil.common] Obtained node ID b0d275fb46fa48c6820be57edaa22cf5 from file /var/lib/dbus/machine-id
[2022-12-25T01:17:07+0800] [MainThread] [I] [toil.worker] Redirecting logging to /media/zyh/disk2/cactus/workDir/6cd9b813a70f5b6394558d9dda047e04/3f30/worker_log.txt
[2022-12-25T01:17:17+0800] [MainThread] [I] [toil-rt] 2022-12-25 01:17:17.351799: Running the command: "cat /media/zyh/disk2/cactus/workDir/6cd9b813a70f5b6394558d9dda047e04/3f30/9381/tmp7vry5s22.tmp"
[2022-12-25T01:17:56+0800] [MainThread] [D] [toil.job] New job version: 'chain_alignments' kind-chain_alignments/instance-fq5bffid v2
[2022-12-25T01:18:00+0800] [MainThread] [D] [toil.deferred] Removing own state file /var/run/user/1000/toil/6cd9b813a70f5b6394558d9dda047e04/deferred/func5l9k8412
[2022-12-25T01:18:05+0800] [Thread-1 (daddy)] [E] [toil.batchSystems.singleMachine] Got exit code 1 (indicating failure) from job _toil_worker chain_alignments file:/media/zyh/disk2/cactus/js kind-chain_alignments/instance-fq5bffid.
[2022-12-25T01:18:05+0800] [Thread-1 (daddy)] [D] [toil.batchSystems.singleMachine] Child 3027657 for job 217086 succeeded
[2022-12-25T01:18:05+0800] [MainThread] [D] [toil.batchSystems.singleMachine] Ran jobID: 217086 with exit value: 1
[2022-12-25T01:18:05+0800] [MainThread] [W] [toil.leader] Job failed with exit value 1: 'chain_alignments' kind-chain_alignments/instance-fq5bffid v1 Exit reason: None
[2022-12-25T01:18:05+0800] [MainThread] [D] [toil.leader] Job 'chain_alignments' kind-chain_alignments/instance-fq5bffid v1 continues to exist (i.e. has more to do)
[2022-12-25T01:18:05+0800] [MainThread] [W] [toil.leader] The job seems to have left a log file, indicating failure: 'chain_alignments' kind-chain_alignments/instance-fq5bffid v2
[2022-12-25T01:18:05+0800] [MainThread] [W] [toil.leader] Log from job "kind-chain_alignments/instance-fq5bffid" follows:
=========>
paf end
For alignment of 150 query bases, 152 target bases and 150 aligned bases trimming 1 bases from each paf end
For alignment of 566 query bases, 566 target bases and 566 aligned bases trimming 5 bases from each paf end
For alignment of 203 query bases, 203 target bases and 203 aligned bases trimming 2 bases from each paf end
For alignment of 135 query bases, 135 target bases and 135 aligned bases trimming 1 bases from each paf end
For alignment of 690 query bases, 696 target bases and 679 aligned bases trimming 6 bases from each paf end
For alignment of 238 query bases, 254 target bases and 228 aligned bases trimming 2 bases from each paf end
I am running a multi-genome alignment of plant genomes and this problem occurs; in the end no HAL file is produced. What is the reason? This failure appears many times near the end of the run log.
My command is: cactus ./js ./examples2/evolverMammals.txt /media/name/disk2/evolverMammals.hal --workDir /media/name/disk2/workDir2
Do I need to create the evolverMammals.hal file in advance under /media/name/disk2, or will it be generated automatically? Could this path be the reason that no HAL file is produced?