ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

!!!!run failure after 2 weeks #1368

Open song984888 opened 2 months ago

song984888 commented 2 months ago

Using cactus-v2.8.1, we ran a job with 28 plant genomes. After two weeks of running, it failed with the following error log:

Tree: (((((((((((((((((((Juglans_californica:1.0,Juglans_nigra:1.0)Anc26:1.0,Juglans_mandshurica:1.0)Anc24:1.0,(Cyclocarya_paliurus:1.
Outgroups:
Anc26: ['Juglans_mandshurica', 'Cyclocarya_paliurus']
Anc24: ['Anc25', 'Anc22']
Anc25: ['Juglans_mandshurica', 'Platycarya_strobilacea']
Anc21: ['Anc22', 'Anc20']
Anc22: ['Carya_illinoinensis', 'Juglans_mandshurica']
Anc19: ['Anc20', 'Anc18']
Anc23: ['Carya_illinoinensis', 'Rhoiptelea_chiliantha']
Anc20: ['Anc18', 'Anc22']
Anc17: ['Anc18', 'Rhoiptelea_chiliantha']
Anc18: ['Rhoiptelea_chiliantha', 'Morella_rubrai']
Anc16: ['Rhoiptelea_chiliantha', 'Morella_rubrai']
Anc15: ['Morella_rubrai', 'Anc14']
Anc13: ['Anc14', 'Anc12']
Anc14: ['Morella_rubrai', 'Cucumis_sativus']
Anc11: ['Anc12', 'Cucumis_sativus']
Anc12: ['Cucumis_sativus', 'Fragaria_vesca']
Anc10: ['Cucumis_sativus', 'Fragaria_vesca']
Anc09: ['Fragaria_vesca', 'Glycine_max']
Anc08: ['Glycine_max', 'Arabidopsis_thaliana']
Anc07: ['Arabidopsis_thaliana', 'Vitis_vinifera']
Anc06: ['Vitis_vinifera', 'Anc05']
Anc04: ['Anc05', 'Aquilegia_coerulea']
Anc05: ['Aquilegia_coerulea', 'Vitis_vinifera']
Anc03: ['Aquilegia_coerulea', 'Acorus_gramineus']
Anc02: ['Acorus_gramineus', 'Amborella_trichopoda']
Anc01: ['Amborella_trichopoda']
.........
cactus_consolidated(Anc07): Ran cactus bar (use poa:1), 24849 seconds have elapsed
cactus_consolidated(Anc07): There are 9 layers in the flowers hierarchy
cactus_consolidated(Anc07): In the 0 layer there are 1 flowers in the flowers hierarchy
cactus_consolidated(Anc07): Chose reference event 4: Anc07
cactus_consolidated(Anc07): For flower: 0 we have 31534 nodes for: 469886 ends, 44517 chains, 10 stubs and 180488 blocks
cactus_consolidated(Anc07): Building a matching for 10 stub nodes in the top level problem from 108910 total stubs of which 51728 atta
cactus_consolidated(Anc07): Starting to build the reference for flower 0, with 5 stubs and 31524 chains and 31534 nodes in the flowers
cactus_consolidated(Anc07): The score of the initial solution is 157050826644.340210/16361 out of a max possible 205080751263.141541
cactus_consolidated(Anc07): The score of the solution after permutation sampling is 160144571568.204498/15487 after 10 rounds of greed
cactus_consolidated(Anc07): The score of the final reference solution is 160144571568.204498/15436 after 100 rounds of greedy nudging
cactus_consolidated(Anc07): In the 1 layer there are 176027 flowers in the flowers hierarchy
cactus_consolidated(Anc07): In the 2 layer there are 1001012 flowers in the flowers hierarchy
cactus_consolidated(Anc07): In the 3 layer there are 468867 flowers in the flowers hierarchy
cactus_consolidated(Anc07): In the 4 layer there are 88707 flowers in the flowers hierarchy
cactus_consolidated(Anc07): In the 5 layer there are 10483 flowers in the flowers hierarchy
cactus_consolidated(Anc07): In the 6 layer there are 1056 flowers in the flowers hierarchy
cactus_consolidated(Anc07): In the 7 layer there are 70 flowers in the flowers hierarchy
cactus_consolidated(Anc07): In the 8 layer there are 16 flowers in the flowers hierarchy
cactus_consolidated(Anc07): Ran cactus make reference, 25480 seconds have elapsed
cactus_consolidated(Anc07): Ran cactus make reference bottom up coordinates, 25571 seconds have elapsed
cactus_consolidated(Anc07): Ran cactus make reference top down coordinates, 25579 seconds have elapsed
cactus_consolidated(Anc07): Ran cactus to hal stage, 25598 seconds have elapsed
cactus_consolidated(Anc07): Dumped sequences for hal file, 25608 seconds have elapsed
cactus_consolidated(Anc07): Dumped reference sequences, 25609 seconds have elapsed
cactus_consolidated(Anc07): Cactus consolidated is done!, 25609 seconds have elapsed
2024-04-25 11:37:08.728719: Successfully ran cactus_consolidated(Anc07): "cactus_consolidated --sequences Anc08 /panfs4/gpu/home/songh
Issued job 'clean_jobstore_files' kind-clean_jobstore_files/instance-3iewa2r4 v1 with job batch system ID: 16371 and disk: 2.0 Gi, mem
Got message from job at time 04-25-2024 11:37:18: Ran cactus consolidated okay
Finished toil run with 28 failed jobs.
Failed jobs at end of the run: 'progressive_step' kind-progressive_step/instance-k8o10uwn v2 'progressive_step' kind-progressive_step/
Stopping real-time logging server.
Joining real-time logging server thread.

........
[2024-04-20T02:39:43+0800] [MainThread] [I] [toil-rt] cactus_consolidated(Anc05): Sequence graph statistics after melting:
[2024-04-20T02:39:44+0800] [MainThread] [I] [toil-rt] cactus_consolidated(Anc05): There were 299339 blocks in the sequence graph, represe
[2024-04-20T02:39:44+0800] [MainThread] [I] [toil-rt] cactus_consolidated(Anc05): Block degree stats: min 1, avg 2.507926, median 2, max
[2024-04-20T02:39:44+0800] [MainThread] [I] [toil-rt] cactus_consolidated(Anc05): Block support stats: min 0.000000, avg 0.529050, median
[2024-04-20T02:39:48+0800] [MainThread] [I] [toil-rt] cactus_consolidated(Anc05): Pinch graph component with 99171 nodes and 129426 edges
[2024-04-20T02:39:51+0800] [MainThread] [I] [toil-rt] cactus_consolidated(Anc05): Attaching the sequence to the cactus root 16, header SL
[2024-04-20T02:40:25+0800] [MainThread] [I] [toil-rt] cactus_consolidated(Anc05): Ran cactus caf, 1694 seconds have elapsed
[2024-04-20T02:40:39+0800] [MainThread] [I] [toil-rt] cactus_consolidated(Anc05): Ran extended flowers ready for bar, 1708 seconds have e
[2024-04-20T05:03:15+0800] [MainThread] [I] [toil-rt] cactus_consolidated(Anc05): [SIMDMalloc] posix_memalign fail!
[2024-04-20T05:03:15+0800] [MainThread] [I] [toil-rt] cactus_consolidated(Anc05): Size: 4294967296, Error: ENOMEM
[2024-04-20T05:03:42+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
[2024-04-20T05:03:42+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-RedMaskJob/instance-s
[2024-04-20T05:03:42+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-RedMaskJob/instance-k
[2024-04-20T05:03:42+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-trim_unaligned_sequen
[2024-04-20T05:03:42+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-trim_unaligned_sequen
[2024-04-20T05:03:42+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-trim_unaligned_sequen
Traceback (most recent call last):
  File "/panfs4/gpu/home/songht/miniconda3/lib/python3.8/site-packages/toil/worker.py", line 407, in workerScript
    job._runner(jobGraph=None, jobStore=jobStore, fileStore=fileStore, defer=defer)
  File "/panfs4/gpu/home/songht/miniconda3/lib/python3.8/site-packages/toil/job.py", line 2829, in _runner
    returnValues = self._run(jobGraph=None, fileStore=fileStore)
  File "/panfs4/gpu/home/songht/miniconda3/lib/python3.8/site-packages/toil/job.py", line 2746, in _run
    return self.run(fileStore)
  File "/panfs4/gpu/home/songht/miniconda3/lib/python3.8/site-packages/toil/job.py", line 2974, in run
    rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
  File "/panfs4/gpu/home/songht/miniconda3/lib/python3.8/site-packages/cactus/pipeline/cactus_workflow.py", line 143, in cactus_cons
    messages = cactus_call(check_output=True, returnStdErr=True,
  File "/panfs4/gpu/home/songht/miniconda3/lib/python3.8/site-packages/cactus/shared/common.py", line 906, in cactus_call
    raise RuntimeError("{}Command {} exited {}: {}".format(sigill_msg, call, process.returncode, out))
RuntimeError: cactus_consolidated(Anc05): Command ['cactus_consolidated', '--sequences', 'Solanum_lycopersicum /panfs4/gpu/home/songht/pl
[2024-04-20T05:03:43+0800] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host admin186 <=========

glennhickey commented 2 months ago

Here's the important part of the log

[2024-04-20T05:03:15+0800] [MainThread] [I] [toil-rt] cactus_consolidated(Anc05): [SIMDMalloc] posix_memalign fail!
[2024-04-20T05:03:15+0800] [MainThread] [I] [toil-rt] cactus_consolidated(Anc05): Size: 4294967296, Error: ENOMEM

Looks like it's running out of memory in abPOA. How much memory do you have? If it was running multiple jobs at once, you can run with --restart to try again, with hopefully more memory free.
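
For readers triaging a similar log, the failed allocation can be pulled out programmatically. The following is a minimal sketch, not part of Cactus or Toil: the helper name find_enomem and the regular expression are illustrative assumptions based only on the two log lines quoted above. It shows that the requested size, 4294967296 bytes, is a single 4 GiB allocation.

```python
import re

# Hypothetical pattern, derived only from the log lines quoted in this thread,
# e.g. "cactus_consolidated(Anc05): Size: 4294967296, Error: ENOMEM".
ENOMEM_RE = re.compile(
    r"cactus_consolidated\((?P<node>[^)]+)\): Size: (?P<size>\d+), Error: ENOMEM"
)


def find_enomem(log_lines):
    """Return (node, requested_bytes) for each failed allocation found."""
    hits = []
    for line in log_lines:
        match = ENOMEM_RE.search(line)
        if match:
            hits.append((match.group("node"), int(match.group("size"))))
    return hits


sample = [
    "[2024-04-20T05:03:15+0800] [MainThread] [I] [toil-rt] "
    "cactus_consolidated(Anc05): [SIMDMalloc] posix_memalign fail!",
    "[2024-04-20T05:03:15+0800] [MainThread] [I] [toil-rt] "
    "cactus_consolidated(Anc05): Size: 4294967296, Error: ENOMEM",
]

for node, size in find_enomem(sample):
    # 2**30 bytes per GiB
    print(f"{node}: failed to allocate {size / 2**30:.1f} GiB")
```

A 4 GiB request failing on a large machine suggests that memory was already exhausted by other work at that moment, which is consistent with the suggestion to retry when fewer jobs are competing for RAM.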

song984888 commented 2 months ago

Thank you. We are currently using a machine with 283 GB of memory. I will try restarting on another machine with more memory.
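
Before restarting, it may help to confirm how much memory is actually free on the target machine. The sketch below assumes a Linux host that exposes /proc/meminfo with a MemAvailable field; the function name mem_available_gib is hypothetical and nothing here is part of Cactus or Toil.

```python
import re


def mem_available_gib(meminfo_text):
    """Parse the MemAvailable field (reported in kB) and return it in GiB.

    Returns None if the field is absent, e.g. on very old kernels.
    """
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        return None
    return int(match.group(1)) * 1024 / 2**30


try:
    # Assumption: Linux; other systems do not provide this file.
    with open("/proc/meminfo") as handle:
        available = mem_available_gib(handle.read())
    print(f"MemAvailable: {available} GiB")
except OSError:
    print("/proc/meminfo not readable on this system")
```

If the reported value is far below the machine's total while no alignment job is running, other processes are holding memory, and a restart is likely to hit the same ENOMEM.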