DessimozLab / FastOMA

FastOMA is a scalable software package to infer orthology relationships.
Mozilla Public License 2.0

Recursion limit reached #31

Open · Simarpreet-Kaur-Bhurji opened this issue 3 months ago

Simarpreet-Kaur-Bhurji commented 3 months ago

Hello, while running FastOMA on 2200 species, I encountered another mafft segmentation fault. When I resumed the Nextflow run, it no longer complained about the segmentation fault, but I then got a recursion-limit-reached error. Please find the log file of the run attached. Do you know what's going on?

recurrsion_depth_err.log

alpae commented 3 months ago

Hi @Simarpreet-Kaur-Bhurji,

looks like this happened during the communication between different threads; I am not sure what goes on there exactly. Could you share with us the whole work folder of that failing step (/hps/nobackup/flicek/ensembl/compara/sbhurji/Development/fastoma_run/work/18/6d8a8694445b6226830f618af3bf2f), including the data for the roothog D0138574? Probably something like this should work:

$ cd /hps/nobackup/flicek/ensembl/compara/sbhurji/Development/fastoma_run/work/18/6d8a8694445b6226830f618af3bf2f
$ tar -cvzhf dump.tgz .

Simarpreet-Kaur-Bhurji commented 3 months ago

Hi Adrian, thank you for getting in touch. I kept the log message, but I have unfortunately deleted the work directory in anticipation of the rerun. In the meantime I will rerun it and let you know if I hit this issue again.

sinamajidian commented 3 months ago

No worries. For the future, it would also be helpful to know whether the task ran out of memory or not. I have seen cases where a segmentation fault happened due to insufficient memory. By default, FastOMA retries a failed task three times, increasing the memory allocated to the Slurm job each time.

To check the Slurm job, go to the relevant work folder and find its job name, then look up its job ID (e.g. with sacct); the job ID can then be used with seff to see whether the task ran out of memory (please see the end of this wiki for an example).

$ head -n2 .command.run 
#!/bin/bash
#SBATCH -J nf-hog_rest_(25)
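
For example, with the job name shown above and a hypothetical job ID 12345678 (both are placeholders; use the values from your own run):

$ sacct --name="nf-hog_rest_(25)" --format=JobID,JobName,ReqMem,MaxRSS,State
$ seff 12345678
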
Simarpreet-Kaur-Bhurji commented 3 months ago

Thank you, that is helpful. I will check it for this run.

Simarpreet-Kaur-Bhurji commented 2 months ago

Hi Sina, I have run the pipeline again. I hit the mafft segmentation fault and bumped the memory based on your previous suggestion. After that it seemed to be running for 2 days, and now it has failed again with the recursion-limit-reached error. Please find the work folder attached. Let me know if you need any other details. Thank you. dump.tgz

sinamajidian commented 2 months ago

Hi Simarpreet, the fix was on another branch, and I think you ran the same (old) code. Adrian has just updated the main branch, so the latest code shouldn't hit the recursion limit. To save time/computation, you can run only this rootHOG (using the .command.sh) to see whether the problem is solved, e.g. as sketched below.
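
Something along these lines should do it (the work-directory path is a placeholder for the failing task's folder in your run):

$ cd /path/to/fastoma_run/work/xx/xxxxxxxx   # the failing task's work directory
$ bash .command.sh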

Simarpreet-Kaur-Bhurji commented 2 months ago

Hi Sina, I pulled the latest changes and reran, but I still got the recursion-limit-reached error. Do you think it has to do with the data, given that Triticum aestivum is usually troublesome because of its size? PFA the work dir herewith. wheat_roothog_dir.tgz

alpae commented 2 months ago

Hi @Simarpreet-Kaur-Bhurji ,

I've uploaded a fix for this issue (hopefully this time for real). You could try it by updating the repo to the dev branch and submitting the .command.run from the failing work directory. If you use containers, you should ensure that the image dessimozlab/fastoma:sha-1aa97b8 is used, e.g. docker pull dessimozlab/fastoma:sha-1aa97b8. Please let us know if this fixes your issue.
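
For example (assuming the FastOMA repo was cloned into ./FastOMA; adjust paths to your setup):

$ cd FastOMA && git fetch origin && git checkout dev && git pull
$ docker pull dessimozlab/fastoma:sha-1aa97b8
$ cd /path/to/failing/work/dir && sbatch .command.run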

Simarpreet-Kaur-Bhurji commented 2 months ago

Hey Adrian, thank you for looking into this. At the moment our servers are under scheduled maintenance; I will let you know if this fixes it. Thank you. I would request keeping this issue open until then.

sinamajidian commented 2 months ago

Btw, if you share the fasta file of the rootHOG (inside the folder fastoma_run/work/30/45bab08427770d06e1b9e5f1f5d282/rhogs_big/58) with us, I can run on it and make sure the issue is resolved.

Simarpreet-Kaur-Bhurji commented 2 months ago

Hi Sina, sure thing, thank you for helping with this. PFA the fasta file herewith. Just to let you know, I have also rerun the pipeline at my end, but it will be a while until it reaches that step, so it would be great if you could check whether the issue is resolved. HOG_D0138574.fa.gz

sinamajidian commented 2 months ago

Thanks. Yes, it finished successfully on our cluster. Hope it will be smooth on your side.

2024-08-29 04:25:47 DEBUG    Inferring subHOGs for batch of 1 rootHOGs started.
2024-08-29 04:25:48 INFO     number of proteins in the rHOG is 20269.
2024-08-29 04:25:48 INFO     Number of unique species in rHOG D0138574 is 18.
...
2024-08-29 04:41:26 INFO     All subHOGs for the rootHOG D0138574 as OrthoXML format is written in pickle_hogs/file_D0138574.pickle

Simarpreet-Kaur-Bhurji commented 2 months ago

Thank you so much for testing this on your side; will let you know how the run goes for us, fingers crossed.

Simarpreet-Kaur-Bhurji commented 1 month ago

Hi Sina and Adrian, sorry it has taken a while for me to get back to you. As it stands, the run is still not complete on my end. When I got the segmentation fault, I tried to increase the memory by updating it to the following in the FastOMA.nf file:

memory { mem_cat(getMaxFileSize(rhogsbig), nr_species as int) * task.attempt * 3 }

After that I again got the segmentation fault error, but with "maxwm <- 0.0". The error and log files are attached herewith.

command.log.txt command.err.txt

The fasta file: HOG_D0138736.fa.txt

The zipped folder is larger than the size allowed on GitHub; I can send it via email. I also tried sacct, but I no longer have the job ID of the affected job because it has been a while since I last ran it. Please let me know what you think is going on here.

srobb1 commented 1 month ago

Hello. I am having a RecursionError: maximum recursion depth exceeded error as well.

I am running 0.3.4. I am only running with 15 species.

I am pasting the last good line and the first error line from my .nextflow.log. I am also attaching a screenshot of my summary report. I am lost as to what I should do next to troubleshoot this issue.

Thank you, Sofia

from .nextflow.log:

Sep-19 20:50:47.487 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 2104348; id: 20; name: infer_roothogs (1); status: COMPLETED; exit: 1; error: -; workDir: /n/sci/SCI-004219-SBCHAMELEO/Chamaeleo_calyptratus/genomes/CCA3-haplotypes/analysis/gene_gain_loss/fastoma/work/85/169b353adc16f9830a97bcb887204c started: 1726787282133; exited: 2024-09-20T01:49:51.671447Z; ]
Sep-19 20:50:47.487 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for task: name=infer_roothogs (1); work-dir=/n/sci/SCI-004219-SBCHAMELEO/Chamaeleo_calyptratus/genomes/CCA3-haplotypes/analysis/gene_gain_loss/fastoma/work/85/169b353adc16f9830a97bcb887204c error [nextflow.exception.ProcessFailedException]: Process infer_roothogs (1) terminated with an error exit status (1)
Sep-19 20:50:47.518 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'infer_roothogs (1)'

Caused by: Process infer_roothogs (1) terminated with an error exit status (1)

Command executed:

fastoma-infer-roothogs --proteomes proteome --hogmap hogmaps --splice splice --out-rhog-folder "omamer_rhogs" -vv

Command exit status: 1

Command output:

291057
83867
There are 83867 candidate pairs of rhogs for merging.
There are 4776 clusters.

Command error:

        ^^^^^^^^^^^^^^^^^^
  File "/app/lib/python3.11/site-packages/FastOMA/_utils_roothog.py", line 1205, in HCS
    H = HCS(sub_graphs[0])
        ^^^^^^^^^^^^^^^^^^
  [Previous line repeated 4 more times]
  File "/app/lib/python3.11/site-packages/FastOMA/_utils_roothog.py", line 1198, in HCS
    E = nx.algorithms.connectivity.cuts.minimum_edge_cut(G)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<class 'networkx.utils.decorators.argmap'> compilation 4", line 3, in argmap_minimum_edge_cut_1
  File "/app/lib/python3.11/site-packages/networkx/utils/backends.py", line 633, in __call__
    return self.orig_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/python3.11/site-packages/networkx/algorithms/connectivity/cuts.py", line 607, in minimum_edge_cut
    this_cut = minimum_st_edge_cut(H, v, w, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<class 'networkx.utils.decorators.argmap'> compilation 30", line 3, in argmap_minimum_st_edge_cut_27
  File "/app/lib/python3.11/site-packages/networkx/utils/backends.py", line 633, in __call__
    return self.orig_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/python3.11/site-packages/networkx/algorithms/connectivity/cuts.py", line 150, in minimum_st_edge_cut
    cut_value, partition = nx.minimum_cut(H, s, t, **kwargs)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<class 'networkx.utils.decorators.argmap'> compilation 34", line 3, in argmap_minimum_cut_31
  File "/app/lib/python3.11/site-packages/networkx/utils/backends.py", line 633, in __call__
    return self.orig_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/python3.11/site-packages/networkx/algorithms/flow/maxflow.py", line 457, in minimum_cut
    non_reachable = set(dict(nx.shortest_path_length(R, target=_t)))
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<class 'networkx.utils.decorators.argmap'> compilation 42", line 3, in argmap_shortest_path_length_39
  File "/app/lib/python3.11/site-packages/networkx/utils/backends.py", line 633, in __call__
    return self.orig_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/python3.11/site-packages/networkx/algorithms/shortest_paths/generic.py", line 301, in shortest_path_length
    G = G.reverse(copy=False)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/python3.11/site-packages/networkx/classes/digraph.py", line 1334, in reverse
    return nx.reverse_view(self)
           ^^^^^^^^^^^^^^^^^^^^^
  File "<class 'networkx.utils.decorators.argmap'> compilation 46", line 4, in argmap_reverse_view_43
  File "/app/lib/python3.11/site-packages/networkx/classes/graphviews.py", line 266, in reverse_view
    newG = generic_graph_view(G)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/python3.11/site-packages/networkx/classes/graphviews.py", line 104, in generic_graph_view
    newG = G.__class__()
           ^^^^^^^^^^^^^
  File "/app/lib/python3.11/site-packages/networkx/classes/digraph.py", line 350, in __init__
    self._node = self.node_dict_factory()  # dictionary for node attr
                 ^^^^^^^^^^
RecursionError: maximum recursion depth exceeded

report_2024-09-19_09-22-09.html

sinamajidian commented 1 month ago

Hi @srobb1, thanks for reaching out. We believe we fixed this issue with the update provided in the dev branch (discussed above on this page). Please let us know if it helps your case as well. Feel free to open a new GitHub issue if the problem continues, and please provide more info about the system you are using and the tree. Best, Sina
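
In case the context helps other readers: the traceback above shows the recursion comes from the HCS (highly connected subgraphs) clustering in _utils_roothog.py, which splits a graph along minimum edge cuts and recurses into each part, so a large rootHOG can exceed Python's default recursion limit of 1000. As a rough sketch of the kind of rewrite that avoids the limit (illustrative only, using an explicit stack instead of recursion; not FastOMA's actual code):

import networkx as nx

def hcs_iterative(G):
    # Highly-connected-subgraphs clustering (Hartuv & Shamir) with an
    # explicit stack instead of recursion, so deep splits cannot hit
    # Python's recursion limit. G is assumed to be a connected nx.Graph.
    clusters = []
    stack = [G]
    while stack:
        H = stack.pop()
        n = H.number_of_nodes()
        if n <= 2:
            clusters.append(H)      # too small to split further
            continue
        cut = nx.minimum_edge_cut(H)
        if len(cut) > n / 2:        # highly connected: keep as one cluster
            clusters.append(H)
            continue
        H2 = nx.Graph(H)            # copy, then split along the minimum cut
        H2.remove_edges_from(cut)
        for comp in nx.connected_components(H2):
            stack.append(H.subgraph(comp).copy())
    return clusters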

sinamajidian commented 1 month ago

Hi @Simarpreet-Kaur-Bhurji, it looks like this is a different rootHOG. Could you possibly run the command in the .command.sh (available inside the work folder) for this rootHOG and see how much memory it needs? (It would be best to copy the needed files into a new folder and run it with Slurm to get the full log.) Btw, which MAFFT version are you using, and how did you install it? Yes, please send me the rootHOG; I could try it out too. We would love to arrange our next meeting, probably in mid-October.

Best, Sina

Simarpreet-Kaur-Bhurji commented 2 weeks ago

Hi Sina, sure thing, we will get in touch via email to schedule our next meeting. We can look into the above issues then.