Prunoideae / MitoFlex

A mitogenome toolkit inspired by MitoZ, while being more effective, precise and flexible.
GNU General Public License v3.0

Error when running SOAPdenovo-fusion #13

Open kenziegrover opened 1 year ago

kenziegrover commented 1 year ago

Dear developers, thanks for making a great program. I've run into an error message and am looking for guidance on how to resolve it.

I have installed MitoFlex on my Linux Ubuntu OS and have installed all dependencies. When I run load-modules, all modules appear to be working correctly. However, when I run the test data using /path/MitoFlex.py --config /path/test_config.py, the analysis runs up until it gets to SOAPdenovo where it prints the following error message:

"Error when running command '/path/MitoFlex/assemble/SOAPdenovo-fusion -D -s /hpath/MitoFlex/test/test_generated/test_generated.temp/assemble/test_generated.scaf/soaplib.txt -p 8 -K 75 -g /path/MitoFlex/test/test_generated/test_generated.temp/assemble/test_generated.scaf/k75 -c /path/MitoFlex/test/test_generated/test_generated.temp/assemble/test_generated.result/k141.contig.fa'. Exiting. A RuntimeError was occured. This is already considered in the code, but since it's thought to be errors in parts outside the MitoFlex can handle, it's NOT a bug caused by MitoFlex itself. Please check the error message and try to fix the possible cause of the crash, only as a last resort, send github a issue with a rerun with logger level set to 0."

I have run the analysis with logger set to 0 but it did not clarify what the issue is. Do you have any idea what might be causing the error and how to resolve it? Please let me know if there is any other information I can provide to clarify the issue.

Many thanks.

Prunoideae commented 1 year ago

Hmm, can you post the log of this run? It works without problems on my machine: I recreated the conda environment from environment.yml and it ran fine.

How did you install the dependencies? I think it's possibly caused by changes in program versions.

When I wrote everything, conda was still at a fairly early version, and the syntax for installing from a dependency file has changed since then. With current versions of conda, you need to run conda env create --file environment.yml to install the dependencies correctly.

kenziegrover commented 1 year ago

Hello,

I was unable to create the environment using the environment.yml file as it would get stuck on the solving environment step for hours, so I installed everything by hand using "conda install -c conda-forge package name=version", and used the versions specified in the environment.yml file.
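When packages are installed by hand like this, it can help to confirm that the installed versions really match the pins in environment.yml. The following is a minimal sketch, not part of MitoFlex: the function names are hypothetical, the parser only understands simple `- name=version[=build]` lines, and it assumes `conda list --json` is run inside the environment being checked.

```python
import json
import subprocess


def parse_pins(yml_text):
    """Collect name -> version from '- name=version[=build]' lines."""
    pins = {}
    for line in yml_text.splitlines():
        entry = line.strip()
        if not entry.startswith("- "):
            continue
        spec = entry[2:].strip()
        # Skip nested sections ("pip:"), pip-style pins and unpinned names.
        if spec.endswith(":") or "==" in spec or "=" not in spec:
            continue
        parts = spec.split("=")
        # Drop an optional channel prefix such as "conda-forge::name".
        name = parts[0].split("::")[-1]
        pins[name] = parts[1]
    return pins


def installed_versions():
    """Ask conda for the packages in the active environment."""
    out = subprocess.run(
        ["conda", "list", "--json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {pkg["name"]: pkg["version"] for pkg in json.loads(out)}


def find_mismatches(pins, installed):
    """Return {name: (pinned, installed_or_None)} for every disagreement."""
    return {
        name: (want, installed.get(name))
        for name, want in pins.items()
        if installed.get(name) != want
    }
```

A possible use is `find_mismatches(parse_pins(open("environment.yml").read()), installed_versions())`; an empty result means every pinned package is present at the pinned version. Wildcard pins such as `1.2.*` are compared literally here and would be reported as mismatches.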

I've attached the log file that was produced. Please let me know if you can think of anything that might be causing this issue.

test_generated.log

Prunoideae commented 1 year ago

This is weird... I think it may be a platform- or machine-specific problem. The error arises when MitoFlex tries to do scaffolding with the SOAPdenovo-fusion binary bundled with the pipeline. For some unknown reason this binary is not available on conda, so the bundled copy might not work if the CPU architecture or platform differs.
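One way to test the platform hypothesis is to launch the bundled binary directly and look at how it fails. This is a minimal sketch with a hypothetical helper name; it makes no assumption about which arguments SOAPdenovo-fusion accepts and simply reports whatever the operating system returns.

```python
import os
import platform
import subprocess


def diagnose_binary(path, args=()):
    """Run a bundled binary once and report why it might fail to start."""
    report = {
        "machine": platform.machine(),
        "system": platform.system(),
        "exists": os.path.isfile(path),
        "executable": os.access(path, os.X_OK),
    }
    if not (report["exists"] and report["executable"]):
        return report
    try:
        proc = subprocess.run(
            [path, *args], capture_output=True, text=True, timeout=60
        )
        report["returncode"] = proc.returncode
        report["stderr"] = proc.stderr.strip()
    except OSError as exc:
        # Typically "Exec format error" when the binary was built
        # for a different architecture or operating system.
        report["os_error"] = str(exc)
    except subprocess.TimeoutExpired:
        report["timed_out"] = True
    return report
```

Calling `diagnose_binary("/path/MitoFlex/assemble/SOAPdenovo-fusion")` with the path from the error message would show whether the file is executable at all. An `os_error` entry points to an architecture mismatch, while a nonzero return code with a loader message in `stderr` usually points to a missing shared library.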

As a workaround, I think you can skip the scaffolding module by setting disable_scaffolding = True in test_config.py. Since most animal mitochondrial genomes are smaller than 20 kbp, scaffolding usually adds little and can be safely skipped when assembling mitogenomes.
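Since the config file is itself Python, the workaround amounts to a single assignment. The fragment below is only an illustration: the option name comes from this thread, while its position in the file and the surrounding settings are not shown and should be left as they are.

```python
# Excerpt of a MitoFlex config file such as test_config.py.
# Only this option is taken from the thread; keep all other settings unchanged.
disable_scaffolding = True
```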

Also, MitoFlex itself implements a strategy to merge linear sequences that assemblers resolve incorrectly (most assemblers target linear assemblies). This makes skipping scaffolding safer, since the sequences are already merged.

kenziegrover commented 1 year ago

Thank you for your response. Disabling the scaffolding worked and the test files were executed correctly. I've since tried to run my own data files through the program and have run into another issue I hope you can advise me on. My fastq files are approximately 30 GB, were sequenced from Cyclophyllidean tapeworms, and I suspect they are rather messy. When I run them through the program with scaffolding disabled, Platyhelminthes selected for the clade, and a depth-list of '0,0,0,0,0,0,0', I get the following error message:

[10:41:09 WARN ] assemble : Iteration broke at kmer = 39, since no valid contig in kmer = 49 is done!
[10:41:09 INFO ] assemble : Scaffolding skipped due to disabled.
[10:41:09 DEBUG] helper : <function assemble at 0x7feefe38fd90> execution finished in 3.06s.
[10:41:09 DEBUG] helper : Entering <function findmitoscaf at 0x7feefe38fea0>.
[10:41:10 INFO ] findmitoscaf : Finding mitochondrial scaffold.
[10:41:10 DEBUG] findmitoscaf : Updating the general protein database.
[10:41:10 DEBUG] findmitoscaf : Generation finished with 177242 writes.
[10:41:10 DEBUG] findmitoscaf : nhmmer profile : /mnt/c/Users/kgrover/Desktop/MitoFlex/profile/CDS_HMM/Platyhelminthes.hmm
[10:41:10 DEBUG] annotation_tookit : Calling nhmmer.
[10:41:10 DEBUG] annotation_tookit : Out file : o=/mnt/c/Users/kgrover/Desktop/MitoFlex/A013441/A013441_A/A013441_A.temp/findmitoscaf/A013441_A.nhmmer.out, tbl=/mnt/c/Users/kgrover/Desktop/MitoFlex/A013441/A013441_A/A013441_A.temp/findmitoscaf/A013441_A.nhmmer.tblout
[10:41:10 DEBUG] annotation_tookit : HMM query have 0 results.
[10:41:10 DEBUG] findmitoscaf : Generating hmm-filtered fasta.
[10:41:10 ERROR] MitoFlex : Parsed fasta file is empty!

I've changed the kmer list to '21,29,39,49,59,87,113' but it always gets caught around kmer=59. It appears to me that it is not assembling correctly. I did notice a warning message saying the two files are not the same length. However, I am not aware of any way to change that, as my computer system is not able to manage 30 GB files.
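Files of this size can still be compared without loading them into memory, by streaming them line by line. The sketch below assumes the warning refers to the two paired FASTQ files and that both use the common four-lines-per-record layout; the function names are hypothetical and not part of MitoFlex.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_reads(path):
    """Count FASTQ records by streaming lines; memory use stays constant."""
    with open_text(path) as handle:
        lines = sum(1 for _ in handle)
    if lines % 4:
        raise ValueError(f"{path}: {lines} lines is not a multiple of 4")
    return lines // 4


def compare_pair(fq1, fq2):
    """Return (reads in fq1, reads in fq2, whether the counts match)."""
    n1, n2 = count_reads(fq1), count_reads(fq2)
    return n1, n2, n1 == n2
```

If the two counts differ, one file is probably truncated or the pair was filtered separately. A line count that is not a multiple of four raises an error, which on its own suggests an incomplete download or copy.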

I've attached my config file (reformatted to a .txt) and the log file output for your consideration. Are there any parameters that you would recommend changing?

Thank you for your time and consideration.

A013441_config.txt

A013441_A.log

Prunoideae commented 1 year ago

It looks like even megahit is not able to assemble enough contigs from your data. The iterative assembly stops if the next kmer fails to output any contigs (which means that assembling contigs with kmer=n would not give any result). This is not a problem in MitoFlex or megahit.
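The stopping rule described here can be pictured as a simple loop. This is an illustrative sketch only, not MitoFlex's or megahit's actual code: `assemble` is a hypothetical callback that takes a kmer size and the previous contigs and returns the new list of contigs.

```python
def iterate_kmers(kmers, assemble):
    """Assemble at increasing kmer sizes; stop when the next one yields nothing.

    `assemble(k, previous)` is a hypothetical callback returning a list
    of contigs built with kmer size `k` from the previous contigs.
    """
    contigs, last_k = [], None
    for k in kmers:
        result = assemble(k, contigs)
        if not result:
            # Nothing assembled at this kmer: keep the last good result.
            break
        contigs, last_k = result, k
    return last_k, contigs
```

With a callback that returns contigs up to kmer 39 and none at 49, the loop returns the kmer 39 result, which matches the "Iteration broke at kmer = 39" warning in the log above.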

I noticed that even with the depth filter set to 0, only 4 contigs are output at kmer=21. This often indicates a dataset with extremely low integrity. You might need to check your fastq inputs.
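A quick structural check of the first records can reveal truncated or corrupted input before any assembly is attempted. The sketch below is one possible check, assuming four-line FASTQ records; the function name and the record limit are hypothetical choices.

```python
import gzip
from itertools import islice


def check_fastq(path, max_records=100000):
    """Validate the first `max_records` records; return (n_ok, problems)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    problems, n_ok = [], 0
    with opener(path, "rt") as handle:
        for index in range(max_records):
            record = [line.rstrip("\r\n") for line in islice(handle, 4)]
            if not record:
                break  # clean end of file
            if len(record) < 4:
                problems.append((index, "truncated record"))
                break
            header, seq, plus, qual = record
            if not header.startswith("@") or not plus.startswith("+"):
                problems.append((index, "bad header or separator"))
            elif len(seq) != len(qual):
                problems.append((index, "sequence/quality length mismatch"))
            else:
                n_ok += 1
    return n_ok, problems
```

An empty `problems` list only shows that the sampled records are well formed. It says nothing about read quality, adapter content or coverage, which would need a dedicated quality-control tool.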

MitoFlex is designed to assemble mitogenomes from whole-genome or even metagenomic data, so it uses megahit, a metagenome assembler, as its core. At the assembly stage, MitoFlex has no prior knowledge of whether a contig is mitochondrial or not, so all assembled contigs are reported. Thus, if megahit reported only 4 contigs at kmer = 21, it means the input itself does not support producing more contigs for MitoFlex to process.

Maybe you can first try other available WGS data and see whether megahit works well? To be honest, megahit failing to produce a sufficient number of contigs at kmer=21 is pretty rare.