Kinggerm / GetOrganelle

Organelle Genome Assembly Toolkit (Chloroplast/Mitochondrial/ITS)
GNU General Public License v3.0

Insert gaps with get_organelle_from_assembly.py - Extracting anonym from the assemblies failed #118

Open zactobias44 opened 2 years ago

zactobias44 commented 2 years ago

First, let me just say how great this software is! I see great improvement over other mitochondrial genome assembly tools.

Now for the issue. I have pretty low coverage (average coverage < 20, average base-coverage ~60) for many of my samples, so I end up with a lot of incomplete assemblies with 3-8 scaffolds. I am trying to insert gaps for downstream analysis (using the approach in the FAQ) with a published mitogenome for the same species as the reference, but I am running into issues.

join_spades_fastg_by_blast.py appears to be working just fine. It is the following command that I am struggling with:

get_organelle_from_assembly.py -g extended*Bs_mito_sc6ab.label.fastg.Ncontigs_added.fastg --genes /vortexfs1/home/ztobias/BotryllusMito/metadata/Bs_mito_sc6ab.label.fasta -t 8 -o out -F anonym --expected-min-size 1450 --expected-max-size 1550 --overwrite

The first type of error I get is the following (from get_org.log.txt)

2021-12-17 10:11:14,109 - INFO: Disentangling out/slimmed_assembly_graph.fastg as a/an Bs_mito_sc6ab.label-insufficient graph ... 
2021-12-17 10:12:05,878 - INFO: Disentangling failed: 'Unable to generate result with single copy vertex percentage < 50%'
2021-12-17 10:12:05,878 - INFO: Please visualize out/slimmed_assembly_graph.fastg and load out/slimmed_assembly_graph.csv to confirm the incomplete result.
2021-12-17 10:12:05,878 - INFO: If you have questions for us, please provide us with the get_org.log.txt file and the post-slimming graph in the format you like!
2021-12-17 10:12:05,878 - INFO: Extracting anonym from the assemblies failed.

I am not sure why this fails, as it works for many other samples. I have attached get_org.log.txt and the post-slimming graph (renamed with a .txt extension because GitHub won't accept .fastg files): get_org.log.txt, slimmed_assembly_graph.fastg.txt

I am also receiving a different type of error and will post in a separate issue.

Thanks!

Kinggerm commented 2 years ago

Sorry for the late reply. I really like the detailed description of your issue.

As you mentioned, the coverage is really low. Using join_spades_fastg_by_blast.py is a typical choice if you still want a circular sequence topology with gaps. However, join_spades_fastg_by_blast.py is currently immature and not smart enough: it adds every potential gap/overlap edge that it roughly detects by BLASTing against the reference. So we need to do some manual processing on the graph.

[Figure: slimmed_assembly_graph.fastg visualized in Bandage]

For example, as you can see from the attached figure (a Bandage visualization of slimmed_assembly_graph.fastg.txt), there are two parallel paths between edges 13 and 683: 692 alone, or 691-435-693. Edge 692 is redundant because 691-435-693 contains the informative edge 435, which comes from the original assembly.
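The parallel-path comparison can be sketched in Python as below. This is only an illustration of the criterion, not code from GetOrganelle: the function name `redundant_paths` is hypothetical, and the paths and the set of original-assembly edges are typed in by hand from the figure rather than parsed from the graph.

```python
def redundant_paths(parallel_paths, original_edges):
    """Return the parallel paths that contain no edge from the original assembly,
    provided at least one alternative path does contain such an edge."""
    has_original = [any(e in original_edges for e in path) for path in parallel_paths]
    if not any(has_original):
        # No path is supported by the original assembly, so nothing can be
        # called redundant by this criterion alone.
        return []
    return [path for path, ok in zip(parallel_paths, has_original) if not ok]


# Two alternative routes between edges 13 and 683, as described in the thread.
paths_13_to_683 = [["692"], ["691", "435", "693"]]
# Assumption: of the edges on these routes, only 435 comes from the original assembly.
original_edges = {"435"}

print(redundant_paths(paths_13_to_683, original_edges))
```

Any path flagged this way is a candidate for removal. It is still worth checking it in Bandage before deleting it.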

Besides comparing parallel paths, another criterion for removing redundant edges is identifying unrealistic ones. For example, edge 686 indicates a possible 1492-bp overlap between 463 and 445, which is impossible because neither 463 nor 445 is long enough to overlap by that much. So we remove 686. This should definitely be improved/fixed in join_spades_fastg_by_blast.py, and I would like to keep this issue open until it is. @wbyu

Following the above two criteria, we can remove all edges shown in red and obtain a circular sequence with gaps/overlaps.
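The length check behind this second criterion is simple enough to sketch in Python. The contig lengths below are made-up placeholders, since the thread does not give the real lengths of 463 and 445, and `overlap_is_plausible` is a hypothetical helper, not part of GetOrganelle.

```python
def overlap_is_plausible(overlap_len, left_len, right_len):
    """An overlap cannot be longer than either of the two contigs it joins."""
    return 0 < overlap_len <= min(left_len, right_len)


# Placeholder contig lengths (assumed for illustration only).
contig_lengths = {"463": 800, "445": 950}
# Overlap edge 686 claims a 1492-bp overlap between 463 and 445 (from the thread).
claimed_overlaps = {"686": ("463", "445", 1492)}

for edge, (left, right, overlap) in claimed_overlaps.items():
    ok = overlap_is_plausible(overlap, contig_lengths[left], contig_lengths[right])
    print(edge, "keep" if ok else "remove")
```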

Please let me know if you have further questions.

zactobias44 commented 2 years ago

Thanks for describing this solution! I'll go ahead and process these manually. Question about removing edges. Do you just go into the fastg file and remove the entries for the edges in question? And then run get_organelle_from_assembly.py with that edited fastg as input?

Kinggerm commented 2 years ago

You may do that, but I usually do it in Bandage, which can save the edited graph as a GFA file. Then run get_organelle_from_assembly.py with that GFA file.
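If you do edit the FASTG directly, note that an edge has to be removed in two places: its own records (forward and reverse-complement) and every mention of it in the neighbor lists of other headers. A rough Python sketch follows. It assumes SPAdes-style headers of the form `>EDGE_<id>_length_<n>_cov_<x>[']:neighbor1,neighbor2;`; whether the edges added by join_spades_fastg_by_blast.py follow exactly this naming is an assumption to check against your file. `edge_id` and `remove_edges` are hypothetical helpers, not part of GetOrganelle.

```python
import re


def edge_id(name):
    """Extract the ID from a SPAdes-style name such as EDGE_13_length_500_cov_9.1'."""
    match = re.match(r"EDGE_([^_]+)_", name)
    if match is None:
        raise ValueError("unrecognized edge name: " + name)
    return match.group(1)


def remove_edges(fastg_text, drop_ids):
    """Drop the records of the given edge IDs and their mentions in neighbor lists."""
    kept_lines = []
    keep_record = True
    for line in fastg_text.splitlines():
        if line.startswith(">"):
            header = line[1:].rstrip(";")
            name, _, neighbors = header.partition(":")
            keep_record = edge_id(name) not in drop_ids
            if keep_record:
                kept = [n for n in neighbors.split(",")
                        if n and edge_id(n) not in drop_ids]
                line = ">" + name + (":" + ",".join(kept) if kept else "") + ";"
        if keep_record:
            kept_lines.append(line)
    return "\n".join(kept_lines) + "\n"


# Tiny made-up graph, just to show the call; real input would be read from the file.
example = (
    ">EDGE_13_length_4_cov_9.0:EDGE_692_length_4_cov_1.0,EDGE_691_length_4_cov_1.0;\n"
    "ACGT\n"
    ">EDGE_692_length_4_cov_1.0;\n"
    "NNNN\n"
    ">EDGE_691_length_4_cov_1.0;\n"
    "NNNN\n"
)
print(remove_edges(example, {"692"}))
```

Write the result to a new file rather than overwriting the original, and open it in Bandage to confirm the graph looks as expected before passing it to get_organelle_from_assembly.py.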

zactobias44 commented 2 years ago

Oh okay great. Thank you!