Kinggerm / GetOrganelle

Organelle Genome Assembly Toolkit (Chloroplast/Mitochondrial/ITS)
GNU General Public License v3.0

Multiple variants with structural variations; 'Consensus made' keeps repeating #223

Open SethMusker opened 1 year ago

SethMusker commented 1 year ago

Hi,

I've had generally good success assembling chloroplasts with get_organelle_from_assembly.py!

One of my samples fails though, with three repeats of the message "Consensus made: (187292-|189390+)" followed by "Disentangling failed: 'Unable to generate result with single copy vertex percentage < 50%'".

Is this intended behaviour? I imagined that after the consensus is made, disentangling would be attempted again with the consensus sequence replacing the two merged edges, but this is not happening. These two edges have the same length and similar depth, so I understand they're a cause for concern in principle, but they're also only 155 bp long and differ by only one base (C/T).

get_org.log.txt, slimmed_assembly_graph - Copy.gfa.txt, slimmed_assembly_graph - Copy.csv

If this is intended, I suppose the next step would be to manually remove one of the edges from the starting graph and repeat?

Cheers, Seth

Kinggerm commented 1 year ago

Hi Seth,

Thanks for providing enough details about this complex situation.

  1. You are right that these three trials are problematic. In principle, two different trials were intended, which would repeat the same message twice. Coincidentally, I just fixed this problem in a recent update on a testing branch of GetOrganelle. You still deserve the credit for reporting it. BTW, the testing branch contains many major changes and is not ready to use.
  2. Judging from the graph, this sample seems to suffer from both mt-pt and multiple variants (heteroplasmy or contamination). Either issue can violate the single-component (i.e. single-variant) assumption of GetOrganelle. Here, the contigs with a depth of about 10 should be the mt contigs. Because the coverages differ so much (about 10 vs. about 150), GetOrganelle can easily tell mt and pt apart. However, the two putative variants in this sample have similar average coverage, which results in two difficult problems.
    • SNPs usually result in simple parallel contigs like 187292 and 189390. For these, a consensus can be made where possible (GetOrganelle did this correctly for 187292 and 189390), multiple results each containing a different SNP can be generated, or a single result with the highest depth can be reported. Either way, the single-component assumption can still hold.
    • Structural variations may yield a complex tangled graph, which simple algorithms will find difficult to tell apart from a genuinely complex single-variant graph. Knowing what an IR-containing pt graph usually looks like, I can tell that 188784, 189490, 189444, 186851, and 188912 together form a structure made up of two pt variants with different IR boundaries. This is the real problem that triggers the failure of disentanglement. We may leave this issue open until there is a neat solution.
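
To illustrate the SNP case, here is a minimal sketch of how two equal-length parallel contigs could be merged into one consensus using IUPAC ambiguity codes. This is not GetOrganelle's actual implementation; the function name `iupac_consensus` and the toy sequences are made up for illustration, and both sequences are assumed to already be in the same orientation.

```python
# IUPAC ambiguity codes for two-base combinations.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def iupac_consensus(seq_a, seq_b):
    """Merge two same-orientation, equal-length sequences (hypothetical helper)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("parallel contigs must have equal length")
    merged = []
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Keep matching bases; encode a mismatch as an ambiguity code,
        # falling back to N for anything outside plain A/C/G/T pairs.
        merged.append(a if a == b else IUPAC.get(frozenset((a, b)), "N"))
    return "".join(merged)


# Toy sequences with a single C/T difference, as in the case described above.
print(iupac_consensus("ACGTCA", "ACGTTA"))  # ACGTYA
```

With a single C/T difference, the two parallel paths collapse into one sequence carrying a Y at the variable site.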

In summary, you are right about manually removing a contig from the graph and then rerunning. However, the focal contigs are not 187292 and 189390 (you may remove one of them if you think the consensus is not a good idea though). Instead, the real solution is to remove either 189444 or 186851.
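
For anyone who prefers to script the manual removal, here is a minimal sketch. It assumes the graph is a GFA 1 file in which segment (S) lines carry the contig name in the second column and link (L) lines name the two joined segments in the second and fourth columns. The function name `drop_segment` and the toy graph are hypothetical and not part of GetOrganelle; path (P) lines, if present, are not handled.

```python
def drop_segment(gfa_lines, name):
    """Return the GFA lines without segment `name` and without links touching it."""
    kept = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the segment record itself.
        if fields[0] == "S" and len(fields) > 1 and fields[1] == name:
            continue
        # Skip every link that starts or ends at the removed segment.
        if fields[0] == "L" and len(fields) > 3 and name in (fields[1], fields[3]):
            continue
        kept.append(line)
    return kept


# Toy graph with made-up sequences; only the contig names come from this thread.
toy = [
    "S\t188784\tACGT\n",
    "S\t189444\tGGCC\n",
    "S\t186851\tTTAA\n",
    "L\t188784\t+\t189444\t+\t0M\n",
    "L\t188784\t+\t186851\t+\t0M\n",
]
trimmed = drop_segment(toy, "189444")
print("".join(trimmed), end="")
```

On a real graph, the lines would be read from the slimmed .gfa file, the result written to a new file, and get_organelle_from_assembly.py rerun on that edited graph.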

Best, Jianjun

SethMusker commented 1 year ago

Hi Jianjun,

Immense thanks! Your reply was really informative and helpful. After removing 189444 the assembly finished nicely. For the record, the log is attached.

get_org.log.txt

All the best, Seth