Open Kinggerm opened 3 years ago
Good morning,
Could you please clarify what to do if 4 sequences with the longest IR (all of the same length) are identified? I would also like to know how to quickly and automatically determine which of the 2 cases described here applies: the presence of symmetric and asymmetric configurations, or the coincidence of short repeats in the SC and IR regions.
Additionally, I am confused by the abundance of small "tails" along the entire length of the graph.
Thank you in advance!
Best regards, Elena
Background
Small repeats in the assembly graph can hamper genome assembly. However, in some cases where we have prior knowledge of the target, e.g. the plastome structure, we may be able to resolve the graph without additional information. Here's one example.
Supposing we obtained a plastome-sufficient assembly graph like the one above, we may get 8 paths using GetOrganelle, with the following log info:
We use red to denote the contigs with higher coverage, which are contiguous in the graph and likely to be the IR regions (check nodes 11600 and 11536). So this sample is definitely not a DR plastome, which is rare and has so far only been found in Selaginella species, and PATH1 should thus be excluded. You may further confirm this by loading the *.CSV file into Bandage, which shows that the genes in the red contigs are likely to be in the IRs.
Now we have a relatively clear understanding of this plastome assembly graph: the LSC region and the IR regions share two short repeats (contigs 111440 and 111548, shown in orange).
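The coverage reasoning above can be roughly sketched in Python. This is only an illustration, not how GetOrganelle or Bandage work internally: the contig names and depths are hypothetical, and it assumes that most contigs are single-copy so that the median depth is a usable baseline.

```python
from statistics import median


def classify_by_coverage(coverage):
    """Label contigs by their depth relative to the median depth.

    Assumption (not from GetOrganelle): the median depth represents
    single-copy regions, so ~2x suggests an IR-like contig and ~3x or
    more suggests a repeat shared between regions.
    """
    baseline = median(coverage.values())
    labels = {}
    for name, depth in coverage.items():
        copies = max(1, round(depth / baseline))  # estimated copy number
        if copies == 1:
            labels[name] = "single-copy"
        elif copies == 2:
            labels[name] = "IR-like"
        else:
            labels[name] = "multi-copy repeat"
    return labels


# Hypothetical contig depths, for illustration only.
example_coverage = {
    "lsc_a": 100.0,
    "lsc_b": 104.0,
    "ssc": 98.0,
    "ir": 205.0,
    "short_repeat": 310.0,
}
example_labels = classify_by_coverage(example_coverage)
```

Such a classification is only a hint. The gene content of the contigs, as checked in Bandage, remains the more reliable evidence.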
Solution 1 (laborious)
We can manually duplicate contigs 111440 and 111548 and prune the connections, making the graph like a typical plastome assembly graph. Then we may export the edited assembly graph and use get_organelle_from_assembly.py to generate the final two paths.
Solution 2 (command line)
However, manually adjusting the assembly graph in Bandage is laborious. Another approach is to use the GetOrganelle script plastome_arch_info.py by typing in:
Then we will get:
Here we can find that PATH3 and PATH4 have the longest IR size, so they should be our final plastome result. One may use the following commands to quickly find the target file(s):
which gives:
You may add a for-loop and a cp to quickly pick the target path(s) out.
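The same selection can be sketched in Python instead of a shell loop. This is a minimal sketch under stated assumptions: the IR lengths are taken to be already parsed into a dictionary mapping file names to IR lengths, and the file names and numbers below are hypothetical, not the output of this sample. The function names are also invented for illustration. Every path tied for the longest IR is kept, so if four sequences share the maximum, all four are returned.

```python
import shutil
from pathlib import Path


def longest_ir_paths(ir_lengths):
    """Return all file names whose IR length equals the maximum (ties kept)."""
    if not ir_lengths:
        return []
    best = max(ir_lengths.values())
    return sorted(name for name, length in ir_lengths.items() if length == best)


def copy_selected(names, source_dir, target_dir):
    """Copy the selected files from source_dir into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        destination = target / Path(name).name
        shutil.copy(Path(source_dir) / name, destination)
        copied.append(destination)
    return copied


# Hypothetical IR lengths per path file, for illustration only.
example_ir_lengths = {
    "example_path_1.fasta": 0,
    "example_path_2.fasta": 25000,
    "example_path_3.fasta": 26000,
    "example_path_4.fasta": 26000,
}
selected = longest_ir_paths(example_ir_lengths)
```

Calling copy_selected(selected, "output_dir", "picked") would then copy the tied paths into a separate folder, where "output_dir" and "picked" are placeholder directory names.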