Closed sjackman closed 6 years ago
Do you have an example GFA file I could use to base development around?
Just to make sure I understand, I imagine the process looking like:
C
lines and L
lines that indicate overlapping bits S
line substrings, tweak the L
start / ends This is a great feature suggestions - thanks!
Yep! That's gist of it. Thanks!
You can use these instructions from the Miniasm GitHub page to assemble a GFA file. https://github.com/lh3/miniasm#getting-started
I'm not worried about C
lines. I haven't seen them used in the wild myself.
The CIGAR strings output by Miniasm are a single match operation (e.g. 1234M
).
Here's the bit that requires a bit of thinking. X and A overlap by 1000 bp. X and B overlap by 2000 bp. How much do you trim from what?
You could trim 1000 bp from A and 2000 bp from X and B. The 2000 bp that was in the overlap is no longer represented in any contig. You could create a new segment of this 2000 bp sequence.
You can leave X alone, and trim 1000 bp from A and 2000 bp from B. I like this option. It works with the typical case where X has a few neighbours, and each neighbour of X has only X as a neighbour (on that particular side). It would get tricky in other cases, where the neighbours of X have neighbours other than X.
I chatted with @ekg about this today, and he reminded me that vg will do almost exactly this by default, even if it just gets a GFA file as input:
vg view -F in.gfa > bluntified.out.gfa
I think this handles the ugly case of X having many neighbors, but I haven't tested it as yet.
Fantastic! I'll test it out. Thanks, Eric and Erik! 🏅
Is https://doi.org/10.1101/gr.214155.116 a suitable citation for vg?
I believe this would be more correct: https://doi.org/10.1101/234856. We're responding to reviews on this right now.
How'd this go? This is also somewhat related to what @dfguan is working on.
I ran into difficulty compiling vg
from source. I'll try the precompiled binary.
Unicycler also trims overlapping sequences from a GFA file.
@rrwick How does Unicycler trim overlapping sequences, and could that tool be used outside of Unicycler to trim overlaps from a GFA file?
What problem did you have compiling vg from source?
On Fri, Mar 9, 2018, 18:37 Shaun Jackman notifications@github.com wrote:
I ran into difficulty compiling vg from source. I'll try the precompiled binary. Unicycler also trims overlapping sequences from a GFA file. @rrwick https://github.com/rrwick How does Unicycler trim overlapping sequences, and could that tool be used outside of Unicycler to trim overlaps from a GFA file?
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/edawson/gfakluge/issues/21#issuecomment-371885288, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EYzumBQEZOwTUVM1seSrGtyNOctcks5tcr3ggaJpZM4SVRXp .
@sjackman I'd love to know what blocked your vg build.
Overlap reduction is extremely non-trivial due to the fact that networks of overlaps can map together into the same graph region and resolving them into graph nodes and edges is tricky. We use a concept called the pinch graph to do this. @bpaten could comment better on it.
I wish all assemblers produced fully resolved graphs in GFA rather than stuffing the assembly complexity into overlap records...
@ekg Here's the error that I encountered when building vg
from source. I haven't looked into it much further. https://github.com/vgteam/vg/issues/1586
Sorry to come to this so late, I've been quite behind on my GitHub notifications... In case anyone is curious, here's how Unicycler deals with overlaps:
Unicycler has to deal with the relatively simple case of SPAdes graphs where all overlaps are exact (no goofy CIGARs) and the same size. If the overlaps were an even number of bases long, it would be easy: just chop off half the overlap sequence at each link. However, SPAdes makes graphs with odd length overlaps.
The problem is this: what if you have a graph with 99 bp overlaps. I could take the above approach (trim half the overlap from each side of the link) just with slightly unequal halves of 49 and 50 bp. But what if the graph contains a 99 bp segment? This is possible, at least for a SPAdes assembly graph. Now I have to be careful not to trim 50 bp from both ends, because that's trimming more sequence than the segment has!
I threw together an ad hoc algorithm to do it, and here's the source code: https://github.com/rrwick/Unicycler/blob/master/unicycler/assembly_graph.py#L1996-L2186
I think it works (haven't found an unsolvable case yet) but I don't doubt there's a better way to do it. I might look into the pinch graph method that Erik mentioned.
Ryan
Thanks for the description, Ryan.
I'm going to close this, as I think the other tools mentioned in the thread support this feature better than I could implement it in the gfakluge package. However, I'm open to reviving it later if someone were to take up the mantle in implementing it.
For future folk who stumble on this issue, both Unicycler and Flye output GFA assembly graphs with blunt sequences (no overlaps).
I'd find a tool to trim the overlapping bits of a Miniasm Nanopore assembly in GFA 1 format, so that the contigs are abut rather than overlapping.