edawson / gfakluge

A C++ library and utilities for manipulating the Graphical Fragment Assembly format.
http://edawson.github.io/gfakluge/
MIT License
51 stars 20 forks source link

Trim overlapping bits of contigs [feature request] #21

Closed sjackman closed 6 years ago

sjackman commented 6 years ago

I'd find a tool to trim the overlapping bits of a Miniasm Nanopore assembly in GFA 1 format, so that the contigs are abut rather than overlapping.

edawson commented 6 years ago

Do you have an example GFA file I could use to base development around?

Just to make sure I understand, I imagine the process looking like:

  1. Read in the file
  2. Look for C lines and L lines that indicate overlapping bits
  3. Parse the cigar strings, modify the corresponding S line substrings, tweak the L start / ends
  4. Output the graph

This is a great feature suggestions - thanks!

sjackman commented 6 years ago

Yep! That's gist of it. Thanks!

You can use these instructions from the Miniasm GitHub page to assemble a GFA file. https://github.com/lh3/miniasm#getting-started

I'm not worried about C lines. I haven't seen them used in the wild myself. The CIGAR strings output by Miniasm are a single match operation (e.g. 1234M).

Here's the bit that requires a bit of thinking. X and A overlap by 1000 bp. X and B overlap by 2000 bp. How much do you trim from what?

You could trim 1000 bp from A and 2000 bp from X and B. The 2000 bp that was in the overlap is no longer represented in any contig. You could create a new segment of this 2000 bp sequence.

You can leave X alone, and trim 1000 bp from A and 2000 bp from B. I like this option. It works with the typical case where X has a few neighbours, and each neighbour of X has only X as a neighbour (on that particular side). It would get tricky in other cases, where the neighbours of X have neighbours other than X.

edawson commented 6 years ago

I chatted with @ekg about this today, and he reminded me that vg will do almost exactly this by default, even if it just gets a GFA file as input:

   vg view -F in.gfa > bluntified.out.gfa

I think this handles the ugly case of X having many neighbors, but I haven't tested it as yet.

sjackman commented 6 years ago

Fantastic! I'll test it out. Thanks, Eric and Erik! 🏅

sjackman commented 6 years ago

Is https://doi.org/10.1101/gr.214155.116 a suitable citation for vg?

edawson commented 6 years ago

I believe this would be more correct: https://doi.org/10.1101/234856. We're responding to reviews on this right now.

edawson commented 6 years ago

How'd this go? This is also somewhat related to what @dfguan is working on.

sjackman commented 6 years ago

I ran into difficulty compiling vg from source. I'll try the precompiled binary. Unicycler also trims overlapping sequences from a GFA file.

@rrwick How does Unicycler trim overlapping sequences, and could that tool be used outside of Unicycler to trim overlaps from a GFA file?

ekg commented 6 years ago

What problem did you have compiling vg from source?

On Fri, Mar 9, 2018, 18:37 Shaun Jackman notifications@github.com wrote:

I ran into difficulty compiling vg from source. I'll try the precompiled binary. Unicycler also trims overlapping sequences from a GFA file. @rrwick https://github.com/rrwick How does Unicycler trim overlapping sequences, and could that tool be used outside of Unicycler to trim overlaps from a GFA file?

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/edawson/gfakluge/issues/21#issuecomment-371885288, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EYzumBQEZOwTUVM1seSrGtyNOctcks5tcr3ggaJpZM4SVRXp .

ekg commented 6 years ago

@sjackman I'd love to know what blocked your vg build.

Overlap reduction is extremely non-trivial due to the fact that networks of overlaps can map together into the same graph region and resolving them into graph nodes and edges is tricky. We use a concept called the pinch graph to do this. @bpaten could comment better on it.

I wish all assemblers produced fully resolved graphs in GFA rather than stuffing the assembly complexity into overlap records...

sjackman commented 6 years ago

@ekg Here's the error that I encountered when building vg from source. I haven't looked into it much further. https://github.com/vgteam/vg/issues/1586

rrwick commented 6 years ago

Sorry to come to this so late, I've been quite behind on my GitHub notifications... In case anyone is curious, here's how Unicycler deals with overlaps:

Unicycler has to deal with the relatively simple case of SPAdes graphs where all overlaps are exact (no goofy CIGARs) and the same size. If the overlaps were an even number of bases long, it would be easy: just chop off half the overlap sequence at each link. However, SPAdes makes graphs with odd length overlaps.

The problem is this: what if you have a graph with 99 bp overlaps. I could take the above approach (trim half the overlap from each side of the link) just with slightly unequal halves of 49 and 50 bp. But what if the graph contains a 99 bp segment? This is possible, at least for a SPAdes assembly graph. Now I have to be careful not to trim 50 bp from both ends, because that's trimming more sequence than the segment has!

I threw together an ad hoc algorithm to do it, and here's the source code: https://github.com/rrwick/Unicycler/blob/master/unicycler/assembly_graph.py#L1996-L2186

I think it works (haven't found an unsolvable case yet) but I don't doubt there's a better way to do it. I might look into the pinch graph method that Erik mentioned.

Ryan

sjackman commented 6 years ago

Thanks for the description, Ryan.

edawson commented 6 years ago

I'm going to close this, as I think the other tools mentioned in the thread support this feature better than I could implement it in the gfakluge package. However, I'm open to reviving it later if someone were to take up the mantle in implementing it.

sjackman commented 6 years ago

For future folk who stumble on this issue, both Unicycler and Flye output GFA assembly graphs with blunt sequences (no overlaps).