ekg / gimbricate

recompute GFA link overlaps
MIT License
25 stars 4 forks source link

gimbricate segfaults on large overlaps #2

Closed jflot closed 4 years ago

jflot commented 4 years ago

I am working on a GFA that comprises some quite large overlaps (the largest one is 198,380 bp), but when I run gimbricate -g x.gfa -f x.fa -p x.paf -n -t 1 gimbricate segfaults on an overlap that is 66,821 bp, resulting in a (extremely) truncated PAF. When I check the overlaps in the truncated PAF outputted by gimbricate the largest one is 22,777 bp, which suggests that the maximum overlap length gimbricate can digest is somewhere between 22,777 bp and 66,821 bp.

Here is the point where gimbricate stops: L 1837 + 95420 + 711M L 1837 - 28368 + 1170M L 1837 - 28369 + 1170M L 1838 + 61856 - 570M L 1838 - 6328 + 246M L 1838 - 30095 + 547M L 1839 - 6328 + 246M L 1841 + 93169 + 800M L 1841 + 93170 + 800M Segmentation fault

and here is the list of overlaps of contig 1841 in the starting GFA: L 1841 + 93169 + 800M L 1841 + 93170 + 800M L 1841 - 5754 + 66821M L 3524 + 1841 + 66821M

Thanks a lot in advance for your help.

ekg commented 4 years ago

This is to be expected. It would be easy to resolve by using edlib or dozeu to generate the overlap alignments.

If the alignments are that long, you can also directly regenerate the graph using minimap2 and seqwish. I know that might be complicated. You might also want to force alignment only between nodes that are already paired in your graph.

On Wed, Jun 3, 2020, 17:35 Jean-François Flot notifications@github.com wrote:

I am working on a GFA that comprises some quite large overlaps (the largest one is 198,380 bp), but when I run gimbricate -g x.gfa -f x.fa -p x.paf -n -t 1 gimbricate segfaults on an overlap that is 66,821 bp, resulting in a (extremely) truncated PAF. When I check the overlaps in the truncated PAF outputted by gimbricate the largest one is 22,777 bp, which suggests that the maximum overlap length gimbricate can digest is somewhere between 22,777 bp and 66,821 bp.

Here is the point where gimbricate stops: L 1837 + 95420 + 711M L 1837 - 28368 + 1170M L 1837 - 28369 + 1170M L 1838 + 61856 - 570M L 1838 - 6328 + 246M L 1838 - 30095 + 547M L 1839 - 6328 + 246M L 1841 + 93169 + 800M L 1841 + 93170 + 800M Segmentation fault

and here is the list of overlaps of contig 1841 in the starting GFA: L 1841 + 93169 + 800M L 1841 + 93170 + 800M L 1841 - 5754 + 66821M L 3524 + 1841 + 66821M

Thanks a lot in advance for your help.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ekg/gimbricate/issues/2, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEKOU6EXA7Z6GX6ANH3RUZUT5ANCNFSM4NRY6LRQ .

jflot commented 4 years ago

The GFA comes from a assembler we are currently developing, and we would like to turn this GFA into rGFA. I tried to export the contigs and detect alignments using minimap2, but this creates a lot of spurious hits that are not easy to filter out (there are very long overlaps in the GFA I am dealing with, but also very small ones). Forcing alignments only between nodes that already overlap in my graph sounds interesting - what would be the easiest way to do so? Also, one particularity of the GFAs I am dealing with is that they contain only perfect overlaps. As a result, it should be not too difficult to bluntify them, I think. Since my previous message I have been playing a bit more with gimbricate and it appears that the length threshold that causes the segfault is somewhere between 32500 bp and 50000 bp.

ekg commented 4 years ago

The segfault will be system dependent, and is caused by the size of the alignment matrices that you're making when you do these alignments. These are quadratic in the size of the sequences that are aligned. To fix this, it'd be sufficient to use a banded aligner like edlib (very fast).

It is true that minimap2 is very difficult to work with. It will produce alignments for every chain it finds, while what's needed in the case of graph construction are a few chains that cover the entire query with minimal recombination between the sequences you're mapping against. Based on this issue, I'm working on an alternative but it isn't ready for this use yet. Presently, I tend to just remove the short alignments. Different scripted approaches to the filtering might also work well (such as keeping the N best alignments that cover the read) but I haven't worked with them much.

I'm curious what you're going to use rGFA for. Is your assembly graph hierarchical? That might be a requirement for embedding it in rGFA. Otherwise, in generic GFA based graphs, paths embedded in the graph provide coordinates. These positions can be attached to the nodes using xg -i graph.gfa -G >annotated.gfa.

ekg commented 4 years ago

Again, as for the segfault: you're almost certainly running out of memory, or not having enough to build the matrices. This error seems to not be handled correctly, resulting in the segfault.

jflot commented 4 years ago

What I am chiefly interested in it to "bluntify" the GFAs produced by our assembler. Right now the assembly size we get (= sum of the sizes of the nodes in our GFA) is 1.5 times the size we expect, but when uploading our GFAs to Bandage the "Total length (no overlaps)" size matches precisely what we expect. So if we "bluntify" the GFA this should yield a graph with the proper total node size, without redundancy. And indeed, when I used minimap2 to map the set of contigs on itself then passed the resulting PAF to seqwish to generate a rGFA, the size was right (but there were many spurious edges). I should mention that our genome is highly repetitive, so any approach that look for exact matches between contigs without constraining them to nodes that are adjacent in the GFA will likely generate lots of spurious edges. Do you know any approach other than gimbricate to generate a PAF file from a GFA? Alternatively, might it be possible for you to add a switch to gimbricate that makes it only consider perfect overlaps (thereby possibly using much less memory and therefore making it able to deal with much larger overlaps, as long as they are perfect)?

ekg commented 4 years ago

I don't know of a similar tool to convert between a GFA and PAF+FASTA.

Having exact overlaps could simplify things. You might be able to use stark: https://github.com/hnikaein/stark to bluntify the graph. I'm just not sure if it will work on graphs with variable length exact overlaps. It's designed for de Bruijn graphs. That would be nice because in principle it won't increase the k-mer complexity of the graph.

As for gimbricate and your graphs, if your overlaps are correct then we can just turn off the realignment. Does that make sense to you? I could add a flag to turn it off and just emit the PAF and FASTA. Then you can drop the result into seqwish.

Bear in mind that seqwish doesn't make rGFA, although it does make a subset of GFA that's blunt-ended and compatible with variation graph tools (such as those described here https://pangenome.github.io/).

On Wed, Jun 3, 2020 at 8:14 PM Jean-François Flot notifications@github.com wrote:

What I am chiefly interested in it to "bluntify" the GFAs produced by our assembler. Right now the assembly size we get (= sum of the sizes of the nodes in our GFA) is 1.5 times the size we expect, but when uploading our GFAs to Bandage the "Total length (no overlaps)" size matches precisely what we expect. So if we "bluntify" the GFA this should yield a graph with the proper total node size, without redundancy. And indeed, when I used minimap2 to map the set of contigs on itself then passed the resulting PAF to seqwish to generate a rGFA, the size was right (but there were many spurious edges). I should mention that our genome is highly repetitive, so any approach that look for exact matches between contigs without constraining them to nodes that are adjacent in the GFA will likely generate lots of spurious edges. Do you know any approach other than gimbricate to generate a PAF file from a GFA? Alternatively, might it be possible for you to add a switch to gimbricate that makes it only consider perfect overlaps (thereby possibly using much less memory and therefore making it able to deal with much larger overlaps, as long as they are perfect)?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ekg/gimbricate/issues/2#issuecomment-638372018, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEPCCOTBJ544ACYV6NLRU2HG5ANCNFSM4NRY6LRQ .

jflot commented 4 years ago

Thanks for the info regarding stark, I just tried it and it segfaults when I give it a GFA as input but it will certainly be useful to me when working on DBGs. Regarding my current issue, it would be great to have a switch to turn off the realignment, it will solve my problem indeed.

jflot commented 4 years ago

While waiting for the flag to be added to gimbricate, we made a fork in which we commented out the code realigning the overlaps. This seems to be working well in our case.

ekg commented 4 years ago

Thanks for letting me know, and glad this is working. If you make this optional by adding a command line flag then I'll be happy to merge it back in.

Edlib is the next step. That's not too hard to integrate and will mostly resolve this issue.

On Sun, Jul 19, 2020, 16:38 Jean-François Flot notifications@github.com wrote:

While waiting for the flag to be added to gimbricate, we made a fork https://github.com/eeg-ebe/gimbricate in which we commented out the code realigning the overlaps. This seems to be working well in our case.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ekg/gimbricate/issues/2#issuecomment-660654530, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEJB6PBMUGYWNTJ2NEDR4MAMXANCNFSM4NRY6LRQ .

ekg commented 4 years ago

In #6 I have a fix in for this. Would you be willing to test on some of your datasets? I'll move to merge shortly, so you might just pull from master when you get to this.

ekg commented 4 years ago

I can reproduce the out of memory issue on master:

-> % time ~/gimbricate-master/build/gimbricate -g assemblyGraph_k63.fixed.gfa -p assemblyGraph_k63.paf -f assemblyGraph_k63.fa -t 16 >assemblyGraph_k63.gimbry.gfa 
[2]    1349399 killed     ~/gimbricate-master/build/gimbricate -g assemblyGraph_k63.fixed.gfa -p  -f  -

This is not a problem with the edlib-overlaps branch.

ekg commented 4 years ago

This has been resolved via #6