ekg / seqwish

alignment to variation graph inducer
MIT License

Reverse-strand alignments look messy sometimes #56

Closed glennhickey closed 4 years ago

glennhickey commented 4 years ago

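I'm experimenting with using seqwish to convert hal (https://github.com/ComparativeGenomicsToolkit/hal) alignments from cactus to pangenome graphs because the (very old) hal2vg (https://github.com/ComparativeGenomicsToolkit/hal2vg) uses too much memory.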

I'm noticing that the resulting graphs from seqwish seem to have more sequence than from hal2vg though. Here's a small example from a simulated 600kb mouse/rat/ancestor alignment where the seqwish graph contains 7% more sequence.

vg stats -lz evolver-sw.pg
nodes   258042
edges   349693
length  845864

vg stats -lz evolver-h2vg.pg
nodes   234353
edges   317222
length  789076
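As a quick check of the "7% more sequence" figure, the two `vg stats` outputs above can be compared directly. This is only a minimal arithmetic sketch; the dictionary names and the `percent_increase` helper are made up for illustration and are not part of vg or seqwish.

```python
# Values copied from the two `vg stats -lz` outputs above.
seqwish_stats = {"nodes": 258042, "edges": 349693, "length": 845864}
hal2vg_stats = {"nodes": 234353, "edges": 317222, "length": 789076}


def percent_increase(new, old):
    """Relative increase of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


for key in seqwish_stats:
    delta = percent_increase(seqwish_stats[key], hal2vg_stats[key])
    print(f"{key}: {delta:.1f}% more in the seqwish graph")
```

The length difference comes out at roughly 7.2%, and the node and edge counts are each about 10% higher.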

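Looking at the first difference I found in the deconstructed VCF gave me:

hal2vg: [image: evolver-635336-h2vg] https://user-images.githubusercontent.com/901102/85785911-c0708b80-b6f7-11ea-80e7-010fdff187ba.png

seqwish: [image: evolver-635336-sw] https://user-images.githubusercontent.com/901102/85785947-c8c8c680-b6f7-11ea-9ba2-ebea31ffe2d3.png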

as far as I can tell, there's nothing going terribly wrong. the graphs contain identical path sequences. But in the case of reverse-strand matches, it seems that seqwish may be pulling apart some homologies to make unnecessary nodes. I can't rule out a bug in my input PAF (but am not sure what kind of mistake would cause this).

Everything needed to reproduce: example.tar.gz https://github.com/ekg/seqwish/files/4833416/example.tar.gz

ekg commented 4 years ago

You may want to "groom" the graph to resolve this kind of thing. There is a tool in odgi for that.

I'll take a look.

ekg commented 4 years ago

This is caused by the nodes being added in the forward orientation of a path that is reversed relative to the others. Grooming should help. There might be a trick to add the sequence without producing this kind of pattern, but I suspect it will be tricky to implement correctly, so postprocessing might be safer.

ekg commented 4 years ago

Ok I see that this might be a problem. Thanks for the test case. I'll see what I can do.

glennhickey commented 4 years ago

All signs point to this being an issue with hal2paf's cigar strings being wrong. Sorry!

ekg commented 4 years ago

The reverse strand alignments do look messy, but we checked @glennhickey's test case and found the problem was due to PAF format confusion. The cigars were reversed, and that led to a blowup in the size of the graph, because most of the "matched" sequence was not actually matching.
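One way to catch this kind of problem before building a graph is to walk each cigar over the two sequences and count how many "matched" columns really hold the same base. The sketch below is only an illustration: it assumes the usual minimap2-style PAF convention (for a `-` strand record the cigar is read along the forward target, against the reverse complement of the query), the helper names and toy sequences are hypothetical, and it does not reproduce what hal2paf actually emitted.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a cigar string into (length, op) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reverse_cigar(cigar):
    """Reverse the order of operations, keeping each length with its op."""
    return "".join(f"{n}{op}" for n, op in reversed(parse_cigar(cigar)))


def revcomp(seq):
    """Reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def match_fraction(target, query, cigar):
    """Fraction of M/=/X columns in which both bases are identical.

    `query` must already be in the orientation the cigar describes,
    i.e. reverse-complemented for a '-' strand record.
    """
    t = q = same = aligned = 0
    for n, op in parse_cigar(cigar):
        if op in "M=X":
            same += sum(target[t + k] == query[q + k] for k in range(n))
            aligned += n
            t += n
            q += n
        elif op in "DN":  # consumes target only
            t += n
        elif op in "IS":  # consumes query only
            q += n
    return same / aligned if aligned else 0.0


# Toy reverse-strand alignment: the query is the reverse complement of the
# target with two bases deleted after the fourth position.
target = "AAACCCGGGTTTACGT"
query = revcomp(target[:4] + target[6:])
good_cigar = "4M2D10M"
bad_cigar = reverse_cigar(good_cigar)  # "10M2D4M"

print(match_fraction(target, revcomp(query), good_cigar))
print(match_fraction(target, revcomp(query), bad_cigar))
```

With the cigar in the right order every aligned column matches; with the order flipped the same alignment drops below full identity, which is the symptom described above. A low match fraction on `-` strand records only would be a hint to check the cigar orientation in the converter.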

ekg commented 4 years ago

So the way to fix the messy reverse strand alignments is "grooming" as in odgi groom.