Closed glennhickey closed 4 years ago
You may want to "groom" the graph to resolve this kind of thing. There is a tool in odgi for that.
I'll take a look.
On Thu, Jun 25, 2020, 21:29 Glenn Hickey notifications@github.com wrote:
I'm experimenting with using seqwish to convert hal https://github.com/ComparativeGenomicsToolkit/hal alignments from cactus to pangenome graphs because the (very old) hal2vg https://github.com/ComparativeGenomicsToolkit/hal2vg uses too much memory.
I'm noticing that the resulting graphs from seqwish seem to have more sequence than from hal2vg though. Here's a small example from a simulated 600kb mouse/rat/ancestor alignment where the seqwish graph contains 7% more sequence.
vg stats -lz evolver-sw.pg
nodes 258042
edges 349693
length 845864

vg stats -lz evolver-h2vg.pg
nodes 234353
edges 317222
length 789076
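As a quick check of the "7% more sequence" figure, here is the arithmetic on the two `length` totals reported by `vg stats`. The variable names are only illustrative.

```python
# Totals copied from the `vg stats -lz` output quoted in this thread.
seqwish_length = 845864  # evolver-sw.pg
hal2vg_length = 789076   # evolver-h2vg.pg

# Absolute and relative excess of the seqwish graph over the hal2vg graph.
extra_bp = seqwish_length - hal2vg_length
extra_pct = 100.0 * extra_bp / hal2vg_length

print(f"seqwish graph has {extra_bp} more bp ({extra_pct:.1f}% more sequence)")
```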
Looking at the first difference I found in the deconstructed VCF gave me:
hal2vg: [image: evolver-635336-h2vg] https://user-images.githubusercontent.com/901102/85785911-c0708b80-b6f7-11ea-80e7-010fdff187ba.png
seqwish: [image: evolver-635336-sw] https://user-images.githubusercontent.com/901102/85785947-c8c8c680-b6f7-11ea-9ba2-ebea31ffe2d3.png
As far as I can tell, there's nothing going terribly wrong. The graphs contain identical path sequences. But in the case of reverse-strand matches, it seems that seqwish may be pulling apart some homologies to make unnecessary nodes. I can't rule out a bug in my input PAF (but am not sure what kind of mistake would cause this).
Everything needed to reproduce: example.tar.gz https://github.com/ekg/seqwish/files/4833416/example.tar.gz
This is caused by the nodes being added in the forward orientation of a path that is relatively reversed. Grooming should help. There might be a trick to add the sequence without this kind of pattern, but I suspect it would be tricky to implement correctly, so postprocessing might be safer.
Ok I see that this might be a problem. Thanks for the test case. I'll see what I can do.
All signs point to this being an issue with hal2paf's cigar strings being wrong. Sorry!
The reverse-strand alignments do look messy, but we checked @glennhickey's test case and found that the problem was due to confusion over the PAF format. The CIGARs were reversed, which blew up the size of the graph because most of the "matched" sequence did not actually match.
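For anyone hitting something similar, one way to catch this before building a graph is to replay each PAF record's CIGAR against the actual sequences and see how many "matched" columns really match. The sketch below is not part of seqwish or hal2paf. The function name and the toy sequences are made up for illustration, and it assumes the minimap2-style convention that a '-' record's CIGAR is read along the target's forward strand against the reverse-complemented query slice.

```python
import re

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def cigar_match_identity(query, target, qstart, qend, strand, tstart, tend, cigar):
    """Fraction of aligned (M/=/X) columns whose bases are identical.

    Coordinates are 0-based, end-exclusive, on the forward strand of each
    sequence, as in PAF. For '-' records the query slice is reverse
    complemented before the CIGAR is replayed (an assumed convention).
    """
    q = query[qstart:qend]
    if strand == "-":
        q = revcomp(q)
    t = target[tstart:tend]
    qi = ti = same = total = 0
    for length, op in re.findall(r"(\d+)([MIDX=])", cigar):
        n = int(length)
        if op in "M=X":
            for k in range(n):
                total += 1
                same += q[qi + k].upper() == t[ti + k].upper()
            qi += n
            ti += n
        elif op == "I":  # bases present only in the query
            qi += n
        else:  # "D": bases present only in the target
            ti += n
    if qi != len(q) or ti != len(t):
        raise ValueError("CIGAR does not span the aligned intervals")
    return same / total if total else 0.0


# Toy reverse-strand record: the query is the reverse complement of the
# target with a 2 bp insertion after the third target base.
target = "ACGTACCTGAAG"
aligned_query = "ACG" + "TT" + "TACCTGAAG"
query = revcomp(aligned_query)

good = cigar_match_identity(query, target, 0, 14, "-", 0, 12, "3M2I9M")
bad = cigar_match_identity(query, target, 0, 14, "-", 0, 12, "9M2I3M")  # reversed CIGAR
print(good, bad)
```

A record whose identity drops sharply when checked this way, but recovers when its CIGAR operations are reversed, is a hint that the writer and the reader disagree about CIGAR orientation on the reverse strand.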
So the way to fix the messy reverse strand alignments is "grooming" as in odgi groom.
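To make the idea concrete, here is a toy illustration of what reorienting nodes means. It is not odgi's algorithm, just a simplified majority-vote heuristic on a made-up graph: a node that paths mostly traverse in reverse gets its sequence reverse complemented and its path steps flipped, so the spelled path sequences stay the same. Edges are left out to keep it short, and all names here are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def path_sequence(nodes, steps):
    """Spell out a path given as a list of (node_id, is_reverse) steps."""
    return "".join(revcomp(nodes[n]) if rev else nodes[n] for n, rev in steps)


def flip_mostly_reversed_nodes(nodes, paths):
    """Flip every node that is visited in reverse by most of its path steps."""
    reverse_votes = Counter()
    total_votes = Counter()
    for steps in paths.values():
        for n, rev in steps:
            total_votes[n] += 1
            reverse_votes[n] += rev  # True counts as 1
    to_flip = {n for n in nodes if 2 * reverse_votes[n] > total_votes[n]}
    new_nodes = {n: revcomp(s) if n in to_flip else s for n, s in nodes.items()}
    new_paths = {
        name: [(n, (not rev) if n in to_flip else rev) for n, rev in steps]
        for name, steps in paths.items()
    }
    return new_nodes, new_paths


# Made-up graph in which node 2 was stored in the opposite orientation.
nodes = {1: "ACG", 2: "TT", 3: "GGA"}
paths = {
    "mouse": [(1, False), (2, True), (3, False)],
    "rat": [(1, False), (2, True), (3, False)],
}

groomed_nodes, groomed_paths = flip_mostly_reversed_nodes(nodes, paths)
for name in paths:
    # The path sequences are unchanged by the reorientation.
    assert path_sequence(nodes, paths[name]) == path_sequence(groomed_nodes, groomed_paths[name])
```

For real graphs, odgi groom is the tool to use. This only shows why reorienting nodes tidies up the layout without changing any path sequence.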