seqwish output with simulated data

Hi Erik, I am doing some precision/recall analysis on a simulated set of 13 samples, where each "sample" is a random mutant of a real-life plant chromosome. I introduced exactly 200 SVs per sample; the types are deletions, inversions, tandem duplications, and translocations, and the variant sizes are fixed at 500 bp and 10 kb. After using edyeet+seqwish to construct graphs from these sequences plus the original reference, I now have 14 graphs of increasing complexity and would like to see how well the variants can be "deconstructed" from them. So I took the GFA->vg route for each graph and used 'vg snarls' to get the bubbles out. It reports many more variants than I had introduced, even in a 2-sample graph. My suspicion is that edyeet misaligned some of the regions, and I want to try again with more stringent parameters. Do you think this is worth pursuing, or is edyeet not designed to handle this scenario?
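For reference, the kind of precision/recall comparison described above could be sketched roughly as follows. This is only an illustration under assumptions: it supposes the snarls have already been converted into (chrom, start, end, svtype) intervals on reference coordinates, and the function names, the 50 bp tolerance, and the example coordinates are all made up.

```python
def is_match(call, truth, slack=50):
    """True if two (chrom, start, end, svtype) records agree within `slack` bp.

    The tolerance is an arbitrary illustrative choice, not a recommended value.
    """
    return (
        call[0] == truth[0]
        and call[3] == truth[3]
        and abs(call[1] - truth[1]) <= slack
        and abs(call[2] - truth[2]) <= slack
    )


def precision_recall(truth, calls, slack=50):
    """Greedy one-to-one matching of calls against the simulated truth set."""
    used = set()  # indices of truth variants that are already matched
    true_positives = 0
    for call in calls:
        for i, t in enumerate(truth):
            if i not in used and is_match(call, t, slack):
                used.add(i)
                true_positives += 1
                break
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Made-up coordinates purely for illustration.
truth = [("chr1", 1000, 1500, "DEL"), ("chr1", 20000, 30000, "INV")]
calls = [
    ("chr1", 1010, 1495, "DEL"),  # matches the first simulated deletion
    ("chr1", 5000, 5003, "DEL"),  # small extra bubble, counted as a false positive
    ("chr1", 7000, 7001, "DEL"),  # another extra bubble
]
precision, recall = precision_recall(truth, calls)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With a setup like this, many small extra bubbles lower precision without affecting recall, which is the pattern reported here.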

Another question is about the GFA tags that seqwish puts in. Sorry if this is described in some obvious place, but what are the DP: and RC: tags for?

Hi Eugene,

I would try running smoothxg on the output. The edyeet alignments do not use affine gap penalties, which makes their representation of indels imprecise. But even better alignments (such as those made by kssw/minimap2) are not mutually normalized, and will produce complex looping motifs in low-complexity sequence such as microsatellites. Realigning the graph locally with POA (in smoothxg) normalizes the alignments relative to each other. The resulting graph tends to be smaller than the one made directly by seqwish.

The pangenome graph builder (pggb) makes some attempts to link all these steps together, if you want to get a sense of a typical approach that we are using.

Also, increasing the segment length can reduce collapse in the graph, which can otherwise look like additional variation.

The tags are probably coming out of odgi view. They are designed to trick Bandage into displaying coverage for the nodes. I think RC is the number of path steps on the node. DP is a metric that is scaled by the length to meet Bandage's expectations.
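To check what these tags contain in your own graph, a minimal sketch like the one below pulls the optional fields off GFA segment lines and divides each value by the node length. The function name and the example line are hypothetical (the line is not real seqwish or odgi output), and the sketch makes no claim about which tag holds the raw step count and which holds the length-scaled value.

```python
def parse_segment(line):
    """Parse a GFA 'S' line into (name, sequence, tags).

    Optional fields have the form TAG:TYPE:VALUE, e.g. DP:i:3.
    """
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "S":
        raise ValueError("not a segment line")
    name, sequence = fields[1], fields[2]
    tags = {}
    for field in fields[3:]:
        tag, value_type, value = field.split(":", 2)
        if value_type == "i":
            value = int(value)
        elif value_type == "f":
            value = float(value)
        tags[tag] = value
    return name, sequence, tags


# Made-up segment line for illustration only.
example = "S\t1\tACGTACGT\tDP:i:3\tRC:i:24"
name, sequence, tags = parse_segment(example)

# Dividing by node length shows how each tag relates to a per-base value.
per_base = {tag: value / len(sequence) for tag, value in tags.items()}
print(name, len(sequence), tags, per_base)
```

Comparing the raw and per-base values across a few nodes of different lengths should make it clear which tag is the count and which one is scaled.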

