Closed glennhickey closed 4 years ago
Thanks. That's really weird! Will check it out.
On Mon, Aug 10, 2020, 22:50 Glenn Hickey notifications@github.com wrote:
Here's a simple case inspired from real data from a cactus alignment:
dummy.fa
Anc AAATAAA 1 AAATAAA 2 AAATAAA 3 AAAGAAA 4 AAAGAAA
dummy.paf
1 7 0 7 + Anc 7 0 7 7 7 255 cg:Z:7M 2 7 0 7 + Anc 7 0 7 7 7 255 cg:Z:7M 3 7 0 7 + Anc 7 0 7 7 7 255 cg:Z:7M 4 7 0 7 + Anc 7 0 7 7 7 255 cg:Z:7M
seqwish -p dummy.paf -s dummy.fa -g dummy.gfa -P
Those Gs are transitively aligned in the PAF by way of the aligning to T in Anc but they don't come out in the graph: [image: dummy] https://user-images.githubusercontent.com/901102/89830034-92210100-db29-11ea-820c-3944e1503e7c.png
This is the exact same issue as ComparativeGenomicsToolkit/hal2vg#26 https://github.com/ComparativeGenomicsToolkit/hal2vg/issues/26, funnily enough.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ekg/seqwish/issues/60, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEPMVXW6RODVBE6ODU3SABMTHANCNFSM4P2K2RQA .
OK, these aren't transitively aligned so the result you're getting is not a bug but an exact representation of the input alignments.
To try to write this out:
Anc AAATAAA
1 AAATAAA
2 AAATAAA
3 AAAgAAA
4 AAAgAAA
The upper-case characters match Anc. The lower-case g
s aren't matched to anything, so seqwish does not close through them, nor map them to each other. It's not given any information to indicate that they match.
An all-to-all alignment would provide this, or adding in the alignments within each SNP or variant (but that seems tedious). Alternatively, you could push the output through smoothxg:
seqwish -p dummy.paf -s dummy.fa -g dummy.gfa
smoothxg -g dummy.gfa >dummy.smooth.gfa
This result matches your expectation:
H VN:Z:1.0
S 1 AAA DP:i:5 RC:i:15
L 1 + 3 + 0M
L 1 + 2 + 0M
S 2 T DP:i:3 RC:i:3
L 2 + 4 + 0M
S 3 G DP:i:2 RC:i:2
L 3 + 4 + 0M
S 4 AAA DP:i:5 RC:i:15
P Anc 1+,2+,4+ *
P 1 1+,2+,4+ *
P 2 1+,2+,4+ *
P 3 1+,3+,4+ *
P 4 1+,3+,4+ *
Cool thanks. Kind of figured it was by design but thought I'd bring it up anyway, mostly to make me feel slightly less bad about forgetting to handle this in hal2vg
I don't think it's trivial to handle! Yeah, seqwish is meant to be a pretty "dumb" algorithm. It just takes the input that's given, with some optional filters (-k
, -l
, and -r
) to remove or ignore exact matches of given lengths or implied local/global self-repeat count.
Here's a simple case inspired from real data from a cactus alignment:
Those
G
s are transitively aligned in the PAF by way of the aligning toT
inAnc
but they don't come out in the graph:This is the exact same issue as https://github.com/ComparativeGenomicsToolkit/hal2vg/issues/26, funnily enough.