ekg / seqwish

alignment to variation graph inducer
MIT License
143 stars 18 forks source link

Seqwish does not transitively align through SNPs #60

Closed glennhickey closed 4 years ago

glennhickey commented 4 years ago

Here's a simple case inspired from real data from a cactus alignment:

# dummy.fa
>Anc
AAATAAA
>1
AAATAAA
>2
AAATAAA
>3
AAAGAAA
>4
AAAGAAA

# dummy.paf
1   7   0   7   +   Anc 7   0   7   7   7   255 cg:Z:7M
2   7   0   7   +   Anc 7   0   7   7   7   255 cg:Z:7M
3   7   0   7   +   Anc 7   0   7   7   7   255 cg:Z:7M
4   7   0   7   +   Anc 7   0   7   7   7   255 cg:Z:7M

seqwish -p dummy.paf -s dummy.fa -g dummy.gfa -P

Those Gs are transitively aligned in the PAF by way of the aligning to T in Anc but they don't come out in the graph: dummy

This is the exact same issue as https://github.com/ComparativeGenomicsToolkit/hal2vg/issues/26, funnily enough.

ekg commented 4 years ago

Thanks. That's really weird! Will check it out.

On Mon, Aug 10, 2020, 22:50 Glenn Hickey notifications@github.com wrote:

Here's a simple case inspired from real data from a cactus alignment:

dummy.fa

Anc AAATAAA 1 AAATAAA 2 AAATAAA 3 AAAGAAA 4 AAAGAAA

dummy.paf

1 7 0 7 + Anc 7 0 7 7 7 255 cg:Z:7M 2 7 0 7 + Anc 7 0 7 7 7 255 cg:Z:7M 3 7 0 7 + Anc 7 0 7 7 7 255 cg:Z:7M 4 7 0 7 + Anc 7 0 7 7 7 255 cg:Z:7M

seqwish -p dummy.paf -s dummy.fa -g dummy.gfa -P

Those Gs are transitively aligned in the PAF by way of the aligning to T in Anc but they don't come out in the graph: [image: dummy] https://user-images.githubusercontent.com/901102/89830034-92210100-db29-11ea-820c-3944e1503e7c.png

This is the exact same issue as ComparativeGenomicsToolkit/hal2vg#26 https://github.com/ComparativeGenomicsToolkit/hal2vg/issues/26, funnily enough.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ekg/seqwish/issues/60, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEPMVXW6RODVBE6ODU3SABMTHANCNFSM4P2K2RQA .

ekg commented 4 years ago

OK, these aren't transitively aligned so the result you're getting is not a bug but an exact representation of the input alignments.

To try to write this out:

Anc AAATAAA
1   AAATAAA
2   AAATAAA
3   AAAgAAA
4   AAAgAAA

The upper-case characters match Anc. The lower-case gs aren't matched to anything, so seqwish does not close through them, nor map them to each other. It's not given any information to indicate that they match.

An all-to-all alignment would provide this, or adding in the alignments within each SNP or variant (but that seems tedious). Alternatively, you could push the output through smoothxg:

seqwish -p dummy.paf -s dummy.fa -g dummy.gfa
smoothxg -g dummy.gfa >dummy.smooth.gfa

This result matches your expectation:

H       VN:Z:1.0
S       1       AAA     DP:i:5  RC:i:15
L       1       +       3       +       0M
L       1       +       2       +       0M
S       2       T       DP:i:3  RC:i:3
L       2       +       4       +       0M
S       3       G       DP:i:2  RC:i:2
L       3       +       4       +       0M
S       4       AAA     DP:i:5  RC:i:15
P       Anc     1+,2+,4+        *
P       1       1+,2+,4+        *
P       2       1+,2+,4+        *
P       3       1+,3+,4+        *
P       4       1+,3+,4+        *

dummy smooth

glennhickey commented 4 years ago

Cool thanks. Kind of figured it was by design but thought I'd bring it up anyway, mostly to make me feel slightly less bad about forgetting to handle this in hal2vg

ekg commented 4 years ago

I don't think it's trivial to handle! Yeah, seqwish is meant to be a pretty "dumb" algorithm. It just takes the input that's given, with some optional filters (-k, -l, and -r) to remove or ignore exact matches of given lengths or implied local/global self-repeat count.