ekg / seqwish

alignment to variation graph inducer
MIT License
143 stars 18 forks source link

path length changes in graph construction #20

Closed ChriKub closed 5 years ago

ChriKub commented 5 years ago

I use seqwish to align 17 sequences of varying length (169bp - 30082bp), but after the graph inducement the (path) length of sequence >5 changes from 30082bp to 30805bp, thus changing the sequence content. minimap2 bubbles_sub.fasta bubbles_sub.fasta -c -X > bubbles_sub.paf seqwish -p bubbles_sub.paf -s bubbles_sub.fasta -b bubbles_sub.graph -g bubbles_sub.gfa -r 21

Fasta Download

ekg commented 5 years ago

Thanks for sharing this. What commands did you use?

I thought that there was a check for path integrity that will blow up when the sequence is altered. There could be some other kind of corruption. If it happens at the point of input then this wouldn't be caught.

On Thu, Aug 15, 2019, 11:03 ChriKub notifications@github.com wrote:

I use seqwish to align 17 sequences of varying length (169bp - 30082bp), but after the graph inducement the (path) length of sequence >5 changes from 30082bp to 30805bp, thus changing the sequence content. minimap2 bubbles_sub.fasta bubbles_sub.fasta -c -X > bubbles_sub.paf seqwish -p bubbles_sub.paf -s bubbles_sub.fasta -b bubbles_sub.graph -g bubbles_sub.gfa -r 21

Fasta Download https://owncloud.tuebingen.mpg.de/index.php/s/RbWjZM3nQG5kaNc

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ekg/seqwish/issues/20?email_source=notifications&email_token=AABDQEPDHGTIBY25U2KFVVDQEULXLA5CNFSM4IL4NH62YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4HFMNEYQ, or mute the thread https://github.com/notifications/unsubscribe-auth/AABDQEO7MQJEQQOM7Q5RTPTQEULXLANCNFSM4IL4NH6Q .

ChriKub commented 5 years ago

I ran these commands, using both the latest version of minimap2 and seqwish. I then loaded the gfa into vg(v.1.14) and used vg paths to check the path length:

minimap2 bubbles_sub.fasta bubbles_sub.fasta -c -X > bubbles_sub.paf
seqwish -p bubbles_sub.paf -s bubbles_sub.fasta -b bubbles_sub.graph -g bubbles_sub.gfa -r 21
vg view -F -v bubbles_sub.gfa > bubbles_sub.vg
vg paths -E -v bubbles_sub.vg
ekg commented 5 years ago

Do you get the same problem without -r21?

On Thu, Aug 15, 2019, 11:23 ChriKub notifications@github.com wrote:

I ran these commands, using both the latest version of minimap2 and seqwish. I then loaded the gfa into vg(v.1.14) and used vg paths to check the path length:

minimap2 bubbles_sub.fasta bubbles_sub.fasta -c -X > bubbles_sub.paf seqwish -p bubbles_sub.paf -s bubbles_sub.fasta -b bubbles_sub.graph -g bubbles_sub.gfa -r 21 vg view -F -v bubbles_sub.gfa > bubbles_sub.vg vg paths -E -v bubbles_sub.vg

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ekg/seqwish/issues/20?email_source=notifications&email_token=AABDQEKSQQDAK7ITSFVSZI3QEUOCDA5CNFSM4IL4NH62YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4LKGEI#issuecomment-521577233, or mute the thread https://github.com/notifications/unsubscribe-auth/AABDQEPLSHCBUFOKAEJV5TDQEUOCDANCNFSM4IL4NH6Q .

ChriKub commented 5 years ago

Yes. Same result for the path length.

ChriKub commented 5 years ago

I uploaded the fasta files to reproduce this in the first post.

ekg commented 5 years ago

Thanks, I've reproduced the problem. I'll see what's up.

On Fri, Aug 16, 2019 at 8:35 AM ChriKub notifications@github.com wrote:

I uploaded the fasta files to reproduce this in the first post.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ekg/seqwish/issues/20?email_source=notifications&email_token=AABDQEPH34E7DZDKYUCD4N3QEZDBXA5CNFSM4IL4NH62YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4NZUFI#issuecomment-521902613, or mute the thread https://github.com/notifications/unsubscribe-auth/AABDQENMRZV7NHYRCI4JUA3QEZDBXANCNFSM4IL4NH6Q .

ekg commented 5 years ago

It looks like there's been a problem related to this for some time (since 87b6a85 at least):

time minimap2 -c -x asm20 -X -t 4 DRB1-3123.fa DRB1-3123.fa | fpa drop -l 3000 | gzip >DRB1-3123.paf.gz && time ~/seqwish/bin/seqwish-87b6a85 -s DRB1-3123.fa -p DRB1-3123.paf.gz -b DRB1-3123.work -g DRB1-3123.gfa-87b6a85  && vg view -Fv DRB1-3123.gfa-87b6a85 | vg paths -v - -F >DRB1-3123.gfa-87b6a85.fasta && samtools faidx DRB1-3123.gfa-87b6a85.fasta

But when comparing with the current state, it's not exactly the same issue:

erik@foldy [05:36:26 PM] [~/seqwish/test] [path-bug]
-> % paste <(sort -V DRB1-3123.fa.fai)  <(sort -V DRB1-3123.gfa.fasta.fai ) | awk '{ print $7 - $2 }'
0
0
0
1
-44
0
0
0
0
0
0
0
erik@foldy [05:36:35 PM] [~/seqwish/test] [path-bug]
-> % paste <(sort -V DRB1-3123.fa.fai)  <(sort -V DRB1-3123.gfa-87b6a85.fasta.fai ) | awk '{ print $7 - $2 }'
0
0
0
46
2
0
0
0
0
0
0
0

This should have been tested before, sorry about this.

ekg commented 5 years ago

I found the problem. I wasn't splitting nodes at the start and end of paths. A fix should be in soon.