attractivechaos / klib

A standalone and lightweight C library
http://attractivechaos.github.io/klib/
MIT License
Can you describe the difference between the various ksw_* #101

Closed nh13 closed 6 years ago

nh13 commented 6 years ago

@attractivechaos @lh3

I am trying to figure out the types of alignment based on the function names, guessing the following:

** doesn't behave this way, which is why I am asking

Also, I have a ksw_glocal implementation if you are looking for a contribution.

attractivechaos commented 6 years ago

ksw is largely deprecated. ksw2 is generally the way to go, although some ksw functionality is missing from ksw2.

nh13 commented 6 years ago

@attractivechaos thanks, it looks like ksw_extend isn't working for the following:

query = GATTAC
target = AAAAGATTACAAAAA

It reports a query end of 2 (_qle), a target end of 1 (_tle), and a score of 1. I assume that, because it is "extend", the query start and target start are both 0. See the details below.

To me, it looks like it allows the alignment to start anywhere:

Thanks for any help you can offer.

```
$ cat target.fasta
>target
AAAAGATTACAAAAA
$ cat query.fasta
>query
GATTAC
$ ./ksw target.fasta query.fasta
target 4 10 query 0 6 6 1 0
target 0 2 query 4 2 2 2 6
ksw_extend 1 2 1
$ git diff
diff --git a/ksw.c b/ksw.c
index 742fec9..45b602c 100644
--- a/ksw.c
+++ b/ksw.c
@@ -534,7 +534,7 @@ int ksw_global(int qlen, const uint8_t *query, int tlen, const uint8_t *target,
  * Main function (not compiled by default) *
  *******************************************/
 
-#ifdef _KSW_MAIN
+//#ifdef _KSW_MAIN
 
 #include
 #include
@@ -622,6 +622,11 @@ int main(int argc, char *argv[])
 				if (r.score >= minsc)
 					printf("%s\t%d\t%d\t%s\t%d\t%d\t%d\t%d\t%d\n", kst->name.s, r.tb, r.te+1, ksq->name.s, (int)ksq->seq.l - r.qb, (int)ksq->seq.l - 1 - r.qe, r.score, r.score2, r.te2);
 			}
+
+			int qle, tle;
+			int s = ksw_extend(ksq->seq.l, (uint8_t*)ksq->seq.s, kst->seq.l, (uint8_t*)kst->seq.s, 5, mat, gapo, gape, 10000, 0, &qle, &tle);
+			printf("ksw_extend\t%d\t%d\t%d\n", s, qle, tle);
+
+
 		}
 		free(q[0]); free(q[1]);
 	}
@@ -630,4 +635,4 @@ int main(int argc, char *argv[])
 	kseq_destroy(ksq); gzclose(fpq);
 	return 0;
 }
-#endif
+//#endif
```

Thanks for your help!

attractivechaos commented 6 years ago

Extension is also called seed extension. You have to have a seed hit before calling it, or it won't give you a meaningful result. Also, don't use ksw. Use ksw2.
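To illustrate why an extension routine needs a seed, here is a minimal sketch of a gapped extension with a linear gap penalty. It is not klib's `ksw_extend`: `extend_sketch`, its parameters, and its scoring are hypothetical and written only for this note. The assumption it encodes is that both sequences start at a position already known to align, and that `h0` carries the score accumulated so far; a cell whose score falls to 0 is dead and cannot restart.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of klib. The alignment is pinned to start at
 * query[0]/target[0] with initial score h0. Returns the best score reached and
 * stores the exclusive query/target end of that best cell in *qle and *tle. */
static int extend_sketch(const char *query, const char *target,
                         int match, int mismatch, int gap, int h0,
                         int *qle, int *tle)
{
    int qlen = (int)strlen(query), tlen = (int)strlen(target);
    int i, j, best = h0;
    int *prev = calloc((size_t)qlen + 1, sizeof(int));
    int *cur = calloc((size_t)qlen + 1, sizeof(int));
    assert(prev && cur && h0 >= 0 && gap > 0);
    *qle = *tle = 0;
    /* Row 0: only query bases consumed after the anchor. */
    for (j = 0; j <= qlen; ++j)
        prev[j] = h0 - j * gap > 0 ? h0 - j * gap : 0;
    for (i = 1; i <= tlen; ++i) {
        int *tmp;
        cur[0] = h0 - i * gap > 0 ? h0 - i * gap : 0;
        for (j = 1; j <= qlen; ++j) {
            int s = query[j - 1] == target[i - 1] ? match : -mismatch;
            /* A predecessor with score 0 is dead and contributes nothing. */
            int d = prev[j - 1] > 0 ? prev[j - 1] + s : 0;
            int u = prev[j] > 0 ? prev[j] - gap : 0;
            int l = cur[j - 1] > 0 ? cur[j - 1] - gap : 0;
            int h = d > u ? d : u;
            if (l > h) h = l;
            if (h < 0) h = 0;
            cur[j] = h;
            if (h > best) { best = h; *qle = j; *tle = i; }
        }
        tmp = prev; prev = cur; cur = tmp;
    }
    free(prev); free(cur);
    return best;
}
```

In this sketch, calling it on the full `GATTAC` / `AAAAGATTACAAAAA` pair with `h0 = 0` extends nothing, because the starting cell is already dead. Calling it on the sequence to the right of a hypothetical `GATT` seed (`"AC"` against `"ACAAAAA"`) with `h0 = 4` extends through the remaining two bases.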

nh13 commented 6 years ago

I see: you don't feed in the sequence after the seed; you feed in the sequence with the seed included.

ksw2 doesn't have local or glocal. It was much easier to use ksw_local, and adapt ksw_global to create ksw_glocal.
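For readers unfamiliar with the term, a glocal score can be sketched as a global-alignment recurrence with two changes: leading target bases cost nothing, and the result is the maximum over the last query row. This assumes "glocal" means aligning the full query to a substring of the target. The code below is not the `ksw_glocal` mentioned above and is not derived from `ksw_global`; `glocal_score` is a hypothetical name, and a linear gap penalty is used for brevity.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch, not klib code: best score for aligning the whole query
 * to any substring of the target. Stores the exclusive target end of that
 * substring in *t_end. Finding the start would need a traceback (omitted). */
static int glocal_score(const char *query, const char *target,
                        int match, int mismatch, int gap, int *t_end)
{
    int qlen = (int)strlen(query), tlen = (int)strlen(target);
    int i, j, best;
    /* Row 0 stays all zeros: skipping a target prefix is free. */
    int *prev = calloc((size_t)tlen + 1, sizeof(int));
    int *cur = calloc((size_t)tlen + 1, sizeof(int));
    assert(prev && cur);
    for (i = 1; i <= qlen; ++i) {
        int *tmp;
        cur[0] = -i * gap; /* query bases must all be consumed */
        for (j = 1; j <= tlen; ++j) {
            int s = query[i - 1] == target[j - 1] ? match : -mismatch;
            int h = prev[j - 1] + s;                    /* match or mismatch */
            if (prev[j] - gap > h) h = prev[j] - gap;    /* query base vs gap */
            if (cur[j - 1] - gap > h) h = cur[j - 1] - gap; /* target base vs gap */
            cur[j] = h;
        }
        tmp = prev; prev = cur; cur = tmp;
    }
    /* prev now holds the last query row; a free target suffix means we may
     * stop at whichever column scores best. */
    best = prev[0]; *t_end = 0;
    for (j = 1; j <= tlen; ++j)
        if (prev[j] > best) { best = prev[j]; *t_end = j; }
    free(prev); free(cur);
    return best;
}
```

With match 1, mismatch 3, and gap 5, the `GATTAC` query against `AAAAGATTACAAAAA` scores 6 and ends at target position 10 in this sketch.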

lh3 commented 6 years ago

Local alignment is ok. ksw_global is buggy and I have no plan to fix it. Don't use.

lh3 commented 6 years ago

Forgot to say – ksw2 has a global alignment implementation, which is more correct.

nh13 commented 6 years ago

Thanks Heng, this is great information. I will be porting over the global, extension, and glocal (an overloaded term) to use ksw2. The latter is the full query to a sub-sequence of the target, so I think I can just use global and trim off the start/end deletions.
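The trimming step described here can be sketched as follows. It assumes the global aligner returns a CIGAR packed as `length << 4 | op` with op codes 0 = M, 1 = I, 2 = D, where D consumes target bases; verify that against the aligner actually used. `trim_end_deletions` is a hypothetical helper, not a ksw2 function. It leaves the score unchanged, so the penalties for the trimmed gaps would still need to be added back, and the result is not guaranteed to equal a true glocal optimum.

```c
#include <assert.h>
#include <stdint.h>

enum { OP_M = 0, OP_I = 1, OP_D = 2 }; /* assumed op codes */

/* Hypothetical helper: skip leading and trailing deletions of a global
 * alignment. On return, ops [*first, *first + *n_kept) remain and they cover
 * the target interval [*t_start, *t_end). */
static void trim_end_deletions(int n_cigar, const uint32_t *cigar, int tlen,
                               int *first, int *n_kept,
                               int *t_start, int *t_end)
{
    int lo = 0, hi = n_cigar;
    assert(n_cigar >= 0 && tlen >= 0);
    *t_start = 0;
    *t_end = tlen;
    /* Leading deletions are target bases before the query begins. */
    while (lo < hi && (cigar[lo] & 0xfu) == (uint32_t)OP_D)
        *t_start += (int)(cigar[lo++] >> 4);
    /* Trailing deletions are target bases after the query ends. */
    while (hi > lo && (cigar[hi - 1] & 0xfu) == (uint32_t)OP_D)
        *t_end -= (int)(cigar[--hi] >> 4);
    *first = lo;
    *n_kept = hi - lo;
}
```

For a CIGAR of 4D6M5D over a 15-base target, this keeps only the 6M operation and reports the target interval [4, 10).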

I'll keep relying on klib's ksw_local until ksw2 has a local implementation.

Thanks for the quick responses and helpful insights as always.