Martinsos / opal

SIMD C/C++ library for massive optimal sequence alignment (local/SW, infix, overlap, global)
MIT License
35 stars 10 forks source link

scores are often incorrect #10

Closed tolot27 closed 10 years ago

tolot27 commented 10 years ago

By comparison with swipe I observed that the scores computed by swimd are often incorrect. Since I did a comparison between swipe and ssearch sometimes ago, I trust the scores of swipe.

Compare for instance the first sequence of ./test_data/db/uniprot_sprot15.fasta with the fourth sequence of the same file with blosum50 and gap open of 13 and gap extend of 2. Swipe returns a raw score of 41, whereas swimd returns 43.

./swimd_aligner -m Blosum50 -o 13 -e 2 ../test_data/db/uniprot_sprot1.fasta ../test_data/db/uniprot_sprot1_4.fasta

./swipe -M BLOSUM50 -G 13 -E 2 -i ../swimd/test_data/db/uniprot_sprot1.fasta -d ../swimd/test_data/db/uniprot_sprot1_4.fasta -m 7 -v 15 -b 15 -e 10000 | grep score

BTW: I've compared the scoring matrices and they are identical.

Martinsos commented 10 years ago

@tolot27 Thank you, I will give it a look!

tolot27 commented 10 years ago

I made some further comparison with ssearch, swipe, swimd and sswlib and the same parameters and matrices as above. Query is again the first sequence of ./test_data/db/uniprot_sprot15.fasta (sp|Q6GZX4|001R_FRG3G), database is again ./test_data/db/uniprot_sprot15.fasta. Differences are marked in boldface.

subject_id ssearch swipe swimd sswlib
sp|Q6GZX4|001R_FRG3G 1734 1734 1734 1734
sp|Q6GZX3|002L_FRG3G 45 45 45 45
sp|Q197F8|002R_IIV3 41 41 41 41
sp|Q197F7|003L_IIV3 41 41 43 43
sp|Q6GZX2|003R_FRG3G 38 38 38 38
sp|Q6GZX1|004R_FRG3G 24 24 24 24
sp|Q197F5|005L_IIV3 36 36 42 42
sp|Q6GZX0|005R_FRG3G 37 37 41 41
sp|Q91G88|006L_IIV6 51 51 56 56
sp|Q6GZW9|006R_FRG3G 33 33 35 35
sp|Q6GZW8|007R_FRG3G 37 37 37 37
sp|Q197F3|007R_IIV3 50 50 52 52
sp|Q197F2|008L_IIV3 37 37 37 37
sp|Q6GZW6|009L_FRG3G 40 40 44 44
sp|Q91G85|009R_IIV6 30 30 30 30

Interestingly, swimd and sswlib gave always the same score beside the fact that they use inter- vs. intra-sequence comparison.

I'll dig further.

Martinsos commented 10 years ago

Hm, very interesting, sounds like maybe it is not a bug. It is interesting that they are always giving smaller scores. Can we get alignment, or at least starting position, and see if they are all possible?

Martinsos commented 10 years ago

Oh yes, one more thing: did you check how gap opening and gap extension penalty are defined? Sometimes gap opening already includes first extension, and sometimes not. For example, for gap of length 3, gap opening penalty of 10 and gap extension of 1, it can be different result depending on previous definition: it can be 10 + 1 + 1 + 1 = 13 or 10 + 1 + 1 = 12.

tolot27 commented 10 years ago

Okay, found it. First of all, I accidentially used a hard-masked database with swipe. After using unmasked database, swipe returns almost the same scores as ssearch.

Then I checked swimd an fixed it as follows:

diff --git a/src/Swimd.cpp b/src/Swimd.cpp
index fcb68ac..54254e3 100644
--- a/src/Swimd.cpp
+++ b/src/Swimd.cpp
@@ -232,10 +232,10 @@ static int searchDatabaseSW_(unsigned char query[], int queryLength,
         // ----------------------- CORE LOOP (ONE COLUMN) ----------------------- //
         for (int r = 0; r < queryLength; r++) { // For each cell in column
             // Calculate E = max(lH-Q, lE-R)
-            __mxxxi E = SIMD::max(SIMD::sub(prevHs[r], Q), SIMD::sub(prevEs[r], R));
+            __mxxxi E = SIMD::sub(SIMD::max(SIMD::sub(prevHs[r], Q), prevEs[r]), R);

             // Calculate F = max(uH-Q, uF-R)
-            __mxxxi F = SIMD::max(SIMD::sub(uH, Q), SIMD::sub(uF, R));
+            __mxxxi F = SIMD::sub(SIMD::max(SIMD::sub(uH, Q), uF), R);

             // Calculate H
             __mxxxi H = SIMD::max(F, E);

Now, swimd returns the same scores than swipe and ssearch, at least for this 15 hits. Now, I'll check sswlib.

tolot27 commented 10 years ago

Fixed the sswlib too: https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library/issues/14

mengyao commented 10 years ago

Hi Mathias,

Thank you for your revision for sswlib. I was wondering would you mind to help to have a check of the following:

The gap opening penalty's definition of sswlib and ssearch are different. For example, if you set the gap opening parameter as 10, extension penalty as 2, when open a gap, sswlib use 10, while ssearch use 12. Please have a check of the parameter settings when you compare sswlib with ssearch.

If you do find they give different results, when they are using equivalent parameters, please let me know.

Many thanks,

Mengyao

tolot27 commented 10 years ago

I've checked the gap penalties again and you are right. The reason why the scores of sswlib and swimd differ from the scores of ssearch and swipe is the different use of the gap open penalty as you wrote. I even checked again the NCBI Blast C++ Toolkit. There, the opening penalty is also computed by adding the gap extension cost to the gap open cost - returning the same results as swipe and ssearch.

What was your and/or Martinsos decision to implement the gap penalties differently than Pearson, Altschul or Rognes - beside the fact that IMHO you implemented it as the natural understanding is. ;-)

BLAST and FASTA are de facto standard. Wouldn't it be better to keep compatibility to this standard - even if you have to give up the natural understanding of gap opening penalty? In this case, we have to rename the issues and mention compatibility instead of 'incorrect'. ;-)

BTW: Did you find a documentation of the penalty implementations in Blast or ssearch? It is neither explained in the programs command line help nor at their websites - as far as I found it.

mengyao commented 10 years ago

Dear Martinsos,

Thank you for bringing up this issue.

The two different definitions of gap open exist all the time, and there is no standard.

I think I would like to keep sswlib’s definition as its current status. I just added a note of this gap open definition in the README file.

It is good to make it clear to the users.

Best wishes,

Mengyao

On Jan 15, 2014, at 1:20 PM, Mathias Walter notifications@github.com wrote:

I've checked the gap penalties again and you are right. The reason why the scores of sswlib and swimd differ from the scores of ssearch and swipe is the different use of the gap open penalty as you wrote. I even checked again the NCBI Blast C++ Toolkit. There, the opening penalty is also computed by adding the gap extension cost to the gap open cost - returning the same results as swipe and ssearch.

What was your and/or Martinsos decision to implement the gap penalties differently than Pearson, Altschul or Rognes - beside the fact that IMHO you implemented it as the natural understanding is. ;-)

BLAST and FASTA are de facto standard. Wouldn't it be better to keep compatibility to this standard - even if you have to give up the natural understanding of gap opening penalty? In this case, we have to rename the issues and mention compatibility instead of 'incorrect'. ;-)

BTW: Did you find a documentation of the penalty implementations in Blast or ssearch? It is neither explained in the programs command line help nor at their websites - as far as I found it.

— Reply to this email directly or view it on GitHub.

Martinsos commented 10 years ago

Dear @mengyao and @tolot27, I am glad we solved this. In that case I will also keep the current definition since another one is not any better and some people are already using it as it is, so a change could introduce confusion! I will add additional comments to make this more obvious.

Martinsos commented 10 years ago

Added explanations in commit 91caf5aab8f7d10a5178e71c64a670c3aafd5783! Thanks @tolot27 again for pointing this out.