Closed tolot27 closed 10 years ago
@tolot27 Thank you, I will give it a look!
I made some further comparison with ssearch, swipe, swimd and sswlib and the same parameters and matrices as above. Query is again the first sequence of ./test_data/db/uniprot_sprot15.fasta
(sp|Q6GZX4|001R_FRG3G), database is again ./test_data/db/uniprot_sprot15.fasta
. Differences are marked in boldface.
subject_id | ssearch | swipe | swimd | sswlib |
---|---|---|---|---|
sp|Q6GZX4|001R_FRG3G | 1734 | 1734 | 1734 | 1734 |
sp|Q6GZX3|002L_FRG3G | 45 | 45 | 45 | 45 |
sp|Q197F8|002R_IIV3 | 41 | 41 | 41 | 41 |
sp|Q197F7|003L_IIV3 | 41 | 41 | 43 | 43 |
sp|Q6GZX2|003R_FRG3G | 38 | 38 | 38 | 38 |
sp|Q6GZX1|004R_FRG3G | 24 | 24 | 24 | 24 |
sp|Q197F5|005L_IIV3 | 36 | 36 | 42 | 42 |
sp|Q6GZX0|005R_FRG3G | 37 | 37 | 41 | 41 |
sp|Q91G88|006L_IIV6 | 51 | 51 | 56 | 56 |
sp|Q6GZW9|006R_FRG3G | 33 | 33 | 35 | 35 |
sp|Q6GZW8|007R_FRG3G | 37 | 37 | 37 | 37 |
sp|Q197F3|007R_IIV3 | 50 | 50 | 52 | 52 |
sp|Q197F2|008L_IIV3 | 37 | 37 | 37 | 37 |
sp|Q6GZW6|009L_FRG3G | 40 | 40 | 44 | 44 |
sp|Q91G85|009R_IIV6 | 30 | 30 | 30 | 30 |
Interestingly, swimd and sswlib gave always the same score beside the fact that they use inter- vs. intra-sequence comparison.
I'll dig further.
Hm, very interesting, sounds like maybe it is not a bug. It is interesting that they are always giving smaller scores. Can we get alignment, or at least starting position, and see if they are all possible?
Oh yes, one more thing: did you check how gap opening and gap extension penalty are defined? Sometimes gap opening already includes first extension, and sometimes not. For example, for gap of length 3, gap opening penalty of 10 and gap extension of 1, it can be different result depending on previous definition: it can be 10 + 1 + 1 + 1 = 13 or 10 + 1 + 1 = 12.
Okay, found it. First of all, I accidentially used a hard-masked database with swipe. After using unmasked database, swipe returns almost the same scores as ssearch.
Then I checked swimd an fixed it as follows:
diff --git a/src/Swimd.cpp b/src/Swimd.cpp
index fcb68ac..54254e3 100644
--- a/src/Swimd.cpp
+++ b/src/Swimd.cpp
@@ -232,10 +232,10 @@ static int searchDatabaseSW_(unsigned char query[], int queryLength,
// ----------------------- CORE LOOP (ONE COLUMN) ----------------------- //
for (int r = 0; r < queryLength; r++) { // For each cell in column
// Calculate E = max(lH-Q, lE-R)
- __mxxxi E = SIMD::max(SIMD::sub(prevHs[r], Q), SIMD::sub(prevEs[r], R));
+ __mxxxi E = SIMD::sub(SIMD::max(SIMD::sub(prevHs[r], Q), prevEs[r]), R);
// Calculate F = max(uH-Q, uF-R)
- __mxxxi F = SIMD::max(SIMD::sub(uH, Q), SIMD::sub(uF, R));
+ __mxxxi F = SIMD::sub(SIMD::max(SIMD::sub(uH, Q), uF), R);
// Calculate H
__mxxxi H = SIMD::max(F, E);
Now, swimd returns the same scores than swipe and ssearch, at least for this 15 hits. Now, I'll check sswlib.
Fixed the sswlib too: https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library/issues/14
Hi Mathias,
Thank you for your revision for sswlib. I was wondering would you mind to help to have a check of the following:
The gap opening penalty's definition of sswlib and ssearch are different. For example, if you set the gap opening parameter as 10, extension penalty as 2, when open a gap, sswlib use 10, while ssearch use 12. Please have a check of the parameter settings when you compare sswlib with ssearch.
If you do find they give different results, when they are using equivalent parameters, please let me know.
Many thanks,
Mengyao
I've checked the gap penalties again and you are right. The reason why the scores of sswlib and swimd differ from the scores of ssearch and swipe is the different use of the gap open penalty as you wrote. I even checked again the NCBI Blast C++ Toolkit. There, the opening penalty is also computed by adding the gap extension cost to the gap open cost - returning the same results as swipe and ssearch.
What was your and/or Martinsos decision to implement the gap penalties differently than Pearson, Altschul or Rognes - beside the fact that IMHO you implemented it as the natural understanding is. ;-)
BLAST and FASTA are de facto standard. Wouldn't it be better to keep compatibility to this standard - even if you have to give up the natural understanding of gap opening penalty? In this case, we have to rename the issues and mention compatibility instead of 'incorrect'. ;-)
BTW: Did you find a documentation of the penalty implementations in Blast or ssearch? It is neither explained in the programs command line help nor at their websites - as far as I found it.
Dear Martinsos,
Thank you for bringing up this issue.
The two different definitions of gap open exist all the time, and there is no standard.
I think I would like to keep sswlib’s definition as its current status. I just added a note of this gap open definition in the README file.
It is good to make it clear to the users.
Best wishes,
Mengyao
On Jan 15, 2014, at 1:20 PM, Mathias Walter notifications@github.com wrote:
I've checked the gap penalties again and you are right. The reason why the scores of sswlib and swimd differ from the scores of ssearch and swipe is the different use of the gap open penalty as you wrote. I even checked again the NCBI Blast C++ Toolkit. There, the opening penalty is also computed by adding the gap extension cost to the gap open cost - returning the same results as swipe and ssearch.
What was your and/or Martinsos decision to implement the gap penalties differently than Pearson, Altschul or Rognes - beside the fact that IMHO you implemented it as the natural understanding is. ;-)
BLAST and FASTA are de facto standard. Wouldn't it be better to keep compatibility to this standard - even if you have to give up the natural understanding of gap opening penalty? In this case, we have to rename the issues and mention compatibility instead of 'incorrect'. ;-)
BTW: Did you find a documentation of the penalty implementations in Blast or ssearch? It is neither explained in the programs command line help nor at their websites - as far as I found it.
— Reply to this email directly or view it on GitHub.
Dear @mengyao and @tolot27, I am glad we solved this. In that case I will also keep the current definition since another one is not any better and some people are already using it as it is, so a change could introduce confusion! I will add additional comments to make this more obvious.
Added explanations in commit 91caf5aab8f7d10a5178e71c64a670c3aafd5783! Thanks @tolot27 again for pointing this out.
By comparison with swipe I observed that the scores computed by swimd are often incorrect. Since I did a comparison between swipe and ssearch sometimes ago, I trust the scores of swipe.
Compare for instance the first sequence of ./test_data/db/uniprot_sprot15.fasta with the fourth sequence of the same file with blosum50 and gap open of 13 and gap extend of 2. Swipe returns a raw score of 41, whereas swimd returns 43.
BTW: I've compared the scoring matrices and they are identical.