I am trying to run diamond blastp to detect exact protein sequence matches. I am using a custom matrix that has 1 on the diagonal and -1 everywhere off the diagonal, with gap open penalty = 12 and gap extend penalty = 2. I am searching against a database built from UniRef100, excluding all clusters that have UniParc representatives. My test query is A0A8S9VRI7 with an N-terminal 6xHis-tag and TEV cleavage site (MHHHHHHENLYFQMDNNGVAKTL...). I get the following, somewhat inconsistent, results:
With the custom matrix, and with masking and comp-based stats disabled (--masking none --comp-based-stats 0), I don't get any matches.
The same diamond settings with different queries return the correct matches.
With masking and comp-based stats disabled, but with BLOSUM62 and default penalties, the correct sequence is matched.
With the custom matrix and penalties, but without disabling masking and comp-based stats, the correct sequence is also matched.
I am not sure whether this has something to do with the query being quite long (1714 aa) or containing repetitive regions (runs of asparagines or lysines). Do I need to specify any extra settings to make the Diamond statistics work properly with my custom matrix? Is there anything else I might be doing wrong?
A different example, W7JKY7, also does not match the correct sequence with the custom matrix, even without disabling masking and comp-based stats. The correct hit is found with BLOSUM62.
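As a sanity check on the statistics, the ungapped lambda for a +1/-1 matrix can be estimated independently of Diamond. The following is only a minimal sketch: it assumes uniform amino acid background frequencies (so the probability of a chance match is 20 * (1/20)^2 = 0.05), which is not necessarily what Diamond uses, and the function name `ungapped_lambda` is made up for illustration. It solves p_match * exp(lambda) + (1 - p_match) * exp(-lambda) = 1 for the positive root by bisection.

```python
import math


def ungapped_lambda(p_match, match_score=1, mismatch_score=-1, iterations=200):
    """Positive root of the ungapped Karlin-Altschul equation for a two-valued matrix.

    p_match is the background probability that two random residues are identical.
    """
    expected = p_match * match_score + (1.0 - p_match) * mismatch_score
    if expected >= 0:
        raise ValueError("expected score must be negative for lambda to exist")

    def f(lam):
        return (p_match * math.exp(lam * match_score)
                + (1.0 - p_match) * math.exp(lam * mismatch_score)
                - 1.0)

    lo, hi = 1e-6, 50.0  # f(lo) < 0 and f(hi) > 0 bracket the positive root
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Assumption: uniform frequencies over 20 amino acids, so p_match = 0.05.
    print(round(ungapped_lambda(0.05), 4))
```

Under the uniform assumption the root is ln(19), roughly 2.944. The value reported by Diamond may well differ, since real background frequencies are not uniform and gapped parameters are estimated differently, so this is only a rough reference point for comparison.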
For completeness, here is the diamond command line:
Hello.
Thank you for the great package!
Lambda and kappa printed by Diamond:
My custom matrix:
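For anyone who wants to reproduce a matrix with this +1/-1 structure, a short script along the following lines could generate one. This is an illustrative sketch, not necessarily the exact file used in the runs above: the 20-letter alphabet and its order, the whitespace-separated NCBI-style layout, and the function name `identity_matrix_text` are all assumptions, and ambiguity codes such as B, Z and X are left out.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # assumed alphabet and order


def identity_matrix_text(alphabet=AMINO_ACIDS, match_score=1, mismatch_score=-1):
    """Return a scoring matrix as text: match_score on the diagonal, mismatch_score elsewhere."""
    # Header row: two leading spaces, then one right-aligned column label per residue.
    lines = ["  " + " ".join(f"{c:>2}" for c in alphabet)]
    for a in alphabet:
        scores = (match_score if a == b else mismatch_score for b in alphabet)
        # Each data row starts with its residue label, followed by the scores.
        lines.append(f"{a} " + " ".join(f"{s:>2}" for s in scores))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(identity_matrix_text(), end="")
```

Redirecting the output to a file gives a 20 x 20 matrix that can be compared against the one actually used.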
Thanks in advance.