Hit not found for certain queries with custom 1/-1 matrix

tanhevg commented 4 months ago

Hello.

Thank you for the great package!

I am trying to run diamond blastp for detecting exact protein sequence matches. I am using a custom matrix that has 1's on the diagonal and -1's off diagonal. Gap open penalty = 12; gap extend = 2. I am running against a database built from UniRef100, excluding all clusters that have Uniparc representatives. I am using A0A8S9VRI7 as a test query sequence, with an N-terminal 6xHis-tag and TEV cleavage site (MHHHHHHENLYFQMDNNGVAKTL...). I get the following, somewhat inconsistent, results:

With the custom matrix, and masking and comp-based-stats disabled (--masking none --comp-based-stats 0) I don't get any matches.
Same diamond settings with different queries returns correct matches.
Masking and comp-based-stats disabled with BLOSUM62 and default penalties, matches the correct sequence.
Custom matrix and penalties, without disabling masking and comp based stats, also matches the correct sequence.

Not sure if this has something to do with the query being quite long (1714aa), or containing repetitive regions (sequences of Asparagines or Lysines). Do I need to specify any extra settings to make Diamond statistics work properly with my custom matrix? Anything else I might be doing wrong?

A different example, W7JKY7, also does not match the correct sequence with the custom matrix, even without disabling masking and comp-based stats. The correct hit is found with BLOSUM62.

For completeness, here is the diamond command line:

diamond blastp -q A0A8S9VRI7.fasta \
    -d ~/diamond_db/uniref100_no_uniparc.dmnd -o A0A8S9VRI7.tsv \
    --custom-matrix diamond_matrix.txt --gapopen 12 --gapextend 2 \
    --header --tmpdir /dev/shm --fast -b 40 --max-target-seqs 100 \
    --evalue 1e-6 --id 80 --max-hsps 10 --masking none --comp-based-stats 0 \
    --threads 48

Lambda and kappa printed by Diamond:

Scoring parameters: (Matrix=custom Lambda=2.77945 K=0.829449 Penalties=12/2)

My custom matrix:

   A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y  X  Z
A  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
C -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
D -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
E -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
F -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
G -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
H -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
I -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
K -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
L -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
M -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
N -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
P -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1
Q -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1
R -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1
S -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1
T -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1
V -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1
W -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1
Y -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1
Z -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1

Thanks in advance.

tanhevg commented 4 months ago

somehow it appears that using --no-ranking solves this problem

bbuchfink commented 2 weeks ago

Ok, let me know if you still need any help.

bbuchfink / diamond

Hit not found for certain queries with custom 1/-1 matrix #819