bbuchfink / diamond

Accelerated BLAST compatible local sequence aligner.
GNU General Public License v3.0
1.06k stars 182 forks source link

Hit not found for certain queries with custom 1/-1 matrix #819

Open tanhevg opened 4 months ago

tanhevg commented 4 months ago

Hello.

Thank you for the great package!

I am trying to run diamond blastp for detecting exact protein sequence matches. I am using a custom matrix that has 1's on the diagonal and -1's off diagonal. Gap open penalty = 12; gap extend = 2. I am running against a database built from UniRef100, excluding all clusters that have Uniparc representatives. I am using A0A8S9VRI7 as a test query sequence, with an N-terminal 6xHis-tag and TEV cleavage site (MHHHHHHENLYFQMDNNGVAKTL...). I get the following, somewhat inconsistent, results:

Not sure if this has something to do with the query being quite long (1714aa), or containing repetitive regions (sequences of Asparagines or Lysines). Do I need to specify any extra settings to make Diamond statistics work properly with my custom matrix? Anything else I might be doing wrong?

A different example, W7JKY7, also does not match the correct sequence with the custom matrix, even without disabling masking and comp-based stats. The correct hit is found with BLOSUM62.

For completeness, here is the diamond command line:

diamond blastp -q A0A8S9VRI7.fasta \
    -d ~/diamond_db/uniref100_no_uniparc.dmnd -o A0A8S9VRI7.tsv \
    --custom-matrix diamond_matrix.txt --gapopen 12 --gapextend 2 \
    --header --tmpdir /dev/shm --fast -b 40 --max-target-seqs 100 \
    --evalue 1e-6 --id 80 --max-hsps 10 --masking none --comp-based-stats 0 \
    --threads 48

Lambda and kappa printed by Diamond:

Scoring parameters: (Matrix=custom Lambda=2.77945 K=0.829449 Penalties=12/2)

My custom matrix:

   A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y  X  Z
A  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
C -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
D -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
E -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
F -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
G -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
H -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
I -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
K -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
L -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
M -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
N -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
P -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1
Q -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1
R -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1
S -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1
T -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1
V -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1
W -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1
Y -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1
Z -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1

Thanks in advance.

tanhevg commented 4 months ago

somehow it appears that using --no-ranking solves this problem

bbuchfink commented 2 weeks ago

Ok, let me know if you still need any help.