bbuchfink / diamond

Accelerated BLAST compatible local sequence aligner.
GNU General Public License v3.0
1.04k stars 182 forks source link

diamond blastx doesn't always return highest scoring hit with -k 1 #616

Open dspeth opened 2 years ago

dspeth commented 2 years ago

the issue

in my hands, diamond blastx doesn't always return the highest scoring hit with the -k 1 flag set.

background

When using DIAMOND for searching illumina read datasets for matches against protein families of interest, I tend to use a "alignment score ratio" approach. In my implementation of that approach, I use the reads as a query against a database containing sequences of a protein of interest, and then use the reads with a match in the first search to search a the protein complement of the GTDB species set. The alignment score ratio is then the ratio between those two scores.

The database of the protein of interest only contains sequences that are also in the protein complement of the GTDB species set, as that is the source of the sequences in the database. In theory, that means the maximum ratio between the two should be 1, but I observe ratios above 1.
In essence, when searching against a small dataset I find (in some cases) better scoring hits than when searching against a very large dataset (that contains the smaller dataset entirely) with the same reads. As a workaround I have increased the max targets using -k 50 and then picking the highest scoring hit of the 50, but there are still cases where the same hit is not within the top 50 of the hits against the larger database.

mock command

Here's an example of the command as I run it for both searches: diamond blastx -q read_dataset -d protein_db -o outfile -p 50 -k 1 --outfmt 6 --matrix blosum45 --gapopen 14 --gapextend 2 --masking 0 --min-score 10 --comp-based-stats 0 -b 12 -c 1

version etc

v2.0.14.152 Happy to provide output files if that'd help diagnose

visual example

Here's a visual depiction of the issue. The x axis is the score against the (large) outgroup database, the y axis is the score against the database of the protein of interest. Each dot is a read, color of the dot represents sequence identity against the database of the protein of interest> image

bbuchfink commented 2 years ago

One reason could be the ranking heuristic, which could be more pronounced if you look for hits down to 10 bitscore. Can you try using --no-ranking?

dspeth commented 2 years ago

didn't have time to run tests today, will get back to you in a few days

dspeth commented 2 years ago

Here's a different dataset (forward reads from SRR2605792) used as query to search a nirK protein database. The axes show the raw score, not the bit score.

with --no-ranking flag set :

Screenshot 2022-10-09 at 17 55 47

without --no-ranking flag set:

Screenshot 2022-10-09 at 17 56 25

I've also tried to up the --min-score parameter to 100 resulting in these two plots. with --no-ranking flag set: plot100_norank.pdf

without --no-ranking flag set: plot100.pdf

dspeth commented 2 years ago

setting the --no-ranking parameter did not change anything, and setting a much higher minimum score also did not resolve the issue.

bbuchfink commented 2 years ago

I did a quick test aligning a sample of 10,000 reads from the SRA dataset you mentioned against the NR, using -k1 and using -k0 (report all hits). The top hit was different for 2/10,000 reads, so I can't reproduce the problem (unless that is the ratio that you also observe). Could you make an example of a query with a better hit against the subdatabase that is not found on the larger database?