Open dspeth opened 2 years ago
One reason could be the ranking heuristic, which could be more pronounced if you look for hits down to 10 bitscore. Can you try using --no-ranking
?
didn't have time to run tests today, will get back to you in a few days
Here's a different dataset (forward reads from SRR2605792) used as query to search a nirK protein database. The axes show the raw score, not the bit score.
with --no-ranking
flag set :
without --no-ranking
flag set:
I've also tried to up the --min-score
parameter to 100 resulting in these two plots.
with --no-ranking
flag set:
plot100_norank.pdf
without --no-ranking
flag set:
plot100.pdf
setting the --no-ranking
parameter did not change anything, and setting a much higher minimum score also did not resolve the issue.
I did a quick test aligning a sample of 10,000 reads from the SRA dataset you mentioned against the NR, using -k1
and using -k0
(report all hits). The top hit was different for 2/10,000 reads, so I can't reproduce the problem (unless that is the ratio that you also observe). Could you make an example of a query with a better hit against the subdatabase that is not found on the larger database?
the issue
in my hands, diamond blastx doesn't always return the highest scoring hit with the
-k 1
flag set.background
When using DIAMOND for searching illumina read datasets for matches against protein families of interest, I tend to use a "alignment score ratio" approach. In my implementation of that approach, I use the reads as a query against a database containing sequences of a protein of interest, and then use the reads with a match in the first search to search a the protein complement of the GTDB species set. The alignment score ratio is then the ratio between those two scores.
The database of the protein of interest only contains sequences that are also in the protein complement of the GTDB species set, as that is the source of the sequences in the database. In theory, that means the maximum ratio between the two should be 1, but I observe ratios above 1.
In essence, when searching against a small dataset I find (in some cases) better scoring hits than when searching against a very large dataset (that contains the smaller dataset entirely) with the same reads. As a workaround I have increased the max targets using
-k 50
and then picking the highest scoring hit of the 50, but there are still cases where the same hit is not within the top 50 of the hits against the larger database.mock command
Here's an example of the command as I run it for both searches: diamond blastx -q read_dataset -d protein_db -o outfile -p 50 -k 1 --outfmt 6 --matrix blosum45 --gapopen 14 --gapextend 2 --masking 0 --min-score 10 --comp-based-stats 0 -b 12 -c 1
version etc
v2.0.14.152 Happy to provide output files if that'd help diagnose
visual example
Here's a visual depiction of the issue. The x axis is the score against the (large) outgroup database, the y axis is the score against the database of the protein of interest. Each dot is a read, color of the dot represents sequence identity against the database of the protein of interest>