bbuchfink / diamond

Accelerated BLAST compatible local sequence aligner.
GNU General Public License v3.0
1.04k stars 182 forks source link

Inconsistency hit from several run #265

Closed StromTroopers closed 5 years ago

StromTroopers commented 5 years ago

Hi, I'm performing a diamond blast run.

Everything is working very well but I cannot find all the hits I find with blast+. I have a fasta file composed with 110 protein sequences, then I made a db with it.

When I use this database, I find all expected hit with blast+ instead of one. And when I juste put this sequence into the db alone and that I make a db with this sequence, then I fond the hit.

Here are the settings I used: diamond blastx -d seq -q genome_fasta.fa -o matches.m8 --more-sensitive --outfmt 6 --max-target-seqs 0 I do not really understand why I did not find the hit from the first blast since the sequence was present in this case as well ? The protein sequence in question is around 106aa long. Thank for you help.

bbuchfink commented 5 years ago

Hi, could you send me the files to reproduce this problem by email (buchfink@gmail.com) and also the exact command lines?

StromTroopers commented 5 years ago

It's done thank you.

bbuchfink commented 5 years ago

It's because the e-value depends on the size of the database. Consider this alignment against a single sequence db:

scaffold_159 YP_009345676.1 21.5 107 78 1 88433 88753 2 102 5.1e-05 38.1

The same alignment against a db of 108 sequences:

scaffold_159 YP_009345676.1 21.5 107 78 1 88433 88753 2 102 1.4e-02 38.1

As you can see, it now has an e-value of 1.4e-02 which is above the default e-value cutoff (1e-03).

StromTroopers commented 5 years ago

Ok I see what it is about now thank you for your time.

bbuchfink commented 5 years ago

The low complexity masking could be another reason for not finding a hit. You can try to run Diamond with the option --masking 0.

Am Fr., 8. Feb. 2019 um 14:41 Uhr schrieb Grendel26 < notifications@github.com>:

I had another question if you do not mind. It seems that even using a value e to 10, I do not find a hit for this protein. The length of this protein is about 100 aa, so it is relatively small compared to the ones I found. So I used the more sensitive mode, but that does not seem to be enough to find this a hit for this protein. What kind of settings do you advice me to use to treat small protein subjects in my database? Thank you again for you help.

bbuchfink commented 5 years ago

Sorry but I'm no expert on this issue of different substitution matrices, I'm afraid I can't offer you an insight there. You have to try what works best for your data.

Am Sa., 9. Feb. 2019 um 11:31 Uhr schrieb Grendek <notifications@github.com

:

-e 200\ #In order to get all hits --matrix BLOSUM80 --max-target-seqs 0 --outfmt 6 qseqid slen sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore I had some issues to find one hit "YP_009345676.1" which is quite small compare to the others. I can find it when I set the value threshold very high as you said. So I wondered if it was not also a question of substitutions matrix, that is why I checked the difference between all the matrix available and here are the results:

'BLOSUM 65 scaffold_159 106 YP_009345676.1 21.5 107 78 1 88433 88753 2 102 1.7e+01 38.1'

BLOSUM 45 scaffold_159 106 YP_009345676.1 21.5 107 78 1 88433 88753 2 102 9.8e+01 35.6

PAM250 scaffold_159 106 YP_009345676.1 21.5 107 78 1 88433 88753 2 102 1.2e+02 35.3

BLOSUM 90 scaffold_159 106 YP_001426562.1 71.4 14 4 0 401993 401952 19 32 1.8e+01 38.0

BLOSUM 80 scaffold_159 106 YP_001426562.1 71.4 14 4 0 401993 401952 19 32 2.0e+01 37.9

PAM70 scaffold_159 106 YP_001426562.1 71.4 14 4 0 401993 401952 19 32 1.5e+00 41.7

PAM30 scaffold_159 106 YP_001426562.1 71.4 14 4 0 401993 401952 19 32 2.4e-01 44.3

So I wondered, since the PAM30 displays the best evalue for the protein "YP_001426562.1", it was the best matrix to use for my data.

It is weird because I was expected to find a more relevant result with a matrix for more distantly related species such as BLOSUM45 (but here the e-value is even higher) because I'm working on virus proteins and they evolve very much quickly...

Do you know what are the best options in order to find this hit and keep a correct e-value even if the db is big? Is is because of the size of the protein?

Thank you.