bbuchfink / diamond

Accelerated BLAST compatible local sequence aligner.
GNU General Public License v3.0
1.07k stars 182 forks source link

random output using long read mode #728

Open xinehc opened 1 year ago

xinehc commented 1 year ago

Hi,

I noticed that sometime the results is not reproducible using the long read mode, I wonder is it related to the internal seed or tie breaking?

Here is a minimal example: diamond blastx \ --db protein.txt \ --query query.txt \ --outfmt 6 qseqid sseqid pident length qlen qstart qend slen sstart send evalue bitscore \ --range-culling -F 15 \ --max-target-seqs 25 --max-hsps 0 \ --threads 16

First run:

JAHHUZ010000004.1   WP_291717584.1  89.9    119 158376  144690  145046  119 1   119 1.31e-68    216
JAHHUZ010000004.1   WP_106008235.1  89.1    119 158376  13687   13331   119 1   119 1.78e-68    216

Second run:

JAHHUZ010000004.1   WP_291717584.1  89.9    119 158376  13687   13331   119 1   119 1.31e-68    216
JAHHUZ010000004.1   WP_106008235.1  89.1    119 158376  13687   13331   119 1   119 1.78e-68    216

Third run:

JAHHUZ010000004.1   WP_291717584.1  89.9    119 158376  144690  145046  119 1   119 1.31e-68    216
JAHHUZ010000004.1   WP_106008235.1  89.1    119 158376  144690  145046  119 1   119 1.78e-68    216

protein.txt query.txt

Update: this seem to be a corner case, the reverse complement of the sequence is the same as the original sequence:

seqkit grep -p 'JAHHUZ010000004.1' query.txt > contig.txt
seqtk seq contig.txt > forward.txt
seqtk seq -r contig.txt > reverse.txt
md5sum forward.txt reverse.txt

89c390f039087b7265e794b35912e0ba  forward.txt
89c390f039087b7265e794b35912e0ba  reverse.txt
bbuchfink commented 1 year ago

There seems to be a non-determinism in reporting equally scoring hits. I will fix this when I get the chance.