bbuchfink / diamond

Accelerated BLAST compatible local sequence aligner.
GNU General Public License v3.0
1.06k stars 182 forks

How many processing query blocks are there in total? #598

Open ZiliaMR opened 2 years ago

ZiliaMR commented 2 years ago

Hello,

I am running diamond with these parameters:

for seq in $(cat list_t.txt)
do
    echo "$seq"
    diamond blastx -d "$DIAMOND_NR" -q "fastq-join/${seq}_merged.fastq" -o "${seq}.daa" -f 100
done

and in the terminal I see this:

#CPU threads: 20
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Temporary directory:
#Target sequences to report alignments for: 25
Opening the database...  [0.41s]
Database: /sw/data/diamond_databases/Blast/latest/nr.dmnd (type: Diamond database, sequences: 488073487, letters: 188883749278)
Block size = 2000000000
Opening the input file...  [0.311s]
Opening the output file...  [0.001s]
Loading query sequences...  [25.424s]
Masking reference...  [16.125s]
Initializing dictionary...  [0.009s]
Initializing temporary storage...  [0.009s]
Building reference histograms...  [17.566s]
Allocating buffers...  [0s]
Processing query block 1, reference block 1/95, shape 1/2, index chunk 1/4.
Building reference seed array...  [5.396s]
Building query seed array...  [4.49s]
Computing hash join...  [4.521s]
Masking low complexity seeds...  [1.689s]
Searching alignments...  [27.369s]
Processing query block 1, reference block 1/95, shape 1/2, index chunk 2/4.
Building reference seed array...  [6.79s]

... Processing query block 1 took about one day, and I see that it has just started on query block 2. My question is: how many query blocks will there be in total? How can I know this?

I would appreciate your help with this question.

Processing query block 2, reference block 13/95, shape 1/2, index chunk 1/4.
Building reference seed array...  [5.431s]
Building query seed array...  [4.65s]
Computing hash join...  [5.072s]
Masking low complexity seeds...  [1.868s]
Searching alignments...  [28.261s]
Processing query block 2, reference block 13/95, shape 1/2, index chunk 2/4.
Building reference seed array...  [6.71s]
Building query seed array...  [5.689s]
Computing hash join...  [4.538s]
Masking low complexity seeds...  [1.597s]

thanks in advance

bbuchfink commented 2 years ago

To compute the number of query blocks, take the number of DNA letters in the input file, multiply by 2, and divide by the block size (2000000000 in your case).
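As a rough sketch of this calculation, using the numbers given later in this thread (the input file itself is not read here; the letter count is hard-coded):

```shell
# Estimate of DIAMOND query blocks, using the numbers from this thread.
letters=7696101750        # 51,307,345 reads x 150 bp
block_size=2000000000     # "Block size = 2000000000" from the log above
# blocks = ceil(letters * 2 / block_size)
blocks=$(( (letters * 2 + block_size - 1) / block_size ))
echo "about $blocks query blocks"   # -> about 8 query blocks
```

Rounding up gives the number of blocks DIAMOND will actually process, since a partial block still counts as one pass.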

ZiliaMR commented 2 years ago

OMG, that means that in my case my analysis has 7.69 query blocks.

My library has 51,307,345 sequences × 150 bp = 7,696,101,750 letters (bp), so 7,696,101,750 × 2 / 2,000,000,000 = 7.69 query blocks.

If my calculations are correct, my analysis will take about 6.9 days to process, right?

I will have to adjust some parameters to speed up the analysis process.

bbuchfink commented 2 years ago

Yes, that seems correct. The easiest way to reduce runtime would be to use a smaller database, if that works for you, e.g. UniRef50 or AnnoTree; see here: https://journals.asm.org/doi/full/10.1128/msystems.01408-21

To make diamond faster you can use -c1 and increase the block size, e.g. -b4 or -b6, but this will also increase memory use.

Using global ranking can also help, e.g. -g300 combined with -f 6 qseqid sseqid evalue, to save time on traceback.
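Putting these suggestions together, a hypothetical invocation might look like the sketch below. The file names are placeholders, not taken from the thread, and -c1/-b6 trade memory for speed, so check your RAM budget first:

```shell
# Hypothetical command combining the flags suggested above.
# nr.dmnd / reads.fastq / out.tsv are placeholder names.
#   -f 6 qseqid sseqid evalue : minimal tabular output, cheap traceback
#   -c1                       : one index chunk per block (more RAM)
#   -b6                       : larger block size, in billions of letters (more RAM)
#   -g300                     : global ranking, keeping the top 300 targets
diamond blastx -d nr.dmnd -q reads.fastq -o out.tsv \
    -f 6 qseqid sseqid evalue -c1 -b6 -g300
```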

ZiliaMR commented 2 years ago

Thanks for your help.

I am running my analysis using -c4 and -b8, and it seems to have reduced the time to 1.8 days per library :). I will look into your other suggestions too.

(I tried -c1, but I do not have sufficient memory, and I cannot apply for more memory this month at my institution :( ).