bioinformatics-centre / kaiju

Fast taxonomic classification of metagenomic sequencing reads using a protein reference database
http://kaiju.binf.ku.dk
GNU General Public License v3.0

Slow reading nr database #31

Closed by davidvilanova 7 years ago

davidvilanova commented 7 years ago

Hi, I have downloaded the latest indexes from the Kaiju website and am running paired-end classification of one sample's reads. It's strange, but it takes almost two hours to finish reading the database files (see the timestamps below). I'm running Kaiju on a Linux cluster with SGE, setting the memory and thread parameters as follows: -l h_vmem=60G and -pe parallel_smp 10.

Why is this taking so long?

kaiju -v -z 10 -a greedy -m 5 -s 70 -x -t /db/kaiju/nodes.dmp -f /db/kaiju/kaiju_db_nr_euk.fmi -i 1.contigs2reads.R1.clean.fq -j 1.contigs2reads.R2.clean.fq -o 1.out.ncbi_id ;

11:55:10 Reading database
 Reading taxonomic tree from file /db/kaiju/nodes.dmp
 Reading index from file /db/kaiju/kaiju_db_nr_euk.fmi
Output file: files/taxonomy_reads_mapped_to_contigs//1/1.1.contigs2reads.R2.fq.ncbi_id
13:50:27 Start classification using 10 threads.
13:56:45 Finished.
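
To put numbers on the timestamps above, here is a minimal Python sketch that computes how long each phase took. The log text is copied from the timestamped lines of the output above; `phase_durations` is a hypothetical helper name and not part of Kaiju.

```python
from datetime import datetime, timedelta

# Timestamped lines copied from the verbose output above.
LOG = """\
11:55:10 Reading database
13:50:27 Start classification using 10 threads.
13:56:45 Finished.
"""

def phase_durations(log_text):
    """Return (label, seconds until the next timestamped line) pairs."""
    stamped = []
    for line in log_text.splitlines():
        parts = line.strip().split(" ", 1)
        try:
            t = datetime.strptime(parts[0], "%H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with HH:MM:SS
        stamped.append((t, parts[1] if len(parts) > 1 else ""))
    result = []
    for (t0, label), (t1, _) in zip(stamped, stamped[1:]):
        delta = t1 - t0
        if delta < timedelta(0):
            delta += timedelta(days=1)  # assume the run crossed midnight once
        result.append((label, int(delta.total_seconds())))
    return result

for label, secs in phase_durations(LOG):
    print(f"{label}: {secs} s")
```

For this log, the database loading phase accounts for 6917 s (about 1 h 55 min), while the classification itself takes 378 s (about 6 min).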

pmenzel commented 7 years ago

Hi,

Yeah, it should not take 2 hours to read that file from disk.

In my tests, it takes 2 minutes on a normal server where the file is on a local RAID 5 array of three disks. Also, you have requested 600 GB of RAM (60G per slot across 10 slots), so that should be enough. How fast is it if you just copy the .fmi file within that filesystem?
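
If copying the file is not convenient, a rough alternative is to time a plain sequential read of it. The following is a minimal Python sketch, not part of Kaiju; `read_throughput` is a hypothetical helper name and the 16 MiB chunk size is an arbitrary assumption.

```python
import time

def read_throughput(path, chunk_size=16 * 1024 * 1024):
    """Read `path` sequentially; return (bytes_read, seconds, MiB_per_second)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 gives a raw file object, so each read() maps to one OS read.
    with open(path, "rb", buffering=0) as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break  # empty result means end of file
            total += len(chunk)
    elapsed = time.perf_counter() - start
    rate = total / (1024 * 1024) / elapsed if elapsed > 0 else float("inf")
    return total, elapsed, rate

# Example call, using the index path from the command in this thread:
# n, secs, rate = read_throughput("/db/kaiju/kaiju_db_nr_euk.fmi")
# print(f"{n} bytes in {secs:.1f} s ({rate:.1f} MiB/s)")
```

Note that a second run on the same node may be served from the operating system's page cache and look much faster than the first, so the first measurement is the one to compare against the load time in the log.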

pmenzel commented 7 years ago

Btw, if you use -a greedy, you also need to set -e to a value higher than 0; otherwise no substitutions will be allowed.
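
As an illustration of that rule, here is a small Python sketch that checks a Kaiju command line before submitting it. It treats a missing -e as 0, following the comment above; the default may differ between Kaiju versions, so check the help output of your build. `greedy_without_mismatches` is a hypothetical helper name, and the value 3 in the usage lines is only an example.

```python
import shlex

def greedy_without_mismatches(command):
    """Return True if `command` uses -a greedy but leaves -e unset or at 0."""
    args = shlex.split(command)

    def value_of(flag):
        # Return the token following `flag`, or None if the flag is absent.
        if flag in args:
            i = args.index(flag)
            if i + 1 < len(args):
                return args[i + 1]
        return None

    if value_of("-a") != "greedy":
        return False  # the rule only applies to greedy mode
    e = value_of("-e")
    if e is None:
        return True  # assumption from the comment above: unset behaves like 0
    try:
        return int(e) <= 0
    except ValueError:
        return True  # a non-numeric -e value is also worth flagging

# Command taken from the original post in this thread.
CMD = ("kaiju -v -z 10 -a greedy -m 5 -s 70 -x -t /db/kaiju/nodes.dmp "
       "-f /db/kaiju/kaiju_db_nr_euk.fmi -i 1.contigs2reads.R1.clean.fq "
       "-j 1.contigs2reads.R2.clean.fq -o 1.out.ncbi_id")

print(greedy_without_mismatches(CMD))            # original command
print(greedy_without_mismatches(CMD + " -e 3"))  # same command with -e added
```

The first call flags the original command; the second, with -e added, passes the check.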

davidvilanova commented 7 years ago

OK, thanks. I will talk to our Linux admin to see if we can fix this issue.

davidvilanova commented 7 years ago

OK, I figured out what was happening. Our cluster mixes Intel and AMD processors behind the SGE queuing system. After restricting the job to the Intel nodes ("-hard -l myarch=intel"), everything works fine.
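
For anyone debugging a similar mixed cluster, it can help to record which CPU vendor a job landed on. The Python sketch below reads the `vendor_id` field from /proc/cpuinfo on Linux. `myarch` above is a resource defined on that particular cluster, so this sketch does not rely on it; `cpu_vendor` and `local_cpu_vendor` are hypothetical helper names.

```python
def cpu_vendor(cpuinfo_text):
    """Map the first vendor_id entry in /proc/cpuinfo text to a short name."""
    known = {"GenuineIntel": "intel", "AuthenticAMD": "amd"}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "vendor_id":
            vendor = value.strip()
            return known.get(vendor, vendor)  # pass unknown vendors through
    return "unknown"

def local_cpu_vendor(path="/proc/cpuinfo"):
    """Report the vendor of the current node, or 'unknown' if unreadable."""
    try:
        with open(path) as fh:
            return cpu_vendor(fh.read())
    except OSError:
        return "unknown"  # e.g. non-Linux systems without /proc/cpuinfo

# Print the vendor at job start so slow runs can be matched to node type.
print(local_cpu_vendor())
```

Logging this line at the start of each job makes it easy to see whether the slow database loads line up with one processor type.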