bioinformatics-centre / kaiju

Fast taxonomic classification of metagenomic sequencing reads using a protein reference database
http://kaiju.binf.ku.dk
GNU General Public License v3.0

Segfault using proGenomes #12

Closed koopkaup closed 7 years ago

koopkaup commented 7 years ago

Hi,

When I use proGenomes as the reference, running kaiju gives me a segfault (with both the greedy and MEM options) within seconds. I ran it on an HPC cluster using 32 GB RAM and 8 cores. My sample files are Illumina 2x150 bp paired-end FASTA files (about 2.5 GB each).

24855 Segmentation fault kaiju -t ~/kaiju/DB/nodes.dmp -f ~/kaiju/DB/kaiju_db.fmi -i trimmed-fasta/K1.R1.trimmed.fasta -j trimmed-fasta/K1.R2.trimmed.fasta -o K1.kaiju.tax.out -z 8 -a greedy -e 10

However, when I used RefSeq as the reference, it worked. Any ideas what the problem could be, or should I change some parameters?

Thanks

pmenzel commented 7 years ago

Hi, hm, that's a bit hard to debug from here. It was working fine for me, but I will test it again.

Did makeDB.sh run through without any errors, and is kaiju_db.fmi around 12 GB?

You could try -z 4 instead of -z 8. The reason is that Kaiju uses 9 threads in total when you specify -z 8 (8 threads for classification and one for reading the input), and some HPC queuing systems will terminate a job when it exceeds its allotted resources.
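The thread arithmetic above can be sketched in Python. This is only an illustration, not part of Kaiju: the helper names `classification_threads` and `kaiju_command` are hypothetical, the flags are the ones from the command in this issue, and it assumes the scheduler counts every thread against the job's core allotment.

```python
def classification_threads(allotted_cores: int) -> int:
    """Largest -z value that keeps the total thread count within the allotment.

    Per the reply above, Kaiju runs the -z classification threads plus one
    extra thread for reading the input.
    """
    if allotted_cores < 2:
        raise ValueError("need at least 2 cores: 1 reader + 1 classifier")
    return allotted_cores - 1


def kaiju_command(nodes, fmi, reads1, reads2, out, allotted_cores):
    """Build the argument list for a paired-end run (hypothetical helper)."""
    z = classification_threads(allotted_cores)
    return ["kaiju", "-t", nodes, "-f", fmi, "-i", reads1, "-j", reads2,
            "-o", out, "-z", str(z)]


if __name__ == "__main__":
    cmd = kaiju_command("nodes.dmp", "kaiju_db.fmi",
                        "K1.R1.trimmed.fasta", "K1.R2.trimmed.fasta",
                        "K1.kaiju.tax.out", allotted_cores=8)
    print(" ".join(cmd))
```

With 8 allotted cores this yields -z 7. A lower value such as the suggested -z 4 simply leaves more headroom.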

(Also, -e 10 seems a bit much for this kind of short read.)

Peter

koopkaup commented 7 years ago

Yes, the problem was with the index file. Although makeDB.sh ran without any errors, the .fmi file was only about 3 GB because the disk ran out of space. Would it be possible to add a check to the index-making script so that it gives an error when there is no space left on the disk?
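Until such a check exists, a user-side sanity check along these lines may help. It is a minimal sketch, not part of makeDB.sh: the function names are hypothetical, and the size thresholds are values the user has to choose (for proGenomes, the thread mentions an index of roughly 12 GB).

```python
import os
import shutil


def free_bytes(directory: str) -> int:
    """Free space, in bytes, on the filesystem that holds `directory`."""
    return shutil.disk_usage(directory).free


def require_free_space(directory: str, needed_bytes: int) -> None:
    """Raise before building the index if the disk is obviously too small."""
    available = free_bytes(directory)
    if available < needed_bytes:
        raise RuntimeError(
            f"only {available} bytes free in {directory}, need {needed_bytes}"
        )


def index_looks_complete(fmi_path: str, min_bytes: int) -> bool:
    """True if the index file exists and is at least `min_bytes` in size.

    A truncated index (e.g. 3 GB instead of ~12 GB) fails this check.
    """
    return os.path.isfile(fmi_path) and os.path.getsize(fmi_path) >= min_bytes
```

For example, `index_looks_complete("kaiju_db.fmi", 10 * 1024**3)` would have flagged the 3 GB file described above. A size check is only a heuristic and does not prove the index is valid.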

I also checked the sequence lengths in the RefSeq and proGenomes databases (kaiju_db.faa). The shortest sequence in RefSeq is 5 aa long (header >68_565050), and the shortest in proGenomes is 1 aa long (header >684010_1238237). Am I correct that these sequences are not used for index construction? What is the cutoff value for discarding short sequences?

Kristjan

pmenzel commented 7 years ago

Good that it works now. All sequences in the input file are used for index construction. During the search, however, all sequences below the minimum match length, which defaults to 11 aa, are discarded.
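To see how many database sequences fall below that length, a protein FASTA file such as kaiju_db.faa can be scanned with a short script. This is a rough sketch: `read_fasta` and `summarize_lengths` are hypothetical helpers, and the threshold of 11 is the default minimum match length quoted above.

```python
from typing import Iterator, Optional, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file, joining wrapped lines."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize_lengths(
    path: str, min_match_length: int = 11
) -> Tuple[Optional[Tuple[str, int]], int, int]:
    """Return (shortest (header, length), count below threshold, total count)."""
    shortest = None
    below = 0
    total = 0
    for header, seq in read_fasta(path):
        total += 1
        length = len(seq)
        if shortest is None or length < shortest[1]:
            shortest = (header, length)
        if length < min_match_length:
            below += 1
    return shortest, below, total
```

Calling `summarize_lengths("kaiju_db.faa")` returns the shortest record together with the number of sequences shorter than 11 aa and the total number of sequences.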