bioinformatics-centre / kaiju

Fast taxonomic classification of metagenomic sequencing reads using a protein reference database
http://kaiju.binf.ku.dk
GNU General Public License v3.0

Classification with refseq but none with refseq_nr #272

Closed: Drrachelmoore closed this issue 11 months ago

Drrachelmoore commented 11 months ago

Hello, I have run into an interesting issue. I first ran my data with the refseq database and got good results, but we wanted to see if our metagenomes included any eukaryotes. So I am running it again with the refseq_nr database instead. However, some of the samples that were classified with refseq originally now had 0 bytes of data in the outfile produced with refseq_nr. Aren't these the same essentially? refseq_nr is just refseq + microbial eukaryotes?

pmenzel commented 11 months ago

Hi,

> had 0 bytes of data in the outfile produced with refseq_nr

This seems to be an issue with running the program. The output file should always contain one line per read. If the output file is empty, then something went wrong.
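The "one line per read" rule can be checked directly. The following is a minimal sketch, not part of Kaiju itself: it assumes uncompressed single-end FASTQ input with four lines per record, and Kaiju's default tab-separated output in which the first column is `C` (classified) or `U` (unclassified). The function names `count_fastq_reads`, `summarize_kaiju_output`, and `check_run` are hypothetical helpers introduced for illustration.

```python
import sys
from pathlib import Path


def count_fastq_reads(fastq_path):
    """Count reads in an uncompressed FASTQ file, assuming four lines per record."""
    with open(fastq_path) as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def summarize_kaiju_output(out_path):
    """Return (total, classified, unclassified) line counts for a Kaiju output file."""
    classified = unclassified = 0
    with open(out_path) as handle:
        for line in handle:
            # The first tab-separated column is assumed to be "C" or "U".
            flag = line.split("\t", 1)[0].strip()
            if flag == "C":
                classified += 1
            elif flag == "U":
                unclassified += 1
    return classified + unclassified, classified, unclassified


def check_run(fastq_path, out_path):
    """Return a list of problems found; an empty list means the counts agree."""
    if Path(out_path).stat().st_size == 0:
        return ["output file is empty (0 bytes)"]
    problems = []
    n_reads = count_fastq_reads(fastq_path)
    total, _, _ = summarize_kaiju_output(out_path)
    if total != n_reads:
        problems.append(f"{total} output lines for {n_reads} input reads")
    return problems


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_kaiju_output.py reads.fastq kaiju.out
    for problem in check_run(sys.argv[1], sys.argv[2]):
        print(problem)
```

An empty or short output file flagged this way points to a failed or interrupted run (for example, running out of memory with the larger database) rather than to a difference in database content. For gzipped or paired-end input the read count would need to be adapted.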

> refseq_nr is just refseq + microbial eukaryotes?

The refseq database contains only proteins from complete genome assemblies of bacteria and Archaea as well as viruses.

The refseq_nr database is created from the non-redundant protein collection of RefSeq assemblies filtered for bacteria, Archaea, viruses and microbial eukaryotes. Protein sequences are from the files called `complete.nonredundant_protein.*.protein.faa.gz` from here

Drrachelmoore commented 11 months ago

Okay, thanks! I am not sure why it's running into that issue with only one of my samples, but I'll keep working on it.