bioinformatics-centre / kaiju

Fast taxonomic classification of metagenomic sequencing reads using a protein reference database
http://kaiju.binf.ku.dk
GNU General Public License v3.0

SEG filtering on the nr protein database? #51

Closed by toddknutson 6 years ago

toddknutson commented 6 years ago

Hi,

Thanks for Kaiju, it's very nice!

Is SEG filtering for low-complexity sequences performed on the nr database when makeDB.sh is run? I understand that I can enable it for my input sequences by running kaiju -x, but I'm wondering whether the nr database has also been filtered. If not, would you suggest running SEG independently on the nr sequences before building the Kaiju database?

Thanks!

pmenzel commented 6 years ago

Hi,

The SEG filter is run on the fly for each amino acid fragment that is searched against the selected reference database (refseq, nr, progenomes), but makeDB.sh does not run it on the database itself.

In principle you can run it independently on the downloaded nr database, but then you need to remove the filtered sequences rather than just mask them. The reason is that the BWT and FM index used in Kaiju use only the 20 uppercase letters of the standard amino acid alphabet.
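As a rough illustration of removing rather than masking, here is a minimal Python sketch. It is not part of Kaiju or makeDB.sh. It assumes the SEG output is a FASTA file in which masked residues appear as lowercase letters or `X`; the names `strip_masked` and `filter_fasta` are hypothetical helpers made up for this example.

```python
from itertools import chain

# The 20 uppercase letters of the standard amino acid alphabet.
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def strip_masked(seq):
    """Return seq with every character outside the 20-letter alphabet removed.

    Assumption: masked residues are lowercase letters or 'X'. Any other
    non-standard symbol (for example B, Z, U or '*') is dropped as well.
    """
    return "".join(c for c in seq if c in AMINO_ACIDS)


def filter_fasta(lines):
    """Yield header and sequence lines of a FASTA stream with masked residues removed.

    Records whose sequence is empty after filtering are skipped entirely.
    """
    header, parts = None, []
    # The trailing ">" is a sentinel that flushes the last record.
    for line in chain(lines, [">"]):
        line = line.rstrip("\n")
        if line.startswith(">"):
            seq = strip_masked("".join(parts))
            if header is not None and seq:
                yield header
                yield seq
            header, parts = line, []
        else:
            parts.append(line)
```

Passing an open file handle of the masked FASTA to `filter_fasta` and writing out each yielded line would give a file containing only the 20 uppercase letters. Note that deleting a masked stretch joins its two flanks, which creates a junction that does not exist in the original protein; splitting the sequence at masked stretches would be an alternative.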

That said, I think filtering both the database and the reads will have little effect.

Peter

toddknutson commented 6 years ago

Okay, great, thanks for the explanation!

Todd

pmenzel commented 6 years ago

No problem :)