Question on kaiju classification from PacBio data

DaRinker commented 2 years ago

Looking for insight/confirmation that Kaiju is operating as expected. My results thus far suggest that Kaiju is misclassifying reads that are very strong BLAST hits to divergent taxa.

Specifically, I ran kaiju using the nr_euk database over ~300000 PacBio reads with kaiju default parameters. I just happened to spot-check a 6.6kb PacBio read that was classified by kaiju as best matching txid 36630. However, that taxon doesn't appear even among the top 100 BLAST hits (NCBI nr), and the top NCBI BLAST hit (88% coverage with 99% sequence identity) corresponds to NCBI:txid5061 (a divergent taxon that should also be present in nr_euk).

Conversely, if I then take another read that Kaiju has classified as belonging to that divergent taxon (i.e. txid 5061), it actually BLASTs very strongly (99% match over 7.7kb) to a different taxon (NCBI:txid746128). That taxon is also very divergent from the taxon assigned by kaiju.
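For anyone wanting to repeat this kind of spot-check, here is a minimal sketch for looking up which taxon Kaiju assigned to a given read. It assumes the default tab-separated Kaiju output, where the first three columns are the status (C or U), the read name, and the NCBI taxon ID. The function names and the read names in the usage note are hypothetical, chosen only for illustration.

```python
def parse_kaiju_output(lines):
    """Map read name -> assigned taxon ID (None if unclassified).

    Assumes tab-separated lines whose first three columns are:
    status ("C" or "U"), read name, NCBI taxon ID.
    """
    assignments = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        status, read_name, taxid = fields[0], fields[1], fields[2]
        assignments[read_name] = int(taxid) if status == "C" else None
    return assignments


def classified_fraction(assignments):
    """Fraction of reads that received a taxon assignment."""
    if not assignments:
        return 0.0
    classified = sum(1 for taxid in assignments.values() if taxid is not None)
    return classified / len(assignments)
```

Usage would be along the lines of `parse_kaiju_output(open("kaiju.out"))["read_A"]` (file and read name are placeholders); the returned taxon ID can then be compared by hand against the taxon of the top BLAST hit for the same read.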

Any ideas what might be going on? Might kaiju be having problems with these long reads? I'm now concerned that I cannot trust any of the classifications I'm getting back.

EDIT: Am guessing I need to adjust some default parameters. Will begin with specifying a higher minimum score.

EDIT2: Did not fix the problem. I tried running kaiju with more stringent score requirements (-s 500 and -s 1000), and those resulted in either the same misclassification or no classification (respectively). Any advice on what to try next?

EDIT3: I have now tried truncating all my PacBio reads to just the first 500bp (under the assumption that the long reads were somehow "breaking" kaiju). This (sort of) helped and I'm no longer seeing flagrant misclassifications; however, I'm now only able to classify ~25% of my reads, so still not there yet.
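The truncation step could be done roughly like this. It is only a sketch, not the exact script used: it assumes the reads are in FASTA format, and the function name `truncate_fasta` and the file names in the usage note are hypothetical.

```python
def truncate_fasta(lines, max_len=500):
    """Yield (header, sequence) pairs with each sequence cut to its first max_len bases.

    Handles multi-line FASTA records; assumes FASTA input (not FASTQ).
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)[:max_len]
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)[:max_len]


def write_fasta(records, handle):
    """Write (header, sequence) pairs as single-line FASTA records."""
    for header, seq in records:
        handle.write(f">{header}\n{seq}\n")
```

For example, `write_fasta(truncate_fasta(open("reads.fasta")), open("reads_500bp.fasta", "w"))` would write the truncated reads to a new file (both file names are placeholders). Keeping only the first 500bp discards most of each long read, which may be part of why fewer reads get classified.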

pguenzi-tiberi commented 1 year ago

Hello @DaRinker,

Have you continued your trials with Kaiju and long reads? Did you manage to improve the classification results?

DaRinker commented 1 year ago

No. I decided it was the wrong tool for the job. Maybe there have been developments since? Right now I'm most confident when using Kaiju on Illumina reads.

pguenzi-tiberi commented 1 year ago

Thank you very much for your reply!

bioinformatics-centre / kaiju

Question on kaiju classification from PacBio data #246