Retrieve Unclassified Sequences

mariferrarini commented 5 years ago

Dear authors,

First of all, I have to say that kodoja is working perfectly for my purposes, really good job!

I would like to know whether there is any functionality that allows me to list or retrieve the sequences that were unclassified. I don't know if this is usual, but in my case between 5 and 20% of the sequences are unclassified.

Also, I gave as input an animal (DNA only) and a bacterium (both DNA and pep) as "host" genomes that I know for sure are present in the fastq; however, only Kraken was able to detect the presence of the bacterium. Is this somehow expected in some cases? Could this issue be related to the unclassified sequences too?

Thank you again for the great tool. Mariana.

mariferrarini commented 5 years ago

Maybe I've found the answer, but just to confirm: if I filter kodoja_VRL for taxID 0, would that give me the unclassified seqs? Or does kodoja_retrieve --taxID 0 give me exactly that, or does it mean something else? Thank you.

peterjc commented 5 years ago

I don't think that kodoja_retrieve currently has the functionality to pull out unclassified reads - just the option to restrict to particular taxa.

Using kodoja_retrieve --taxID 0 ... wasn't explicitly considered (there is no NCBI taxid zero; the tree root is node 1), and it looks to behave the same as leaving out the --taxID option, i.e. the retrieve script does not filter by taxonomy, giving you everything:

https://github.com/abaizan/kodoja/blob/kodoja-v0.0.10/diagnosticTool_scripts/kodoja_retrieve.py#L79

You would probably have to post-process the output to get the unclassified reads.
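As a rough sketch of that kind of post-processing (not part of Kodoja itself): the helper below reads a tab-separated per-read table with a header row, collects the IDs whose taxid is 0, and copies the matching records out of a FASTQ file. It assumes unclassified reads are recorded with taxid 0, as Kraken and Kaiju do in their own outputs. The function names, file names and column names are all hypothetical, so check the header of your own table and pass the real column names in.

```python
import csv


def unclassified_read_ids(table_path, id_column, taxid_column, delimiter="\t"):
    """Return the set of read IDs whose taxid column is 0.

    Assumption: the table has a header row, and taxid 0 marks an
    unclassified read. The column names are supplied by the caller.
    """
    ids = set()
    with open(table_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter=delimiter):
            if row[taxid_column].strip() == "0":
                ids.add(row[id_column])
    return ids


def write_fastq_subset(fastq_in, fastq_out, wanted_ids):
    """Copy 4-line FASTQ records whose ID is in wanted_ids; return the count.

    Assumption: plain (uncompressed) 4-line FASTQ, and the read ID is the
    first whitespace-delimited token after the leading '@'.
    """
    kept = 0
    with open(fastq_in) as src, open(fastq_out, "w") as dst:
        while True:
            record = [src.readline() for _ in range(4)]
            if not record[0].strip():
                break  # end of file (or trailing blank line)
            read_id = record[0][1:].split()[0]
            if read_id in wanted_ids:
                dst.writelines(record)
                kept += 1
    return kept
```

Usage would be along the lines of `ids = unclassified_read_ids("my_table.txt", "read_id", "taxid")` followed by `write_fastq_subset("reads.fastq", "unclassified.fastq", ids)`. If the IDs in the table carry a suffix such as /1 or /2 that the FASTQ headers lack (or the other way round), they would need normalising before the lookup.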

As to the Kraken only results, quoting the https://github.com/abaizan/kodoja/wiki/Kodoja-Manual opening paragraph:

Kodoja is a bioinformatics workflow that takes RNA-seq data files and uses k-mer profiling to identify virus sequences that are present. It combines two existing tools, Kraken (1) for taxonomic classification using k-mers at the nucleotide level and Kaiju (2) for sequence matching at the protein level.

Kaiju has only been given bacterial proteins (you didn't have any animal proteins). Kraken has been given both bacterial and animal DNA, so ought to find matches to both. Are you giving it the whole genome, or perhaps just the animal coding sequences?

You might have to try running Kraken directly on the same database and sequences to confirm that it behaves the same way; at least that should help identify where the problem is (e.g. the DB construction).
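To make that comparison easier, here is a small sketch for summarising Kraken's own per-read output. It assumes the standard tab-separated layout, where the first column is C or U (classified or unclassified), the second is the read ID and the third is a numeric taxid (0 for unclassified). The function name and the file name in the usage note are hypothetical.

```python
from collections import Counter


def summarise_kraken_output(path):
    """Count reads per status (C/U) and per assigned taxid.

    Assumption: standard Kraken per-read output, tab-separated, with
    status, read ID and numeric taxid in the first three columns.
    """
    status_counts = Counter()
    taxid_counts = Counter()
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            status, taxid = fields[0], fields[2]
            status_counts[status] += 1
            if status == "C":
                taxid_counts[int(taxid)] += 1
    return status_counts, taxid_counts
```

For example, `status_counts, taxid_counts = summarise_kraken_output("kraken_output.txt")` and then `taxid_counts.most_common(10)` lists the most frequently assigned taxids. Bear in mind that Kraken assigns each read to a lowest common ancestor, so host reads may show up under a taxid above or below the one you gave for the genome rather than under that exact taxid.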

mariferrarini commented 5 years ago

Thank you for the clarifications. I don't know if I understood the question but I have provided the bacterial (whole genome + coding pep) and animal (whole genome) separately, with different taxids. I will try to run both hosts separately and will try to include coding peptides of the animal in the next run to check if they are behaving the same way as running together. I will be able to let you know if I have any other results in the next few weeks. Thank you again.

peterjc commented 5 years ago

OK, good luck. Kodoja was written and tested with plant host genomes, but I don't see why it shouldn't work here too.
