ADVICE: read abundance at domain/kingdom level?

rjsorr commented 2 years ago

I am hoping for some advice.

Basically, I want to classify my PE reads as Archaea, Bacteria, Eukaryote or Virus (d: domain, with the K: kingdom option being optimal) to get an overview of the % abundance of each before I focus on just bacteria (i.e. what % of the reads left "unclassified" against a bacterial database are in fact Archaea, Eukaryotes or viruses). As such, I don't need to go any deeper in the taxonomic classification, but I'm struggling to find a good method to perform this, so I'm hoping someone may have a suggestion.

I have created a database of a single genome for each class within Archaea, Bacteria, Eukaryote and Virus and have tried kraken2 with default parameters. When focusing on bacteria with a species/strain database, I use a higher --confidence and --minimum-hit-groups to avoid false positives. But here I want to do the opposite: I want to force kraken2 to classify a read with low identity so that it gets a domain/kingdom assignment rather than "unclassified", and so avoid sampling issues of the database. I have done some testing and it seems I cannot relax the kraken2 classification enough to do this. I was hoping this could be achieved with a relatively small database, but maybe the specificity of the program will mean the opposite?

Anyway, I am wondering if anyone has been able to achieve this without a database bias. Maybe kraken2 is not the correct tool and you have another suggestion? I have tried DIAMOND followed by MEGAN and this gives the type of result I want, but in practice it cannot be used, as it uses too many resources even on the smallest of datasets and so is not scalable. Maybe a 16S+18S database is a better option (conservation plus barcode issues make me think not)? Or even the complete NCBI nr (as in DIAMOND --> MEGAN) with the translated search option?

Really hoping someone has a good and robust solution here.

Regards

jenniferlu717 commented 2 years ago

I don't think there are any parameters that you can set for kraken2 to achieve this result. You would have to manually create a database. What you can do is try to give all of the bacterial genomes the same taxid in the seqid2taxid.map file (and do the same for all of the other domains/kingdoms), but I don't have a better method for you.

I think you have to include all of the genomes possible. I don't think you can include only representatives from each clade.
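
As a rough illustration of the seqid2taxid.map suggestion above, here is a minimal Python sketch that rewrites each taxid in the map to its domain-level ancestor. It assumes the map is tab-separated (`seqid<TAB>taxid`) and that an NCBI-style `nodes.dmp` (fields separated by `\t|\t`, child taxid then parent taxid) is available. The function names are hypothetical and not part of kraken2; the four taxids are the NCBI IDs for Bacteria (2), Archaea (2157), Eukaryota (2759) and Viruses (10239). The file paths in the usage comment are placeholders.

```python
# Hypothetical helper, not part of kraken2: collapse every taxid in a
# seqid2taxid.map to its domain-level ancestor using NCBI nodes.dmp.

DOMAIN_TAXIDS = {2: "Bacteria", 2157: "Archaea", 2759: "Eukaryota", 10239: "Viruses"}


def load_parents(nodes_lines):
    """Build {taxid: parent_taxid} from nodes.dmp-style lines."""
    parents = {}
    for line in nodes_lines:
        fields = line.split("\t|\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        parents[int(fields[0])] = int(fields[1])
    return parents


def domain_of(taxid, parents):
    """Walk up the tree; return the domain taxid, or None if none is reached."""
    seen = set()
    while taxid not in DOMAIN_TAXIDS:
        parent = parents.get(taxid)
        if parent is None or parent == taxid or taxid in seen:
            return None  # unknown taxid, root reached, or a cycle
        seen.add(taxid)
        taxid = parent
    return taxid


def collapse_map(map_lines, parents):
    """Yield 'seqid<TAB>taxid' lines with taxids replaced by their domain."""
    for line in map_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        seqid, taxid = line.split("\t")[:2]
        domain = domain_of(int(taxid), parents)
        # Leave the original taxid in place when no domain could be resolved.
        yield f"{seqid}\t{domain if domain is not None else taxid}"


# Usage sketch (paths are placeholders):
# with open("taxonomy/nodes.dmp") as nodes, open("seqid2taxid.map") as m:
#     parents = load_parents(nodes)
#     with open("seqid2taxid.domain.map", "w") as out:
#         out.write("\n".join(collapse_map(m, parents)) + "\n")
```

The sketch only rewrites the map; the database would still need to be built from the edited file, and sequences whose taxid cannot be traced to one of the four domains keep their original taxid so they can be inspected by hand.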

rjsorr commented 2 years ago

Thanks for the reply @jenniferlu717. Actually, in the end I made a protein database of the complete NCBI nr and classified with default parameters, as I just wanted a high-level taxonomic classification anyway (later using NCBI nt and stringent parameters to classify the focus groups to the species level). Obviously this is still an estimate / best guess of the abundance in each kingdom, but it is the best solution I could find that gives the least database bias (increased eukaryote : prokaryote ratio compared to the nt database) and runs in a relatively OK time frame. Actually, as the result is based more on transcripts than on genomes for eukaryotes, it gives a more inflated (mostly fungi), and probably more realistic, interpretation of their true abundance levels in microbiomes.
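
For the domain-level overview discussed in this thread, the percentages can be read from a kraken2 report without going deeper into the taxonomy. The Python sketch below assumes the standard six-column, tab-separated report (percentage, clade reads, direct reads, rank code, taxid, indented name) and that domain-level rows carry rank code `D`, with `U` for unclassified. The function name and the sample rows are made up for illustration.

```python
# Hypothetical helper: pull the unclassified and domain-level rows out of a
# kraken2 report (standard six-column, tab-separated format assumed).


def domain_percentages(report_lines):
    """Return {name: percent of all reads} for rows with rank code U or D."""
    result = {}
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        percent, rank, name = fields[0], fields[3], fields[5]
        if rank in ("U", "D"):
            # Names are indented by taxonomic depth, so strip the padding.
            result[name.strip()] = float(percent)
    return result


# Made-up rows for illustration only; not results from this thread.
SAMPLE = [
    " 12.50\t125\t125\tU\t0\tunclassified\n",
    " 87.50\t875\t5\tR\t1\troot\n",
    " 70.00\t700\t10\tD\t2\t    Bacteria\n",
    "  1.00\t10\t10\tP\t1224\t      Pseudomonadota\n",
    " 15.00\t150\t20\tD\t2759\t    Eukaryota\n",
]

if __name__ == "__main__":
    for name, pct in domain_percentages(SAMPLE).items():
        print(f"{name}\t{pct:.2f}%")
```

Because each percentage is the share of all reads assigned to the clade rooted at that row, the `D` rows plus the `U` row give the overview directly; reads assigned only to the root or other levels above the domain are not included in any `D` row.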

DerrickWood / kraken2

ADVICE: read abundance at domain/kingdom level? #600