Formatting custom database including "dark matter" microbes

jgrzadziel commented 5 years ago

Hi everyone! Recently I was working on a huge amount of 16S amplicon data from various environments. It is not a surprise that in some (most of...) samples the unclassified taxa (at least at genus level) was dominant. Sometimes it was about 30, sometimes 80% of the entire microbiome.

DADA2 is an excellent tool to process all steps in the bioinformatic analysis of fastq files. To date, I was used to assigning taxonomy with RDP N.B. classifier (included in the DADA2 package).

However, it is commonly known that it is not good enough to process unculturable taxa which are not present in the databases (so it is not good enough for the majority of taxa in my case). Most of that reads are overclassified, while they are not even closely related to any taxa in the database.

So, I have started using idTaxa from DECIPHER package, which is great for that purpose (however it is also prone to some errors and not so perfect, but currently the most reliable). It appeard that using idTaxa, there is even more "dark matter" fraction of microbes in my samples.

And here are the problems I have noticed with any workflow:

1. Further processing (e.g. using phyloseq) should be done for aggregation of the resulted taxonomy/otu table. This step includes taxonomy aggregation based on each taxonomic level name.

When it comes to fully classified taxa, e.g. Bacillus, there is no problem, because every "hit" of Bacillus is equal and these sequences are mostly identical (~99%), and the count table is simply summarized which is ok.

order	family	genus	B1	B2	B3
Bacillales	Bacillaceae 1	Bacillus	0	20	0
Bacillales	Bacillaceae 1	Bacillus	5	10	0
Bacillales	Bacillaceae 1	Bacillus	22	0	7

But when it comes to some unclassified reads for example:...

order	family	genus	B1	B2	B3
unclassified	unclassified	unclassified	3566	2348	4417
unclassified	unclassified	unclassified	0	6	11
unclassified	unclassified	unclassified	1775	1517	1491

... in most cases, the sequences are completely different (mostly less than 60-80% similarity), however, they "names" match to each other and then each of "unclassified" read is also simply summarised resulting in the final "unclassified" group with summarised abundance. And here we lost a lot of information. Moreover, for that "summarized" taxa, one representative sequence is chosen by a script, making another confusion. In that situation, you can wrongly interpret your results for example:

you see that 30% of your sample is mysterious "unclassified" read
then you blast it (e.g. on ncbi)
Voila! You point in your publication that you have discovered some bacterium that is unculturable, some researchers also found this bacterium in various environments, but in your sample, it accounts for 30% !! So this mysterious microbe must be special and you need to look closer, maybe even try to culture it!
well...this is all wrong and the true is that 30% of you sample is occupied by unclassified microbies...but not by this one dominant , instead of, by 1500 different "low-abundance" unclassified microbes.

Here's how it looks now:

order	family	genus	B1	B2
unclassified	unclassified	unclassified_JG_F_00014	4607	5354
unclassified	unclassified	unclassified_JG_F_00055	4135	2230
unclassified	unclassified	unclassified_JG_F_00012	4079	3560

2. Another problem: Each time you sequence different samples, the same "dark matter" microbes are found but you lost the information about them every single time. They are just thrown away and you didn't even notice that there is a pattern for example:

every acidic soil contain the same ~20% fraction of the same unclassified reads (based on sequences identities), which will probably in future help to describe some core biomes for any environment, based on that "dark matter"/unclassified reads.

3. What I have tried to do with that issue:

I have written a script that modifies the reads processing after the assignment step (very draft and ugly version) in the following steps:

a) before taxonomy aggregation a unique identifier is assigned to each unclassified read b) these reads are clustered based on 99% sequences similarity c) a unique identifier is assigned to each unclassified "cluster" d) the counts of each unclassified reads are aggregated based on the same cluster membership e) the count table of classified reads is merged with the new "unclassified clusters" count table

Now the result table makes sense: I still have 30% (or more) of unclassified reads, but not as one random representative, instead, I have a few hundreds of low-abundance unclassified representatives which reflects the true microbial diversity. Now the statistics are more reliable and e.g. PCA analysis is not disrupted by our limited taxonomy knowledge. These results should be consistent with future calculations when more taxa will be identified.

Before: (sorted by B1, decreasingly)

class	order	family	genus	B1	B2	B3
unclassified	unclassified	unclassified	unclassified	27.30461	40.75292	30.53209
Actinobacteria	Actinomycetales	unclassified	unclassified	17.55667	10.30076	20.94526
Acidobacteria	unclassified	unclassified	unclassified	17.08071	12.08076	19.79791
Acidobacteria	unclassified	unclassified	Gp1	7.900283	11.20099	4.38016
Actinobacteria	Actinomycetales	Mycobacteriaceae	Mycobacterium	6.699606	3.264062	6.833538

After: (sorted by B1, decreasingly)

class	order	family	genus	B1	B2	B3
Acidobacteria_Gp1	unclassified	unclassified	Gp1	7.900283	11.20099	4.38016
Actinobacteria	Actinomycetales	Mycobacteriaceae	Mycobacterium	6.699606	3.264062	6.833538
unclassified	unclassified	unclassified	unclassified_JG_F_00014	3.979511	3.912198	5.55541
Acidobacteria_Gp1	unclassified	unclassified	unclassified_JG_F_00055	3.571799	1.629474	4.34703
unclassified	unclassified	unclassified	unclassified_JG_F_00012	3.523426	2.601312	4.353133

### And here is when finally I'm asking for help I want to create a custom database for example on the basis of RDP database (or GTDB), which would include these unclassified reads with my identifiers. It would help to assign reads from other samples using the continuously updated database to identify the unclassified reads which are present repeatedly in different experiments.

And here I have two problems (depending on whci classifier I want to use: assignTaxonomy from dada2 or idtaxa from decipher):

assignTaxonomy(...) expects a training fasta file (or compressed fasta file) in which the taxonomy corresponding to each sequence is encoded in the id line in the following fashion (the second sequence is not assigned down to level 6):

There is information that you can include taxa classified to higher levels (e.g. down to the family, but not to genus) but how to include taxa classified to lower level (to the genus level) without having information about higher levels (family and higher) ?

A similar problem is connected with creating a custom database using decipher. In that situation, if I replace the higher levels with "unclassified" like this:

dunclassified;punclassified;cunclassified;ounclassified;funclassified;gunclassified_JG_F_00019

The classifier fails, telling me that I have repeated "unclassified" value at different levels.

Then when I remove all higher taxa like this:

d;p;c;o;f__;g___JG_F_00019

or

;;;;;;JG_F_00019

or simply

;JG_F_00019

The classifier is not failing, but "JG_F_00019" is treated as a domain, not a genus, because it is the first "not empty" value. And classifying makes no sense now.

I would appreciate any help.

benjjneb commented 5 years ago

There is information that you can include taxa classified to higher levels (e.g. down to the family, but not to genus) but how to include taxa classified to lower level (to the genus level) without having information about higher levels (family and higher) ?

For the assignTaxonomy training fasta, you can do this by just including placeholders for the higher taxonomic levels (e.g. "unclassified"). This may or may not achieve exaclty what you want in subsequent phyloseq glomming steps, but it shoudl work fine, e.g. your example id line

>d__unclassified;p__unclassified;c__unclassified;o__unclassified;f__unclassified;g__unclassified_JG_F_00019

should work with assignTaxonomy (I think).

Also, have you seen this recent preprint on the "autotax" methodology? It seems to be trying to do the same thing you are: https://www.biorxiv.org/content/10.1101/672873v2

jgrzadziel commented 5 years ago

Thank you for advice @benjjneb ! I performed the classification using my custom database and it works perfectly using assignTaxonomy! It was formatted as you mentioned:

dunclassified;punclassified;cunclassified;ounclassified;funclassified;gunclassified_JG_F_00019

However, idTaxa (decipher) failed because of repeated taxa names on different levels (the same = unclassified). Here the problem is that idTaxa breaks the taxonomy to each level, and each taxon must be unique.

So when I format the database as above, the classifier fails because of repeated taxa names.
When I leave "unclassified" and keep it like this:

d;p;c;o;f__;g___JG_F_00019

the classifier also fails and additionally empty taxa names are created.

Finally, when I just leave:

g__JG_F_00019

The classifier works, but each genus formatted like this is assigned as domain, not genus.

Thank you also for the very interesting preprint!

benjjneb commented 5 years ago

@digitalwright any thoughts on this database formatting question for IDTaxa?

marcomeola commented 5 years ago

Dear @jgrzadziel, thank you for presenting your problem so well. That's an issue we have been facing while developing the dairydb (https://github.com/marcomeola/DAIRYdb), a custom 16S database for bacteria and archaea from dairy products. Since in dairy products it is essential to get classification down to species level, we didn't want to leave unknowns and uncultured species as it stands, but we have decided to fill up the the ranks down to the species level picking up the last common rank. Doing so, there are no totally unknowns at species level. As an example, if the last common rank known is the family Lactobacillaceae, then the genus and species ranks are filled _LactobacillaceaeGenus and _LactobacillaceaeSpecies in the database, respectively. That allows not to merge the abundance values of all unknowns altogether. There will always be some ASVs you can only classify to one of the higher ranks, even with IDTAXA, which is an excellent tool. You would finally need to fill up the classification of the ASVs as described above and then import them into a phyloseq object. Hope that helps.

jgrzadziel commented 5 years ago

@marcomeola Thank you for the advice! This is a pretty nice solution.

But still, I can see some problems: 1) If the sequence is unclassified at every level (even Kingdom/Domain, which is not so uncommon) , there is no higher level to pick up

dunclassified;punclassified;cunclassified;ounclassified;funclassified;gunclassified

2) If there are many unclassified sequences at e.g. genus level from the same family, they should have additionally unique numeric identifiers to avoid aggregating them in subsequent processing e.g.:

(...)fLactobacillacea; Lactobacillaceae_Genus_id0001 (...)fLactobacillacea; Lactobacillaceae_Genus_id0002 etc.

marcomeola commented 5 years ago

ASVs with no annotation at domain/kingdom level feels suspicious to me. Did you blast the sequence to check whether it really is of either archaea or bacteria?

jgrzadziel commented 5 years ago

@marcomeola I know this is suspicious, and this was not a problem when I used to classify via RDP classifier, since 100% of reads were always assigned at least to the domain/kingdom (either bacteria or archaea).

The situation has changed when I moved to idTaxa (decipher), which is known to be less prone to overclassification errors. And since then I've found that in some samples I have more than 3000 unique sequences that not belong to any known taxa even up to the domain/kingdom level.

I repeated the classification using different parameters in R, also using online classifier available here. Most of these 3000 reads are repeatedly not classified to the domain/kingdom.

When I blast them, the highest sequence similarity to any sequence in NCBI database is less than 98%, sometimes below 95 or less. The results taxonomy tree shows that some of those sequences belong to "Unkown rich in GC G+ group".

The bootstrap for the domain is sometimes as little as ~20, as well as for rootrank (also ~ 20).

Here you can find fasta file when I filtered only sequences that previously were unclassified to the domain/kingdom level. Interestingly when I classify them again (using the same database and parameters with idtaxa) some of them are positively classified to lower ranks...

Unclassified_Kingdom.txt

Here is the idtaxa classification result: RDP_1570701818183844233.txt

And just a few examples:

OTU_unclassified_domain_0212 Root (rootrank, 21%); unclassified_Root (domain, 21%) OTU_unclassified_domain_0562 Root (rootrank, 15%); unclassified_Root (domain, 15%)

@digitalwright - I would appreciate if you could have a look on this issue... Thanks :)

marcomeola commented 5 years ago

@jgrzadziel, just keep in mind that idtaxa is a classification tool and its performance highly depends on the database you use.

digitalwright commented 5 years ago

A few points to add to the discussion...

The RDP Classifier will even assign high confidence at the domain level to many random sequences. This is expected given the way the algorithm is formulated, especially if the taxonomy only has two options at the domain level (i.e., bacteria and archaea).

IDTAXA is not susceptible to this problem. That is, it will only classify with high confidence when there are sequences considerably like the the query sequence in the reference set. Again, this is a feature of the algorithm's construction.

Ultimately, the reference training set must include sequences with some degree of similarity to the query sequence. If this is not the case then the result will be Root; unclassified_Root when using IDTAXA.

All databases are not equal in their breadth/depth of references sequences. When I run your Unclassified_Kingdom.txt sequences through IDTAXA with the GTDB training set, about 75% are assigned to the bacterial domain. With the SILVA training set this increases to about 90%.

benjjneb / dada2

Formatting custom database including "dark matter" microbes #853