Silva missing ranks incorrectly propagated into DADA2-formatted database by `dada2:::makeTaxonomyFasta_SilvaNR()`

benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution

http://benjjneb.github.io/dada2/

GNU Lesser General Public License v3.0

474 stars 143 forks source link

Silva missing ranks incorrectly propagated into DADA2-formatted database by `dada2:::makeTaxonomyFasta_SilvaNR()` #1293

Open mikemc opened 3 years ago

mikemc commented 3 years ago

A user pointed out to me in https://github.com/mikemc/dada2-reference-databases/issues/1#issue-826059569 problems with some of the taxonomy assignments from our Silva 138 database (https://zenodo.org/record/3986799) in which taxonomy names were sometimes showing up in the wrong rank. I tracked this issue to a problem in the Silva precursor file SILVA_138_SSURef_NR99_tax_silva.fasta.gz. For example, for the genus Anaerococcus,

❯ zcat SILVA_138_SSURef_NR99_tax_silva.fasta.gz | grep "Anaerococcus" | head -n 2                                                                                                                            main
>HL281652.2.1435 Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Anaerococcus;unidentified
>AF542233.1.1411 Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Anaerococcus;Anaerococcus lactolyticus

the Family field is actually missing from the fasta headers. The end result is that in the user's assignments, the genus Anaerococcus showed up as the family name, and the genus was NA. This problem seems to be fixed in Silva 138.1,

❯ zcat SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz | grep "Anaerococcus" | head -n 2                                                                                                                          main
>HL281652.2.1435 Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Family XI;Anaerococcus;unidentified
>AF542233.1.1411 Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Family XI;Anaerococcus;Anaerococcus lactolyticus

However, I'm wondering if we should modify dada2:::makeTaxonomyFasta_SilvaNR() to catch such errors rather than propagate them when creating our database files from the Silva files. A simple method might be to check that every header has 6 semi-colons.

Edit: The problem with 138 was also in the taxonomy file,

❯ grep "Anaerococcus" tax_slv_ssu_138.txt
Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Anaerococcus;        45574   genus           138

❯ grep "Anaerococcus" tax_slv_ssu_138.1.txt
Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Family XI;Anaerococcus;      50192   genus           138.1

Edit: There are also a much fewer set of cases where there were a bunch of extra semi-colons, so we'd want to catch too many semi-colons as well

❯ zcat SILVA_138_SSURef_NR99_tax_silva.fasta.gz | grep "AB511003.1.1389"
>AB511003.1.1389 Bacteria;Patescibacteria;Saccharimonadia;Saccharimonadales;YM;S32;TM7;uncultured bacterium

❯ zcat SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz | grep "AB511003.1.1389"
>AB511003.1.1389 Bacteria;Patescibacteria;Saccharimonadia;Saccharimonadales;YM_S32_TM7_50_20;uncultured bacterium

mikemc commented 3 years ago

There still seem to be cases in Silva 138.1 where a taxon has fewer than full rank names: E.g. Lutispora https://www.arb-silva.de/browser/ssu-138.1/AB186360

❯ zcat SILVA_138.1_SSURef_tax_silva.fasta.gz | grep "Lutispora" | head -n 2
>HL282693.1.1438 Bacteria;Firmicutes;Clostridia;Gracilibacteraceae;Lutispora;unidentified
>MI373130.1.1438 Bacteria;Firmicutes;Clostridia;Gracilibacteraceae;Lutispora;unidentified

here it seems that there is no Order for the Gracilibacteraceae Family. Or maybe that's the wrong way to interpret Silva taxonomy? I'm not sure what ranks Gracilibacteraceae and Lutispora should be seen as here.

(edited to add missing output)

benjjneb commented 3 years ago

Updated the taxonomic references page to Silva 138.1: https://benjjneb.github.io/dada2/training.html

benjjneb commented 3 years ago

Updating dada2:::makeTaxonomyFasta_SilvaNR() to catch such errors could be good. It's hard to catch everything though, we do rely on the quality of these releases to some degree.

mikemc commented 3 years ago

@benjjneb I now realize I'm confused about how Silva + the dada2 makeTaxonomyFasta_SilvaNR() and taxonomy assignment functions are supposed to work in cases where not all ranks are given to a reference sequence, like the above case of Lutispora, which in NCBI taxonomy is a genus in order Clostridiales, and in Silva is a genus but apparently does not have an order. Can you clarify what is supposed to happen here? I'm not sure if it's even right to think of these as errors in Silva - it seems like they just haven't assigned an order yet.

❯ grep "Lutispora" tax_slv_ssu_138.1.txt
Bacteria;Firmicutes;Clostridia;Gracilibacteraceae;Lutispora;    45387   genus           138

benjjneb commented 3 years ago

My understanding is that the Silva SSURef file is supposed to have a fixed taxonomic order, and that if any entries are not there they are to be assumed to be missing at the lowest ranks. This is assumed in how we process that file.

Possibly this is something to check on in the Silva documentation, or to contact them and ask (they answer queries!).

mikemc commented 3 years ago

My understanding is that the Silva SSURef file is supposed to have a fixed taxonomic order, and that if any entries are not there they are to be assumed to be missing at the lowest ranks. This is assumed in how we process that file.

This seems to not be a good assumption; there are e.g. cases where there is no order or family for a given genus in the tax-map file tax_slv_ssu_138.1.txt file. For Lutispora,

Bacteria;Firmicutes;Clostridia; 1859    class           119
Bacteria;Firmicutes;Clostridia;Gracilibacteraceae;      45385   family          138
Bacteria;Firmicutes;Clostridia;Gracilibacteraceae;Lutispora;    45387   genus           138

there is no order, and as a result dada2:::makeTaxonomyFasta_SilvaNR() makes the header

>Bacteria;Firmicutes;Clostridia;Gracilibacteraceae;Lutispora;

which from what I'm understanding will incorrectly make "Lutispora" the family instead of the genus. I think that probably dada2:::makeTaxonomyFasta_SilvaNR() needs to use the tax map file to figure out that there is a missing rank and add in a rank in. It seems like the old mothur-based process might have already had this from the mothur files, since in the first version of our 138 db this genus ended up with headers like

>Bacteria;Firmicutes;Clostridia;Clostridia_or;Gracilibacteraceae;Lutispora;

benjjneb commented 3 years ago

If cross-referencing with the taxon file solves the issue, that seems like the thing to do.

mikemc commented 3 years ago

I think we have to decide how to fill in the missing ranks; do we leave them as missing,

>Bacteria;Firmicutes;Clostridia;;Gracilibacteraceae;Lutispora;

add a single placeholder for all affected taxa

>Bacteria;Firmicutes;Clostridia;Clostridia_order;Gracilibacteraceae;Lutispora;

which might have been the old mothur-based approach; or do we define a new placeholder "Clostridia_order_1", "Clostridia_order_2", etc for each affected lower-level taxon.

To get a sense of how widespread this problem is, I looked at how many prokaryotic genera in tax_slv_ssu_138.1.txt do not have all ranks (based on semicolon number):

domain	semicolons	n	frac
Archaea	3	1	0.005
Archaea	4	1	0.005
Archaea	5	7	0.036
Archaea	6	187	0.954
Bacteria	4	3	0.001
Bacteria	5	47	0.013
Bacteria	6	3690	0.987

so at least it seems to be affecting less than 2% of prokaryotic genera

benjjneb commented 3 years ago

Yeah this is a relatively minor problem, since probably not all of those <6 semicolon entries are inconsistent with the assumption that missing entries are at lower taxonomic levels.

Still worth fixing though. The easiest way given the current code we already have might be to write a "preprocess" function (that could be called at the start of the current dada2:::makeTaxonomyFasta_SilvaNR()) that inserts placeholders in the right places based on the taxonomy file.

I don't think it is necessary to make up new entries for each missing taxonomic rank. TBH I think just inserting "Unclassified" would be fine, but Clostridia_order or the like would also work.

mikemc commented 3 years ago

Yeah this is a relatively minor problem, since probably not all of those <6 semicolon entries are inconsistent with the assumption that missing entries are at lower taxonomic levels.

This table was generated by looking at the taxonomy strings of genera only, so each one with <6 semicolons should be missing something between domain and genus.

Still worth fixing though. The easiest way given the current code we already have might be to write a "preprocess" function (that could be called at the start of the current dada2:::makeTaxonomyFasta_SilvaNR()) that inserts placeholders in the right places based on the taxonomy file.

I don't think it is necessary to make up new entries for each missing taxonomic rank. TBH I think just inserting "Unclassified" would be fine, but Clostridia_order or the like would also work.

Yea I think either of these or "Unclassified_order" could work and not sure about how important it is to have more or less unique labels. The approach in the QIIME2 Silva files is to duplicate the containing higher-level rank name and prepending with the rank prefix,

AB186360.1.1438 d__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridia; f__Gracilibacteraceae; g__Lutispora; s__Lutispora_thermophila

This is similar to the mothur approach, of using the higher-level rank appended by a rank abbreviation, so it could be worth doing "Clostridia_or" or "Clostridia_order" just to be consistent with mothur and QIIME2.

The code used for the mothur processing is available in the section "Formatting the taxonomy files" here: https://mothur.org/blog/2021/SILVA-v138_1-reference-files/

mikemc commented 3 years ago

The "bad-taxa.csv" files in https://github.com/mikemc/dada2-reference-databases/tree/main/silva-138/v2 and https://github.com/mikemc/dada2-reference-databases/tree/main/silva-138.1/v1 list the problematic taxa for the second version of the 138 database and the 138.1 database

mikemc commented 3 years ago

A fix that requires minimal changing of makeTaxonomyFasta_SilvaNR() could be to insert the following at https://github.com/benjjneb/dada2/blob/45279a29e542a06561562d064c1241282a584e18/R/taxonomy.R#L542

# Get a list of valid names for each Silva rank
rnks <- c("domain", "phylum", "class", "order", "family", "genus", "organism")
silva.taxa$Name <- stringr::str_extract(silva.taxa$Taxon, "[^;]+(?=;$)")
names.split <- split(silva.taxa$Name, silva.taxa$Level)[rnks]

# Work way down the ranks; find taxa with names that don't belong at that
# rank, and shift all the names over by one for those taxa. 
# NOTE: Assumes that an invalid name is b/c from a lower rank
n_ranks <- 6
for (i in seq(n_ranks - 1)) {
  # Find the taxa with bad names at that rank; these should be taxa that are
  # missing that rank, and so the lower rank has been promoted
  bad.idx <- !(taxa.ba.mat[, i] %in% names.split[[i]]) &
    !is.na(taxa.ba.mat[, i])
  # Shift the ranks for those taxa over by one
  taxa.ba.mat[bad.idx, (i+1):n_ranks] <- taxa.ba.mat[bad.idx, i:(n_ranks - 1)]
  # and blank out the missing rank
  taxa.ba.mat[bad.idx, i] <- NA_character_ # or "Unclassified" or "Undefined"
}

# Check that all names are valid for the given rank
for (i in seq(n_ranks)) {
  stopifnot(
    all(taxa.ba.mat[, i] %in% names.split[[i]] | is.na(taxa.ba.mat[, i]))
  )
}

This runs very fast since it's just changing a small fraction of rows. I also experimented with the mothur approach, which manually builds up a matrix of all names in the correct positions. That method is much slower (extrapolating my test suggests 20-30m on the full set of taxa).

This code assumes that there are no cases of extra ranks (like the suborder YM from Silva 138). The YM suborder is getting manually dealt with, and there are no more non-standard ranks in the prokaryote tax in 138.1. And the final check should detect if any nonstandard ranks sneak into future versions (I think).

At least one fix is needed to the ADD SPECIES portion, to account for the fact that the organism/species name is not always in the 7th slot; instead, it is always in the last slot, which might be 6th for the affected taxa. I will attempt a fix later this week and make a pull request if it seems to work.

I'm using stringr here, which is a dependency of reshape2 so should be installed, but I think we can replace that step with base R string functions

mikemc commented 3 years ago

Any opinion on how we should be handling "Incertae Sedis" in the Silva taxonomy? This non-classification appears in 760 of the Bacterial NR99 sequences, often but not always at the terminal rank. E.g. as the class for the order DTU014: "Bacteria;Firmicutes;Incertae Sedis;DTU014;". Arguably it would make sense to replace it with NA like "uncultured". Then there are also cases like "Acidimicrobiales Incertae Sedis".

Edit: Silva 138.1 also has stuff like "Unknown Family" and "possible genus 02" that are getting passed on by dada2::makeTaxonomyFasta_SilvaNR(). I'm not sure how much editing of the official Silva taxonomy is appropriate/desirable.

benjjneb commented 3 years ago

That code seems to work in my hands, however, we can't use NA as the replacement dummy rank, as the NA's are all substituded out of the strings later in the function before writing out the training file: https://github.com/benjjneb/dada2/blob/45279a29e542a06561562d064c1241282a584e18/R/taxonomy.R#L593

My thought is the most "correct" replacement would be "Incertae Sedis", as that means it is not classified at that level and is already used widely by Silva. If we wanted to leave a signature that we had inserted the name ourselves to correct the Silva taxonomy string, it could just be a minor variation on that, e.g. "Incertae_Sedis". That would also allow the stopifnot statement to be easily modified to ignore those corrected entries when validating the updated taxa.ba.mat. I actually kind of like that, because it also allows us or anyone else to easily pull out the ranks at the end that needed correction with a simple grep.

If stringr is definitely not introducing any new dependencies, then it's fine to include, although I would slightly prefer a base replacement for that.

For a pull request, a bit more commentary in the form of inline comments on why this new code has been added would be good.

Edit: Silva 138.1 also has stuff like "Unknown Family" and "possible genus 02" that are getting passed on by dada2::makeTaxonomyFasta_SilvaNR(). I'm not sure how much editing of the official Silva taxonomy is appropriate/desirable.

We can't fix all the Silva oddities. If they include this kind of stuff in their official taxonomy entries, then I think we just pass it on.

I do think we should consider reporting these rank issues in the taxonomy strings to Silva though. It seems like 138 to 138.1 already made progress in reducing these issues, and if we can also provide a full list of taxonomy strings that still show this behavior it could be helpful to them.

mikemc commented 3 years ago

That code seems to work in my hands, however, we can't use NA as the replacement dummy rank, as the NA's are all substituded out of the strings later in the function before writing out the training file:

https://github.com/benjjneb/dada2/blob/45279a29e542a06561562d064c1241282a584e18/R/taxonomy.R#L593

My thought is the most "correct" replacement would be "Incertae Sedis", as that means it is not classified at that level and is already used widely by Silva. If we wanted to leave a signature that we had inserted the name ourselves to correct the Silva taxonomy string, it could just be a minor variation on that, e.g. "Incertae_Sedis". That would also allow the stopifnot statement to be easily modified to ignore those corrected entries when validating the updated taxa.ba.mat. I actually kind of like that, because it also allows us or anyone else to easily pull out the ranks at the end that needed correction with a simple grep.

Yea I figured NA wouldn't work, but I think we should be more obvious about which ranks we filled in than just adding an underscore in an already-used Silva name. At the moment I'm leaning towards "Missing family", "Missing order", etc since the term "[M|m]issing" doesn't appear in any Silva names and will be more immediately informative when it shows up in a plot or analysis to those not familiar with the latin phrase "Incertae sedis" (which I definitely had to google).

If stringr is definitely not introducing any new dependencies, then it's fine to include, although I would slightly prefer a base replacement for that.

For a pull request, a bit more commentary in the form of inline comments on why this new code has been added would be good.

I've become worried that the above is a hack that isn't guaranteed to work in all relevant cases and am leaning towards taking a ground-up approach of creating the Silva training files that addresses the problem and tests the final output more systematically, and trying to squeeze this into the existing DADA2 functions, dependencies, and coding style is too limiting. For now I will develop something in https://github.com/mikemc/dada2-reference-databases that I am convinced is robust and punt or leave it to you to modify the DADA2 functions. Perhaps if we could agree on some unit tests or other sort of final validation I would be more comfortable using just a minor change to the existing dada2 functions.

Edit: Silva 138.1 also has stuff like "Unknown Family" and "possible genus 02" that are getting passed on by dada2::makeTaxonomyFasta_SilvaNR(). I'm not sure how much editing of the official Silva taxonomy is appropriate/desirable.

We can't fix all the Silva oddities. If they include this kind of stuff in their official taxonomy entries, then I think we just pass it on.

This makes sense, but I would like to be consistent or at least to report what the strategy is. Right now, the dada2 code has a step for removing "Uncultured" and "uncultured", but not these other cases. Is that the policy you want to stick to?

I do think we should consider reporting these rank issues in the taxonomy strings to Silva though. It seems like 138 to 138.1 already made progress in reducing these issues, and if we can also provide a full list of taxonomy strings that still show this behavior it could be helpful to them.

Done

benjjneb commented 3 years ago

At the moment I'm leaning towards "Missing family", "Missing order", etc since the term "[M|m]issing" doesn't appear in any Silva names and will be more immediately informative when it shows up in a plot or analysis to those not familiar with the latin phrase "Incertae sedis" (which I definitely had to google).

That's fine as long as it hits the critiera of being understandable, and uniquely greppable in the ultimate training fasta.

For now I will develop something in https://github.com/mikemc/dada2-reference-databases that I am convinced is robust and punt or leave it to you to modify the DADA2 functions.

I don't mind having a dependency on some other package in this function as long as it isn't imported (i.e. isn't needed for package installation). This is a private function that exists to document how we are processing the Silva db, it is not intended for other people to use (or at least non-experts that couldn't easily install the additional package from Github).

However, are you planning to reimplement the entire construction of the Silva training fasta? Or just to fix this Silva rank issue in a way that can be easily integrated into the current function? If the first, that starting to duplicate code and brings up the issue of long-term maintenance.

This makes sense, but I would like to be consistent or at least to report what the strategy is. Right now, the dada2 code has a step for removing "Uncultured" and "uncultured", but not these other cases. Is that the policy you want to stick to?

Yes. Uncultured/uncultured is a special case that is added on to a few hundred Silva entries for no apparent informative purpose. It's an easy one. Anything harder than that I'd prefer leaving as is.

mikemc commented 3 years ago

I don't mind having a dependency on some other package in this function as long as it isn't imported (i.e. isn't needed for package installation). This is a private function that exists to document how we are processing the Silva db, it is not intended for other people to use (or at least non-experts that couldn't easily install the additional package from Github).

However, are you planning to reimplement the entire construction of the Silva training fasta? Or just to fix this Silva rank issue in a way that can be easily integrated into the current function? If the first, that starting to duplicate code and brings up the issue of long-term maintenance.

Let me clarify my thinking here. It seems to me that the current strategy of makeTaxonomyFasta_SilvaNR() is based on an incorrect assumption about the specifications of the FASTA headers in the SILVA_*_SSURef*_tax_silva.fasta.gz files, and is using these in a way that the Silva maintainers do not officially support. A hack such as what I suggest above might fix the problem, but is still a hack and is not guaranteed to be correct (for example, I think "Incertae Sedis" appears in multiple ranks and the above strategy doesn't account for a name appearing at multiple ranks taxonomy). Instead I would prefer to use an approach that is guaranteed to be correct given the nominal specifications of the files provided by Silva.

The approach I have developed is to start from the tree file provided by Silva for the taxonomy. The tips and internal nodes of the tree are labeled by Silva taxids, and should therefore be unique taxa, unlike names that could in principle appear multiple times. The metadata of the taxids are in the tax map file tax_slv_ssu_*.txt. So what I do is use the paths in the tree, get the names from the tax map file, remove the non-standard ranks, and fill in the missing ranks. Then I create the taxonomy strings for headers in the the DADA2 training file from this standardized taxonomy info.

This approach seems to me more robust and is also more flexible - it could be used for example to provide Eukaryotic training files that include the desired ranks). But it is quite different from the current DADA2 approach and requires loading the ape package and using the Silva tree. I also want to implement unit tests and/or functions that validate the final database. Essentially I am trading off short-term effort for the long-term benefit of avoiding other issues that might arise from using the FASTA headers in an undocumented way. Since this is a substantial change, it might make more sense for me to just maintain this on my own, and for you to implement whatever fix you want if you want to maintain your own within DADA2 (or you can adapt my code which I will MIT license)

benjjneb commented 3 years ago

As a note, in correspondence with the Silva maintainers, they indicated that almost all of the bacterial/archael taxonomies in which ranks are "missing" in the taxonomic string will be fixed in the next release.

nbokulich commented 3 years ago

Hey @benjjneb @mikemc , I was led to this issue via the QIIME 2 forum, and looks like I am late to the party. We have been working with the inconsistent taxonomy ranks in QIIME 2 for some time, and @mikerobeson and I have released a python package for formatting the SILVA taxonomy here: https://github.com/bokulich-lab/RESCRIPt

This uses the strategy that @mikemc proposes above:

The tips and internal nodes of the tree are labeled by Silva taxids, and should therefore be unique taxa, unlike names that could in principle appear multiple times. The metadata of the taxids are in the tax map file tax_slvssu*.txt. So what I do is use the paths in the tree, get the names from the tax map file, remove the non-standard ranks, and fill in the missing ranks.

The missing ranks are not the only issue when working with SILVA and taxonomy classifiers that expect even ranks — the eukaryote taxonomy has uneven numbers of ranks and this is another issue that is addressed by @mikerobeson 's reformatting methods.

We also release formatted FASTA and taxonomy files for each of these releases on the QIIME 2 website if you want to use these as source files for creating a dada2-reformatted database. Happy to provide more info if this helps save you some development time.

mikemc commented 3 years ago

@nbokulich Thanks for getting in touch about this. I noticed your work here but couldn't find a Silva 138.1 version (Silva 138.1 significantly reduces the problem, though doesn't eliminate it). E.g. https://docs.qiime2.org/2021.2/data-resources/#marker-gene-reference-databases only lists Sliva 138. Have/do you plan to provide updated versions?

mikerobeson commented 3 years ago

Hi @mikemc, @benjjneb! Both @nbokulich and I have written a nice convenience method within our QIIME 2 plugin called get-silva-data, that will automatically download all of the files for several versions of SILVA and prepare everything for you. From there you can proceed with your own curation either within QIIME 2 or externally if you wish to export.

As of now, get-silva-data supports the following versions of SSU and LSU: 128, 132, 138, 138.1.

We also leave the door open to parse all the other versions, but these may require some additional user parsing. Here you'd simply use parse-silva-taxonomy, and other steps. The tutorials provide examples on how to use these manual and automated approaches.

The default for get-silva-data is currently 138.1, though this was not ready / used at the time QIIME 2 2021.2 or 2021.4 was released (according to the provenance shown in QIIME 2 View). But if you download the latest version of RESCRIPt (using QIIME 2 2021.4), you can follow this tutorial and make your own SILVA 138.1 db .

I'd be happy to chat further!

mikemc commented 3 years ago

@mikerobeson thanks for this very helpful info. @benjjneb what do you think about standardizing on the RESCRIPt-processed Silva files as the precursors for the DADA2-formatted files? I'm imagining this would involve adding yet another set of internal helpers to DADA2 that would process the RESCRIPt output, but the required processing should be fairly minimal.