cfrancoeur commented 1 year ago

Hi,

I was wondering if there was going to be a new DADA2-formatted GreenGenes database based on the new release (https://forum.qiime2.org/t/introducing-greengenes2-2022-10/25291)? I was looking through the raw files (http://ftp.microbio.me/greengenes_release/current/) and wasn't sure which one would be appropriate for assigning taxonomy to the V4 region in DADA2.

Thank you, Charlotte

benjjneb commented 1 year ago

I just learned there is a new GreenGenes release! Briefly skimming, what is needed is formatting the full-length section of the new GG database (i.e. without the millions of short inserted V4 fragments) into the DADA2 fasta reference format.

We had soft-deprecated GG here for seemingly being abandonware, but probably we want to again support an official version of this. That said, we are open to community support here, and would be happy to elevate a reformatted database that works with assignTaxonomy provided the reformatting code is publicly available, and the database passes our internal testing.

cfrancoeur commented 1 year ago

That makes sense! I was able to export the full length sequences and the taxonomy from the .qza format to regular FASTA/.tsv files (using the qiime tools export command). I'm going to try to replace the barcode IDs in the FASTA file with the taxonomic information in the taxonomy.tsv file and that would be in the DADA2 format. I'll admit though, I'm not the best at coding and this will take me a while to figure out, so if this is some thing you would be interested in doing and would do it soon, let me know and I would be happy to give you the converted files, if that's helpful.

cfrancoeur commented 1 year ago

Actually using this command from seqkit, I think I got it (https://www.biostars.org/p/494900/): seqkit replace -p "(.+)" -r '{kv}' -k taxonomy.tsv dna-sequences.fasta Here is the folder with three items for you to double check (if you have time): -dada2.greengenes.fna = the output file from the seqkit command -dna-sequences.fasta = qiime exported 2022.10.backbone.full-length.fna.qza -taxonomy.tsv = qiime exported 2022.10.backbone.tax.qza

Please let me know if I did something wrong or if this is entirely the wrong direction. Thanks!!

Edit: I realized that in your formatted training files there are no spaces or class information, so I still have to figure out how to remove those. But assignTaxonomy did work with this file (and seems to largely check out with the genera I got with the silva database).

DrJButcher commented 1 year ago

Hi all, I'm also interested in using the updated GreenGenes2 database for use in dada2.

I've written some R code (below) that attempts to download/format the raw files into the correct format using a single workflow. I'm not sure about a few things:

Can/should non-ACTG bases be included in the database? I believe that they will be ignored for assignTaxonomy, but I'm not sure if the same is true for an addSpecies database
I created a potential addSpecies databases from greengenes as well. However, many of the listed sequences are not defined down to the genus or species level, so I only included those that did have this level of annotation. Would this be correct or should the addSpecies database have all the sequences and dummy level variables for the genus/species?

`#required for creating the id line for each entry library(data.table)

reading/writing fasta

library(Biostrings)

downloading qza files

library(curl)

optional: compressing the resulting fasta files

library(R.utils)

optional: converting windows line endings to unix

library(TAF)

set the name of the sequence file

gg2.seq.qza="2022.10.backbone.full-length.fna.qza"

download the qza file using curl_download

curl_download("http://ftp.microbio.me/greengenes_release/current/2022.10.backbone.full-length.fna.qza", gg2.seq.qza)

qza flies are simply zip folders, get the contents using unzip

gg2.seq.qza.contents=unzip(gg2.seq.qza, list = TRUE)

get the path to the fna file in the qza artifact

gg2.seq.file=gg2.seq.qza.contents$Name[grepl(".fasta",gg2.seq.qza.contents$Name)]

import the fna sequences as a dnastring set

gg2.seqs=readDNAStringSet(unzip(gg2.seq.qza,files = gg2.seq.file))

get the number of different bases present in each sequence

gg2.seqs.baseFreq=alphabetFrequency(gg2.seqs)

there are some sequences with non-standard bases present, not sure if these should be removed from the assignTaxonomy or addSpecies files

gg2.seqs.presence=apply(gg2.seqs.baseFreq,2,sum) gg2.non.standard.bases=names(gg2.seqs.presence)[!names(gg2.seqs.presence)%in% c("A","C","G","T") & gg2.seqs.presence>0]

set the name of the taxonomy file

gg2.tax.qza="2022.10.backbone.tax.qza"

download the qza file

curl_download("http://ftp.microbio.me/greengenes_release/current/2022.10.backbone.tax.qza", gg2.tax.qza)

get the contents and path to taxonomy file as above

gg2.tax.qza.contents=unzip(gg2.tax.qza, list = TRUE) gg2.tax.file=gg2.tax.qza.contents$Name[grepl("taxonomy.tsv",gg2.tax.qza.contents$Name)]

import the file as a data.table

gg2.tax.dt=fread(unzip(gg2.tax.qza,files = gg2.tax.file),check.names = TRUE)

convert the sequences into a data.table to merge with the taxonomy information

gg2.seq.dt=data.table(Feature.ID=names(gg2.seqs),sequences=as.character(gg2.seqs))

merge the two data.tables together

gg2.dt=gg2.tax.dt[gg2.seq.dt,on=.(Feature.ID)]

the greengenes taxonomy uses the standard 7 level scheme; here I put Domain instead of Kingdom to better match their d__ designation

gg2.tax.levels=c("Domain","Phylum","Class","Order","Family","Genus","Species")

split up the Taxon string on semicolons to get each level as its own column

gg2.dt[,c(gg2.tax.levels):=tstrsplit(Taxon,split="; ")]

remove the leading x__ from the greengenes taxonomy levels

this results in those sequences which are undefined at a level having an empty string for these entries

gg2.dt[,c(gg2.tax.levels):=lapply(.SD,function(x) sub("^[dpcofgs]__","",x)),.SDcols=gg2.tax.levels]

create the taxonomy string for dada2 assignTaxonomy; this is the first 6 levels separated by semi-colons

gg2.dt[,assignTaxonomy.id:=paste(Domain,Phylum,Class,Order,Family,Genus,sep=";")]

sequences that are not defined to the Genus level will have empty entries and thus ";;" or ";;;" at the end of the taxonomy string

gg2.dt[grepl(";;;",assignTaxonomy.id),-c("sequences")]

these can be removed using gsub and anchoring the regex to the end of the string

gg2.dt[,assignTaxonomy.id:=gsub(";*$","",assignTaxonomy.id)]

note that there is one sequence that is missing part of their middle taxonomy

gg2.dt[grepl(";;",assignTaxonomy.id),]

this sequence appears to be from NCBI; but also has a different taxonomy listed as what is currently listed in greengenes2

https://www.ncbi.nlm.nih.gov/nuccore/AY327251

greengenes2 now saves the binomial genus/species designation in the species column so we only need to paste together feature.id species column

however, not all sequences have a defined genus and species designation, for those sequences, put a value of NA in the addSpecies.id column

gg2.dt[,addSpecies.id:=paste(Feature.ID,Species)] gg2.dt[Genus=="",addSpecies.id:=NA] gg2.dt[Species=="",addSpecies.id:=NA]

create a data.table containing the sequences/ids for assignTaxonomy; here I removed the sequence with the missing inner taxonomy

gg2.assignTaxonomy=gg2.dt[!grepl(";;",assignTaxonomy.id),.(sequences,assignTaxonomy.id)]

use the columns in this data.table to generate a DNAStringset that can be written as fasta

gg2.assignTaxonomy=DNAStringSet(x=setNames(gg2.assignTaxonomy[,sequences],nm=gg2.assignTaxonomy[,assignTaxonomy.id])) gg2.assignTaxonomy

write out stringset to be used for assignTaxonomy

writeXStringSet(gg2.assignTaxonomy,"2022.10.backbone.full-length_assignTaxonomyDADA2.fna")

optional: convert windows style line endings to unix; requires TAF

TAF::dos2unix("2022.10.backbone.full-length_assignTaxonomyDADA2.fna")

compress the output

gzip("2022.10.backbone.full-length_assignTaxonomyDADA2.fna")

create a data.table for addSpecies, here I excluded taxa that do not have a bimodial genus/species designation

I also excluded the same taxa from above that was missing the middle taxonomy

gg2.addSpecies=gg2.dt[!is.na(addSpecies.id) & !grepl(";;",assignTaxonomy.id),.(sequences,addSpecies.id)]

use the columns in this data.table to generate a DNAStringset that can be written as fasta

gg2.addSpecies=DNAStringSet(x=setNames(gg2.addSpecies[,sequences],nm=gg2.addSpecies[,addSpecies.id])) gg2.addSpecies

write out the stringset

writeXStringSet(gg2.addSpecies,"2022.10.backbone.full-length_addSpeciesDADA2.fna")

optional: convert windows style line endings to unix; requires TAF

TAF::dos2unix("2022.10.backbone.full-length_addSpeciesDADA2.fna")

compress the output

gzip("2022.10.backbone.full-length_addSpeciesDADA2.fna")

clean up the extracted files; leave the qza files intact

unlink(unique(dirname(c(gg2.seq.qza.contents$Name,gg2.tax.qza.contents$Name))),recursive = TRUE) `

greengenes2_dada2_formatting_230130.txt

asierFernandezP commented 1 year ago

Hi,

Is there any progress on implementing this? I would also be interested in using Greengenes2 in DADA2.

Asier

Dx-wmc commented 1 year ago

I noticed that the paper by greengenes2 has been published in nature biotechnology (https://www.nature.com/articles/s41587-023-01845-1). It seems that supporting this database with data2 makes more sense. I hope to see data2 support greengenes2 as soon as possible.

benjjneb commented 11 months ago

Here is a script for re-formatting Greengenes2 for use with DADA2 assignTaxonomy. Should just run:

library(Biostrings)
library(ShortRead)
library(dada2)

setwd("~/Desktop/GG2/final") ### CHANGE ME...

download.file("http://ftp.microbio.me/greengenes_release/current/2022.10.backbone.full-length.fna.qza", 
              "2022.10.backbone.full-length.fna.qza")
download.file("http://ftp.microbio.me/greengenes_release/current/2022.10.backbone.tax.qza",
              "2022.10.backbone.tax.qza")
unzip("2022.10.backbone.full-length.fna.qza")
unzip("2022.10.backbone.tax.qza")
fn <- "a53d9300-5c5c-4774-a2e8-a5e23904f1ae/data/dna-sequences.fasta"
txfn <- "c16a953c-f24d-4d14-927c-40d90ced395e/data/taxonomy.tsv"

# tx is 2 column, id and taxonomy, taxonomy is 7 level, domain -- species

sq <- getSequences(fn)
tdf <- read.csv(txfn, sep="\t", header=TRUE)
tax <- tdf[,2]
names(tax) <- tdf[,1]

identical(names(sq), names(tax)) # TRUE

# Parse the taxonomies from the id string
taxes <- strsplit(tax, "; ")
tax.depth <- sapply(taxes, length)
table(tax.depth) 
# All taxonomices are 7-level
# Note: A significant number of unassigned taxonomic levels here, which are encoded as e.g. "g__"
# Note: Species has genus name duplicated, i.e. species level has genus [SPACE] species binomial, rather than just species

# Remove the genus from the species assignment
for(i in seq(length(taxes))) {
  gen <- taxes[[i]][[6]]
  gen <- substr(gen, 4, nchar(gen))
  taxes[[i]][[7]] <- gsub(gen, "", taxes[[i]][[7]])
  taxes[[i]][[7]] <- gsub("__ ", "__", taxes[[i]][[7]])
}
# Identify unassigned taxonomic levels
tax_pre <- c("d__", "p__", "c__", "o__", "f__", "g__", "s__")
is.unassigned <- sapply(taxes, function(tx) {
  tx == tax_pre
}) |> t()
# Note: A relatively small number of entries have lower taxonomic levels assigned, even though higher taxonomic levels aren't assigned
# e.g. `MJ030-2-barcode58-umi83452bins-ubs-6`, which is assigned s__Spirochaeta aurantia, but "o__;f__;g__" for order/family/genus designation
# These assignments will be dropped in accordance with `assignTaxonmy` expectations
tax.depth <- apply(is.unassigned, 1, function(isu) { min(which(isu)-1L, 7L) })
tax.ids <- sapply(seq_along(taxes), function(i) {
  td <- tax.depth[[i]]
  id.str <- paste(taxes[[i]][1:td], collapse=";")
  id.str <- paste0(id.str, ";") # Add terminal semicolon
  id.str
})
names(tax.ids) <- names(taxes)
# Write out the training fasta file
sq.out <- sq
names(sq.out) <- tax.ids
writeFasta(sq.out, "greengenes2_trainset.fa.gz", compress=TRUE)
# Sanity test the results of `assignTaxonomy` on this newly created training
dada2:::tax.check("greengenes2_trainset.fa.gz", fn.test=system.file("extdata", "ten_16s.100.fa.gz", package="dada2"))
# Taxonomic assignments loook OK
# Good to go

Any feedback welcome before we finalize the processing and put the training fasta file up "officially" on Zenodo.

DrJButcher commented 11 months ago

@benjjneb Thanks for posting the code for formatting the updated gg2 database. I took a look at the output and there are a few potential issues with the species column due to the binomial naming that gg2 uses. There are inconstancies in how the genus is propagated down to the species binomial that are currently preventing the sub in the for loop to properly remove the genus in those cases. I've included some code below highlighting the issue and a proposed (somewhat hacky) fix (starting after taxes loop)

`#look for entries that still have a space in their name species.with.spaces=sapply(taxes,function(x) grepl(" ",x[7]))

there are a substantial number

sum(species.with.spaces) head(taxes[species.with.spaces])

many of these are due to discrepancies between how the genus id has been propagated down to the genus/species binomial

some do not have have the trailing letter/numeric present and are thus not dealt with in the original taxes loop

species.with.spaces.char=unique(sapply(taxes[species.with.spaces],function(x) paste(x[6],x[7]))) species.with.spaces.char=species.with.spaces.char[order(species.with.spaces.char)] species.with.spaces.char

proposed changes

taxes.alternate=strsplit(tax, "; ") for(i in seq(length(taxes.alternate))) { gen <- taxes.alternate[[i]][[6]] gen <- substr(gen, 4, nchar(gen))

two step sub to get at "base" genus;

#first remove the potential trailing numeric value, this appears to range from 3 to 6 characters
gen.reduce <- sub("_[0-9]{3,6}$","",gen)
#then remove the trailing letter designation; this is primarily a single upper case letter but some entries have two letters
gen.reduce <- sub("_[A-Z]{1,2}$","",gen.reduce)
taxes.alternate[[i]][[7]] <- gsub(gen, "", taxes.alternate[[i]][[7]])
#if gens is not empty, then do the second sub remove the potential misnamed genus
#some of the misnamed genera have the letter addendum and others do not; the regex for replacement includes the letter as an optional pattern
#the if test is a safety; otherwise the regex will remove the first letter of the listed binomial genus/species; not sure if needed given that we remove these in subsequent steps
if(gen!=""){taxes.alternate[[i]][[7]] <- gsub(paste0(gen.reduce,"(_[A-Z]{1,2})?"), "", taxes.alternate[[i]][[7]])}
taxes.alternate[[i]][[7]] <- gsub("__ ", "__", taxes.alternate[[i]][[7]])

}

check for remaining issues:

binomial.check=sapply(taxes.alternate,function(x) grepl(" ",x[7]))

there are still a few malformed entries

sum(binomial.check) taxes.malformed.binomial=taxes.alternate[binomial.check]

these seem to be issues where there is a mismatch between the listed genus in the genus column and the genus in the binomial species column

species.with.spaces.char.update=unique(sapply(taxes.malformed.binomial,function(x) paste(x[6],x[7]))) species.with.spaces.char.update=species.with.spaces.char.update[order(species.with.spaces.char.update)] species.with.spaces.char.update

test to confirm this:

create a list of vectors for each entry consisting of the listed genus and the binomial species split up

species.with.spaces.char.list=lapply(taxes.malformed.binomial, function(x) c(sub("^[gs]__","", c(x[6],unlist(strsplit(x[7]," "))))))

identify mismatched genus designations by grepping the genus column with the binomial genus

species.with.spaces.char.list.logi=sapply(species.with.spaces.char.list,function(x) grepl(x[[2]],x[[1]]))

mismatched binomials will be false; so pull these out

mismatched.binomials=unique(sapply(taxes.malformed.binomial[!species.with.spaces.char.list.logi],function(x) paste(x[6],x[7]))) mismatched.binomials=mismatched.binomials[order(mismatched.binomials)] mismatched.binomials

there does not appear to by any malformed binomials (ie genus/binomial genus match and there is still a space in the species column)

malformed.binomials=unique(sapply(taxes.malformed.binomial[species.with.spaces.char.list.logi],function(x) paste(x[6],x[7]))) malformed.binomials=malformed.binomials[order(malformed.binomials)] malformed.binomials ` I'm unsure about how to deal with the mismatched binomials; they could simply be removed but there are several species present due to ongoing taxonomic arguments (e.g. Mycoplasma vs Mycoplasmoides https://www.microbiologyresearch.org/content/journal/ijsem/10.1099/ijsem.0.003632) that may be important for particular researchers. They could alternatively be kept "as is", to highlight the discrepancy.

benjjneb commented 11 months ago

Thanks @DrJButcher that's super helpful feedback and code!

A follow-up question: Would it be better just to replace the white-space in the species assignments with underscores? I.e. keeping the genus-species binomial as the species-level taxonomic assignment.

The benefit is that this should keep the re-formatting code simpler and so more maintainable if new GG2 releases happen. The downside is that this goes against the way other officially supported assignTaxonomy references encode the species name, and if we want to also produce a reference file for assignSpecies it's something that will need to be dealt with anyway.

DrJButcher commented 11 months ago

I have no issue with simply replacing the space with an underscore. It would make the formatting easier to do and more maintainable.

The format would be different from how the other references are formatted which could be problematic for newer users. However, it is likely that users will need to do some manual curation/editing of the GG2 taxonomies no matter what due to how they are recorded to make sense of the data (there are >20 different versions of Firmicutes_X_NNNNNN alone).

asierFernandezP commented 11 months ago

Thanks for the code to format GreenGenes2 DB to be used in DADA2!

I see than Greengenes2 DB contains now 331,269 sequences (201,438 with species level assignment), which is still lower than the number of sequences in the latest release of the Silva 138.1 prokaryotic SSU taxonomic training dataset (452,064 sequences, with 359,688 assigned to species level). Would you expect an improvement in the taxonomic classification in DADA2 by using Greengenes2 instead of SILVA? Would you recommend to combine both databases (as both contain full 16S sequences)? In case of combining them, apart from the unified formatting of the sequence IDs, would any preprocessing step be needed before merging the GreenGenes DB to the silva_nr99_v138.1_train_set.fa file, and use it for assignTaxonomy() function?

Thank you!

benjjneb commented 11 months ago

Would you recommend to combine both databases (as both contain full 16S sequences)?

No, they don't use the same taxonomy so combining them is not a good idea.

strejcem commented 11 months ago

Hi all, thanks all for the discussion! (EDIT: not relevant anymore)I see the coding effort focusing on removing the 'unique' genus/other levels identifiers. From my perspective, I find these helpful and removing them would be a step back. There is a phylogenetic reason for those. Also GG2 is based on genome backbone. One of the main selling point of GG2, at least for me, is that they link 16S sequences and MAGs together. Trying to make sense of ASV data classified with SILVA and compare them to MAGs classified by GTDBtk is a nightmare.

The issue with mismatch of _g_ and s genus label is more annoying and frankly I think it is on GG2 team to solve it. As I don't know the reason for these mismatches, for my purposes, I extracted the species part of the _s_ field and combine it with the g label to create a fasta for assignSpecies(). If I am committing a taxonomy crime, I am sorry 🤷

benjjneb commented 11 months ago

@strejcem Can you clarify what you mean here?

I see the coding effort focusing on removing the 'unique' genus/other levels identifiers.

strejcem commented 11 months ago

Pardon me, I must have misread the code, I am probably too tired. I edited the previous comment as not relevant anymore.

I see what @DrJButcher suggest is basically the same approach as what I did: to remove the genus names from the Species column that match to the Genus column. This solves the issue of missing GTDB (_[A-Z]+) or GG2 (_[0-9]+) monophyletic identifiers in Species column but leaves the mismatching genus names in Genus and Species columns.

I think just extracting the Species names from Species column is good enough? I.e. regex to erase everything before the space. My approach differs here slightly for addSpecies as I then just stich together genus from Genus column with the extracted Species.

Also, @DrJButcher 's code have few typos due to the markdown editing. Most _ are missing as they translate to italic in markdown.

If anyone is interested in my code using tidyverse:

library(tidyverse)
library(Biostrings)

download.file(
  "http://ftp.microbio.me/greengenes_release/current/2022.10.backbone.full-length.fna.qza",
  "2022.10.backbone.full-length.fna.qza"
)
download.file(
  "http://ftp.microbio.me/greengenes_release/current/2022.10.backbone.tax.qza",
  "2022.10.backbone.tax.qza"
)

unzip("2022.10.backbone.full-length.fna.qza")
unzip("2022.10.backbone.tax.qza")

tax <-
  read_tsv("c16a953c-f24d-4d14-927c-40d90ced395e/data/taxonomy.tsv") %>%
  separate_wider_delim(
    Taxon,
    delim = "; ",
    names = c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"
    )
  ) 

# format columns for assignTaxonomy and assignSpecies
formated_tax <- tax %>% 
  mutate(
    # replace unassigned Species with NA
    Species = str_replace(Species, "^s__$", NA_character_),
    # get rid of genus name in species column
    Species = str_replace_all(Species, "^.* ", "s__"),
    # construct dada2 appropriate headers, ignoring unassigned Species
    assignSpecies = case_when(!is.na(Species) ~ str_c(`Feature ID`, Genus, Species, sep = " "))
  ) %>%
  # prepare headers for assignTaxonomy
  unite("assignTaxonomy",
        -c(`Feature ID`, assignSpecies),
        sep = ";",
        na.rm = TRUE) %>%
  mutate(
    # add trailing semicolon
    assignTaxonomy = str_c(assignTaxonomy, ";"),
    # remove all unclassified ranks and their lower variants
    assignTaxonomy = str_remove(assignTaxonomy, "[pcofg]__;.*")
    )

fna_train <-
  readDNAStringSet("a53d9300-5c5c-4774-a2e8-a5e23904f1ae/data/dna-sequences.fasta")

# prepare version for assignSpecies
fna_species <- fna_train

#map new headers to sequence names
fna_tax <- tibble(`Feature ID` = names(fna_train)) %>%
  left_join(formated_tax)

# rename
names(fna_train) <- fna_tax$assignTaxonomy
names(fna_species) <- fna_tax$assignSpecies
# get rid of the unassigned Species names
fna_species <- fna_species[!is.na(names(fna_species))]

fna_train %>%
  writeXStringSet("2022.10.backbone.full-length.dada2.train.fna.gz",
                  compress = TRUE)
fna_species %>%
  writeXStringSet("2022.10.backbone.full-length.dada2.species.fna.gz",
                  compress = TRUE)

# test taxonomy
dada2:::tax.check(
  "2022.10.backbone.full-length.dada2.train.fna.gz",
  fn.test = system.file("extdata", "ten_16s.100.fa.gz", package = "dada2")
)

chassenr commented 10 months ago

Hi @benjjneb and others,

I only found this issue after I already implemented by own solution using seqkit (happy to share if there is interest, but probably redundant with what has already been posted here). I was wondering if it is really necessary to remove the empty taxon prefixes in the assignTaxonomy reference? Even if they are treated like a 'real' taxon, there would still be the possibility to remove them afterwards. I would also argue that having an ASV assigned to an empty taxon prefix provides different information than just getting NA, i.e. sequence can be reliably identified but taxon is not known vs. sequence cannot be identified. What do you think?

I would also like to point out that there are some taxonomic paths, which are incomplete in the middle, e.g. dBacteria; pFirmicutes_B_370518; c; o; f; gCryptanaerobacter; s__Cryptanaerobacter phenolicus. Removing all levels after the first empty taxon prefix would remove actual taxonomic information.

Cheers, Christiane

AnishMalde commented 6 months ago

Hi all. Wondering if there is going to be an "official" final version of the script to use GG2 with Dada2 on Github/Zenodo? Or should I follow the one posted by @strejcem? Also, I used the 515f/806r primers to amplify my samples, so is this the best taxonomy file to use for that data: "2022.10.backbone.full-length.dada2.train.fna.gz"? Thank you.

benjjneb commented 6 months ago

@AnishMalde Thanks for the bump. We will do an "official" version, and I should get to it by the end of the month.

mattHay2 commented 5 months ago

Hi @benjjneb I was wondering if there was any update on the DB? Looking forward to combining it with your epic pipeline :-)

steffie1985 commented 2 months ago

@benjjneb any update on this?

steffie1985 commented 2 months ago

@chassenr may I have a look at the script you mentioned, please?

marianamnoriega commented 2 months ago

@steffie1985 Here are my notes from setting up the database. Since github enforcing 2FA effectively blocked me from using github, I am using a colleagues account to send you this message:

################
#  Greengenes  #
################

### major update (greengenes2) 2022.10:

Download on 10.10.2023.

  cd /bio/Databases/Greengenes/2022.10
  wget http://ftp.microbio.me/greengenes_release/2022.10/00README
  wget http://ftp.microbio.me/greengenes_release/2022.10/2022.10.seqs.fna.gz
  wget http://ftp.microbio.me/greengenes_release/2022.10/2022.10.taxonomy.id.tsv.gz
  wget http://ftp.microbio.me/greengenes_release/2022.10/2022.10.backbone.full-length.fna.qza
  gunzip 2022.10.taxonomy.id.tsv.gz

Extract backbone sequences from qza object:

  conda activate qiime2-2021.2
  qiime tools export --input-path 2022.10.backbone.full-length.fna.qza --output-path 2022.10.backbone.full_length
  mv 2022.10.backbone.full_length/dna-sequences.fasta 2022.10.backbone.full-length.fna
  rmdir 2022.10.backbone.full_length
  # check that all sequences are included in the taxonomy file
  grep -c '^>' 2022.10.backbone.full-length.fna
  grep '^>' 2022.10.backbone.full-length.fna | sed 's/^>//' | grep -c -F -w -f - 2022.10.taxonomy.id.tsv
  # yes
  grep '^>' 2022.10.backbone.full-length.fna | sed 's/^>//' | grep -F -w -f - 2022.10.taxonomy.id.tsv > 2022.10.backbone.full-length.tax

# dada2:

Formatting backbone fasta file for use with assignTaxonomy() in dada2 (only down to genus level):

  cd /bio/Databases/Greengenes/2022.10/dada2
  cut -f1,2 ../2022.10.backbone.full-length.tax | sed -e 's/; /;/g' -e 's/;s__.*$/;/' > kv.txt
  conda activate seqkit-2.3.1
  seqkit replace -p '^(.+)$' -r '{kv}' -k kv.txt ../2022.10.backbone.full-length.fna > 2022.10_train.fna
  rm kv.txt

Most taxonomic paths are consistent, either complete or only truncated.
However, there are some genera within p__Firmicutes_B_370518 where the taxonomic path is incomplete in the middle levels.
Also, missing levels are not provided as 'unclassified' or with the last classified level, but only with the rank prefix without anything else
I am not modifying the taxonomy at the moment.
If the taxonomic classification returns an empty rank prefix, it means that there is a similarity to the database, but that the reference was not classified.
If the taxonomic classification returns NA, it means that there is no consensus (i.e. similarity) in the datbase.

Formatting backbone fasta file for use with assignSpecies() in dada2 (only sequences with species assignments):

  cd /bio/Databases/Greengenes/2022.10/dada2
  cut -f1,2 ../2022.10.backbone.full-length.tax | grep "; s__." | sed -e 's/\t.*; s__/\t/' > kv.txt
  conda activate seqkit-2.3.1
  seqkit grep -n -f <(cut -f1 kv.txt) ../2022.10.backbone.full-length.fna > tmp.fna
  seqkit replace -p '^(.+)$' -r '$1 {kv}' -k kv.txt tmp.fna > 2022.10_species.fna
  rm tmp.fna kv.txt

# blast:

Formatting blast database from backbone fasta file:

  cd /bio/Databases/Greengenes/2022.10/blast
  conda activate blast-2.13.0
  makeblastdb -in ../2022.10.backbone.full-length.fna -dbtype nucl -out 2022.10.backbone.full_length_blastdb -logfile 2022.10.backbone.full_length_blastdb.log

You will still be required to use the tax table as lookup for the sseqid in the blast output. taxids are currently not supported.

Best, Christiane (chassenr)

benjjneb / dada2

Updated GreenGenes database formatting for DADA2 #1680

reading/writing fasta

downloading qza files

optional: compressing the resulting fasta files

optional: converting windows line endings to unix

library(TAF)

set the name of the sequence file

download the qza file using curl_download

qza flies are simply zip folders, get the contents using unzip

get the path to the fna file in the qza artifact

import the fna sequences as a dnastring set

get the number of different bases present in each sequence

there are some sequences with non-standard bases present, not sure if these should be removed from the assignTaxonomy or addSpecies files

set the name of the taxonomy file

download the qza file

get the contents and path to taxonomy file as above

import the file as a data.table

convert the sequences into a data.table to merge with the taxonomy information

merge the two data.tables together

the greengenes taxonomy uses the standard 7 level scheme; here I put Domain instead of Kingdom to better match their d__ designation

split up the Taxon string on semicolons to get each level as its own column

remove the leading x__ from the greengenes taxonomy levels

this results in those sequences which are undefined at a level having an empty string for these entries

create the taxonomy string for dada2 assignTaxonomy; this is the first 6 levels separated by semi-colons

sequences that are not defined to the Genus level will have empty entries and thus ";;" or ";;;" at the end of the taxonomy string

these can be removed using gsub and anchoring the regex to the end of the string

note that there is one sequence that is missing part of their middle taxonomy

this sequence appears to be from NCBI; but also has a different taxonomy listed as what is currently listed in greengenes2

https://www.ncbi.nlm.nih.gov/nuccore/AY327251

greengenes2 now saves the binomial genus/species designation in the species column so we only need to paste together feature.id species column

however, not all sequences have a defined genus and species designation, for those sequences, put a value of NA in the addSpecies.id column

create a data.table containing the sequences/ids for assignTaxonomy; here I removed the sequence with the missing inner taxonomy

use the columns in this data.table to generate a DNAStringset that can be written as fasta

write out stringset to be used for assignTaxonomy

optional: convert windows style line endings to unix; requires TAF

compress the output

create a data.table for addSpecies, here I excluded taxa that do not have a bimodial genus/species designation

I also excluded the same taxa from above that was missing the middle taxonomy

use the columns in this data.table to generate a DNAStringset that can be written as fasta

write out the stringset

optional: convert windows style line endings to unix; requires TAF

compress the output

clean up the extracted files; leave the qza files intact

there are a substantial number

many of these are due to discrepancies between how the genus id has been propagated down to the genus/species binomial

some do not have have the trailing letter/numeric present and are thus not dealt with in the original taxes loop

proposed changes

two step sub to get at "base" genus;

check for remaining issues:

there are still a few malformed entries

these seem to be issues where there is a mismatch between the listed genus in the genus column and the genus in the binomial species column

test to confirm this:

create a list of vectors for each entry consisting of the listed genus and the binomial species split up

identify mismatched genus designations by grepping the genus column with the binomial genus

mismatched binomials will be false; so pull these out

there does not appear to by any malformed binomials (ie genus/binomial genus match and there is still a space in the species column)