HRGV / phyloFlash

phyloFlash - A pipeline to rapidly reconstruct the SSU rRNAs and explore phylogenetic composition of an illumina (meta)genomic dataset.
GNU General Public License v3.0
77 stars 25 forks source link

Silva taxonomic rank assignment #127

Closed chassenr closed 2 years ago

chassenr commented 4 years ago

Hi @kbseah and @HRGV , I have been discussing the phyloFlash output that Clarissa is working with for her PhD. While I have not worked with phyloFlash in a while (and may therefore be unaware of recent changes and new functionality, I apologize in advance for that), I noticed that the taxonomic paths that she got as output were formatted to (I) repeat the last available classified level regardless of rank, and (II) truncated after 7 levels. Based on your comments in the code I assume this is intentional. I would have a suggestion to resolve the issue of long (i.e. eukaryotic) or incomplete taxonomic path differently, as with the current solution, taxonomic ranks at the same level are not always comparable. SILVA provides a taxon-to-rank map that can be used to assign the correct rank (domain to species) to each possible taxonomic path (including truncated ones). I implemented such a routine rudimentarily in R for prokaryotes (but I also have a newer version somewhere else that works equally well with euks). If there is interest on your side to do something similar for phyloFlash, I am happy to share this approach.

Cheers, Christiane

kbseah commented 4 years ago

Thanks Christiane for pointing this out to us and sharing your code, it is a great idea! We'll look into implementing this for the next phyloFlash release. There is certainly demand for such a solution based on previous feedback (issue #124 ) We'll keep you updated

chassenr commented 4 years ago

Hi @kbseah , just FYI this function includes a newer version of the code to assign tax levels. Sorry, that I did not send this link right away.

Cheers, Christiane

kbseah commented 4 years ago

thanks Christiane!

ckeuschn commented 3 years ago

Hi, I am jumping on this since I am analyzing metatranscriptimic data at the moment. I 've run phyloFlash at -taxlevel 20 to get all possible ranks for eukaryotes. I created a count table for all my samples and now would need a corresponding taxonomy table giving the ranks of all taxa. Is there any workaround you can point me too? Thanks for your help! Cheers Christoph

chassenr commented 3 years ago

Hi @ckeuschn , theoretically, if you get the all possible ranks in your output, you should be able to apply the same workaround that I am using (see this function in combination with the SILVA mapping file). Or you can write your own workaround based on the mapping file that gives you the correct taxonomic rank for each possible path in SILVA.

Cheers, Christiane

ckeuschn commented 3 years ago

Hi Christiane Thanks for that, could you send me and example of how your otu and tax tables are structured exactly, I think this is were I got stuck using your function. Cheers Christoph

chassenr commented 3 years ago

You will only need the taxonomic path as input. This you split into a list (as a base R fan using strsplit), e.g.:

[[1]]
[1] "Eukaryota"             "Amorphea"              "Obazoa"                "Opisthokonta"          "Nucletmycea"           "Fungi"                
[7] "Zoopagomycota"         "Entomophthoromycotina" "Entomophthoromycetes" 

[[2]]
 [1] "Eukaryota"          "Archaeplastida"     "Chloroplastida"     "Charophyta"         "Phragmoplastophyta" "Streptophyta"       "Charophyceae"      
 [8] "Charales"           "Nitellaceae"        "Tolypella"         

The second input the mapping file in the form (the first row is the header):

path    node    rank
Archaea Archaea domain
Archaea;Aenigmarchaeota Aenigmarchaeota phylum
Archaea;Aenigmarchaeota;Aenigmarchaeota Incertae Sedis  Aenigmarchaeota Incertae Sedis  class
Archaea;Aenigmarchaeota;Aenigmarchaeota Incertae Sedis;Unknown Order    Unknown Order   order
Archaea;Aenigmarchaeota;Aenigmarchaeota Incertae Sedis;Unknown Order;Unknown Family Unknown Family  family
Archaea;Aenigmarchaeota;Aenigmarchaeota Incertae Sedis;Unknown Order;Unknown Family;Candidatus Aenigmarchaeum   Candidatus Aenigmarchaeum   genus
Archaea;Aenigmarchaeota;Deep Sea Euryarchaeotic Group(DSEG) Deep Sea Euryarchaeotic Group(DSEG) class
Archaea;Aigarchaeota    Aigarchaeota    phylum

I am executing the function and doing some minor data filtering in the example script here

PS: Just FYI, the example of the silva mapping file shown here is from an older silva version.

ckeuschn commented 3 years ago

Hi Christinae

thank you for spoonfeeding this to me! I ran the SILVAtaxopath function with my taxonomic paths:

[[1]]
[1] "Archaea"                             "Crenarchaeota"                       "Nitrososphaeria"                    
[4] "Nitrososphaerales"                   "Nitrososphaeraceae"                  "Candidatus Nitrocosmicus"           
[7] "Candidatus Nitrocosmicus oleophilus"

[[2]]
[1] "Archaea"             "Crenarchaeota"       "Nitrososphaeria"     "Nitrososphaerales"   "Nitrososphaeraceae" 
[6] "uncultured archaeon"

and the Silva mapping file (v115 because this is in the format you used - the newer ones are structured differently apparently)

                                           path                        node   rank
1                                         Archaea                     Archaea domain
2             Archaea;Ancient Archaeal Group(AAG) Ancient Archaeal Group(AAG) phylum
3                           Archaea;Crenarchaeota               Crenarchaeota phylum
4              Archaea;Crenarchaeota;Thermoprotei                Thermoprotei  class
5       Archaea;Crenarchaeota;Thermoprotei;a87Y42                      a87Y42  order
6 Archaea;Crenarchaeota;Thermoprotei;Acidilobales                Acidilobales  order

It throws follwing error:

Error in `[<-`(`*tmp*`, i, as.character(SILVA[SILVA$path == paste(tax[[i]][1:j],  : 
  no 'dimnames' attribute for array

and I got stuck here. Do I miss a package here? Any help is appreciated!

Also, if I understood your code right you have options there to select for certain domains. In my case I want every annotation found by phyloFlash so running you function without these selections would do that right?

Cheers Ch

chassenr commented 3 years ago

Hi @ckeuschn , you need to use the mapping file from the SILVA version that you used to classify your sequences. Otherwise, the function won't work properly. After downloading the mapping file, you will have to format it as required for the function. Sorry that I did not mention this specifically. The columns path and rank are good as they are, and for the column node you will just have to take the last element of the path. You can do this in base R with e.g. sapply(strsplit(silva$path, ";"), function(x) x[length(x)]) assuming that the R object is called silva.

ckeuschn commented 3 years ago

Hi @chassenr, I created a mapping file with the right version from silva and the function worked. I get a matrix with 20 columns corresponding to the ranks and n rows corresponding to the numbers of taxa I would like to parse. However, only some taxa have the correct entries, all others are NA. And it only works for the prokaryotic ranks (domain to genus). I checked a few of them and they are correct, but I don't know how to make that work for every item in my taxa list. Do you have any suggestions where to look for? Thanks Ch

chassenr commented 3 years ago

@ckeuschn Can you share your taxonomic path and the SILVA version that you used? Then I can have a look. It is not uncommon that some taxonomic levels are empty (NA) because not all taxonomic paths include all levels (especially for euks).

ckeuschn commented 3 years ago

Hi, here you find the first 200 taxon paths, the original and formated SILVA file I used and the output it gave me for this.

Cheers Ch

path.txt formatted_silva_mapping.txt tax_slv_ssu_138.1.txt output.txt

chassenr commented 3 years ago

@ckeuschn I suspect the error was in the format of the silva taxmap: the path should not end with a semicolon. I repeated the analysis with the input paths that you provided and the default taxmap downloaded from SILVA:

path <- read.table("path.txt", h = T, sep = "\t", stringsAsFactors = F)
path_split <- strsplit(path$x, ";")
silva <- read.table("../tax_slv_ssu_138.1.txt", h = F, sep = "\t", stringsAsFactors = F)
silva_map <- data.frame(
  path = gsub(";$", "", silva$V1),
  node = sapply(strsplit(silva$V1, ";"), function(x) x[length(x)]),
  rank = silva$V3,
  stringsAsFactors = T
)

SILVAtaxopath <- function(tax, SILVA){
  output <- matrix(NA, nrow = length(tax), ncol = length(levels(SILVA$rank)))
  colnames(output) <- levels(SILVA$rank)
  for (i in 1:length(tax)) {
    for (j in 1:length(levels(SILVA$rank))) {
      if (paste(tax[[i]][1:j], collapse = ";") %in% SILVA$path) {
        output[i, as.character(SILVA[SILVA$path == paste(tax[[i]][1:j], collapse = ";"), "rank"])] <- as.character(SILVA[SILVA$path == paste(tax[[i]][1:j], collapse = ";"), "node"])
      }
    }
  }
  return(output)
}

TAX <- SILVAtaxopath(path_split, silva_map)

write.table(TAX, "taxonomy_split.txt", quote = F, sep = "\t")

That gave me the following output file (still containing all the taxonomic levels, which are useless for prokaryotes, but may be more informative for euks): taxonomy_split.txt

I hope this will now also work for the eukaryotes in your data set. Be aware that SILVA (at least the taxmap) does not include species assignments. Also, you may want to consider replacing e.g. uncultured, Insertae Sedis, unknown with something more indicative of the last classified taxon.

ckeuschn commented 3 years ago

@chassenr Yes, it was the semicolon and I get the table I needed! I will replace the uncultured and unknowns like you did in your script. Thanks a lot for you help, really appreciate it and like this my R skills slowly develop too Cheers Ch