dcat4 / ensembleTax

source R package
Other
6 stars 3 forks source link

assign.ensembleTax - agreeing taxonomy is set to NA when higher rank is NA #4

Open naurasd opened 1 year ago

naurasd commented 1 year ago

Hi,

I have encountered a somewhat weird behavior when running assign.ensembleTax. I am using v1.2.2.

I have performed all the required previous steps of your pipeline, i.e. mapping taxonomies onto each other.

Subsequently, I am using pr2.txt and silva.txt as input for assign.ensembleTax. The files are attached for you to reproduce the issue. I am running the function with count.na=FALSE.

library(ensembleTax)

pr2<-read.table("pr2.txt",sep="\t",header=T)

silva<-read.table("silva.txt",sep="\t",header=T)

tax_list <- list(pr2, silva)
names(tax_list) <- c("pr2", "silva")
ensemble <- assign.ensembleTax(tax_list, 
                               tablenames = names(tax_list), 
                               ranknames = colnames(pr2[,3:12]),
                               tiebreakz = NULL, 
                               count.na=FALSE, 
                               assign.threshold = 0, 
                               weights=c(1,1))

Unfortunately, in some cases, in the output ensemble, some ranks are set to NA, even though both taxonomies agree at this rank.

Below, I show row 11 of ensemble, pr2 and silva.

row11<-rbind(ensemble[11,],pr2[11,],silva[11,])
rownames(row11)<-c("ensemble","pr2","silva")
row11[,-2]

              svN    Domain Supergroup Division Subdivision              Phylum        Class_X Class_Order_Family Order_Family_X Genus Species
ensemble ASV10006 Eukaryota       TSAR Rhizaria    Cercozoa Filosa-Sarcomonadea Glissomonadida               <NA>           <NA>  <NA>    <NA>
pr2      ASV10006 Eukaryota       TSAR Rhizaria    Cercozoa Filosa-Sarcomonadea Glissomonadida               <NA>     Heteromita  <NA>    <NA>
silva    ASV10006 Eukaryota       TSAR Rhizaria    Cercozoa Filosa-Sarcomonadea Glissomonadida               <NA>     Heteromita  <NA>    <NA>

As you can see, for the rank Order_Family_X, both pr2 and silva agree, but the assignment is set to NA in ensemble. Why? This ASV is just one example, there are more cases like this.

My first guess is that this rank is set to NA due to a default behavior, which potentially sets all ranks to NA once a higher rank is NA? Is this the case? This behavior is a bit inconvenient. Eukaryotic taxonomies in some customized Silva and PR2 reference sets can be weird and it may happen that a higher rank is NA even though a lower rank is not.

Thanks for your feedback

Nauras

pr2.txt silva.txt

dcat4 commented 1 year ago

You're correct that assign.ensembleTax stops at the first NA it encounters as it descends down the "nomenclature tree". We built it this way with efficiency in mind for folks working with big data sets. I'll save this idea as a feature request for the next update - meantime if you're comfortable trying to implement please feel free to submit a pull request.

You could also just fill NA's that are found between valid taxonomic names. Here's a code snippet that will get you most of the way there:

df <- t(data.frame(c("tax1", NA, "tax2", "tax3"), 
            c("tax7", "tax8", NA, "tax9"),
            c("tax4", "tax5", "tax6", "tax0")))
colnames(df) <- c("a", "b", "c", "d")
rownames(df) <- NULL

filler <- "incertae_sedis" # whatever you want to fill NA's with
for (i in 2:length(colnames(df))-1) {
  idx <- which(is.na(df[, i]) & !is.na(df[, i-1]) & !is.na(df[, i+1]))
  df[idx, i] <- filler
}
naurasd commented 1 year ago

Thanks for the reply. Unfortunately, I am lacking the skills to submit pull requests and stuff, so this will have to wait. My bad.

Regarding you solution: I initially had the same idea in mind, but this would create another problem.

Let's say I have 2 taxonomies and will insert "incertae_sedis" as suggested by you:

df1 <- t(data.frame(c("tax1", NA, "tax2", "tax3"), 
                   c("tax7", "tax8", NA, "tax9"),
                   c("tax4", "tax5", "tax6", "tax0")))

df2 <- t(data.frame(c("tax1", "tax11", "tax2", "tax3"), 
                   c("tax7", "tax8", NA, "tax9"),
                   c("tax4", "tax5", "tax6", "tax0")))

colnames(df1) <- c("a", "b", "c", "d")
rownames(df1) <- NULL

colnames(df2) <- c("a", "b", "c", "d")
rownames(df2) <- NULL

filler <- "incertae_sedis" # whatever you want to fill NA's with

for (i in 2:length(colnames(df1))-1) {
  idx <- which(is.na(df1[, i]) & !is.na(df1[, i-1]) & !is.na(df1[, i+1]))
  df1[idx, i] <- filler
}
for (i in 2:length(colnames(df2))-1) {
  idx <- which(is.na(df2[, i]) & !is.na(df2[, i-1]) & !is.na(df2[, i+1]))
  df2[idx, i] <- filler
}

Let's look at the 2 taxonomies:

> df1
     a      b                c                d     
[1,] "tax1" "incertae_sedis" "tax2"           "tax3"
[2,] "tax7" "tax8"           "incertae_sedis" "tax9"
[3,] "tax4" "tax5"           "tax6"           "tax0"
> df2
     a      b       c                d     
[1,] "tax1" "tax11" "tax2"           "tax3"
[2,] "tax7" "tax8"  "incertae_sedis" "tax9"
[3,] "tax4" "tax5"  "tax6"           "tax0"

We see that in the first row, df1 basically is NA at rank b, but df2 has an assignment. Let's say I run assign.ensembleTax with count.na=F to keep the assignment of one taxonomy if the other taxonomy is unknown at this particular rank. The ensemble taxonomy for row 1 would be set as NA, because "incertae_sedis" is not a true NA value and for this rank, a disagreement between df1 and df1 would be considered, instead of keeping the assignment of df2 because df1 is unclassified at this rank.

I understand that this may be a minor issue or "bad practice" to just keep the assignment of one taxonomy in this case if the other one is unknown. Either way, I came up with some alternative although less convenient commands for my case, which works for me.

Thanks for your help though.

Nauras