Open naurasd opened 1 year ago
You're correct that assign.ensembleTax stops at the first NA it encounters as it descends down the "nomenclature tree". We built it this way with efficiency in mind for folks working with big data sets. I'll save this idea as a feature request for the next update - meantime if you're comfortable trying to implement please feel free to submit a pull request.
You could also just fill NA's that are found between valid taxonomic names. Here's a code snippet that will get you most of the way there:
df <- t(data.frame(c("tax1", NA, "tax2", "tax3"),
c("tax7", "tax8", NA, "tax9"),
c("tax4", "tax5", "tax6", "tax0")))
colnames(df) <- c("a", "b", "c", "d")
rownames(df) <- NULL
filler <- "incertae_sedis" # whatever you want to fill NA's with
for (i in 2:length(colnames(df))-1) {
idx <- which(is.na(df[, i]) & !is.na(df[, i-1]) & !is.na(df[, i+1]))
df[idx, i] <- filler
}
Thanks for the reply. Unfortunately, I am lacking the skills to submit pull requests and stuff, so this will have to wait. My bad.
Regarding you solution: I initially had the same idea in mind, but this would create another problem.
Let's say I have 2 taxonomies and will insert "incertae_sedis" as suggested by you:
df1 <- t(data.frame(c("tax1", NA, "tax2", "tax3"),
c("tax7", "tax8", NA, "tax9"),
c("tax4", "tax5", "tax6", "tax0")))
df2 <- t(data.frame(c("tax1", "tax11", "tax2", "tax3"),
c("tax7", "tax8", NA, "tax9"),
c("tax4", "tax5", "tax6", "tax0")))
colnames(df1) <- c("a", "b", "c", "d")
rownames(df1) <- NULL
colnames(df2) <- c("a", "b", "c", "d")
rownames(df2) <- NULL
filler <- "incertae_sedis" # whatever you want to fill NA's with
for (i in 2:length(colnames(df1))-1) {
idx <- which(is.na(df1[, i]) & !is.na(df1[, i-1]) & !is.na(df1[, i+1]))
df1[idx, i] <- filler
}
for (i in 2:length(colnames(df2))-1) {
idx <- which(is.na(df2[, i]) & !is.na(df2[, i-1]) & !is.na(df2[, i+1]))
df2[idx, i] <- filler
}
Let's look at the 2 taxonomies:
> df1
a b c d
[1,] "tax1" "incertae_sedis" "tax2" "tax3"
[2,] "tax7" "tax8" "incertae_sedis" "tax9"
[3,] "tax4" "tax5" "tax6" "tax0"
> df2
a b c d
[1,] "tax1" "tax11" "tax2" "tax3"
[2,] "tax7" "tax8" "incertae_sedis" "tax9"
[3,] "tax4" "tax5" "tax6" "tax0"
We see that in the first row, df1 basically is NA at rank b, but df2 has an assignment. Let's say I run assign.ensembleTax
with count.na=F
to keep the assignment of one taxonomy if the other taxonomy is unknown at this particular rank. The ensemble taxonomy for row 1 would be set as NA, because "incertae_sedis" is not a true NA value and for this rank, a disagreement between df1 and df1 would be considered, instead of keeping the assignment of df2 because df1 is unclassified at this rank.
I understand that this may be a minor issue or "bad practice" to just keep the assignment of one taxonomy in this case if the other one is unknown. Either way, I came up with some alternative although less convenient commands for my case, which works for me.
Thanks for your help though.
Nauras
Hi,
I have encountered a somewhat weird behavior when running assign.ensembleTax. I am using v1.2.2.
I have performed all the required previous steps of your pipeline, i.e. mapping taxonomies onto each other.
Subsequently, I am using pr2.txt and silva.txt as input for assign.ensembleTax. The files are attached for you to reproduce the issue. I am running the function with
count.na=FALSE
.Unfortunately, in some cases, in the output
ensemble
, some ranks are set toNA
, even though both taxonomies agree at this rank.Below, I show row 11 of
ensemble
,pr2
andsilva
.As you can see, for the rank Order_Family_X, both
pr2
andsilva
agree, but the assignment is set toNA
inensemble
. Why? This ASV is just one example, there are more cases like this.My first guess is that this rank is set to NA due to a default behavior, which potentially sets all ranks to NA once a higher rank is NA? Is this the case? This behavior is a bit inconvenient. Eukaryotic taxonomies in some customized Silva and PR2 reference sets can be weird and it may happen that a higher rank is NA even though a lower rank is not.
Thanks for your feedback
Nauras
pr2.txt silva.txt