ChiLiubio / microeco

An R package for data analysis in microbial community ecology
GNU General Public License v3.0
182 stars 56 forks source link

trans_abund plots not matching taxa_abund tables #165

Open jblwt67 opened 1 year ago

jblwt67 commented 1 year ago

Hello,

Thank you for this great package. I am encountering an issue with both my own data and the tutorial data, which I was hoping you could help me with.

In taxa_abund generated tables, taxa that are not classified at a particular level (e.g. Genus) are still counted, and so in the table they represent a group with an associated relative abundance value. For example, in the tutorial data, "kBacteria|pChloroflexi|cAnaerolineae|oAnaerolineales|fAnaerolineaceae|g" are the most abundant Genus present in genus_abund tables (Genus_abund.csv), despite only being classified up to the Family level and not at the Genus level (i.e. "g__ "). This is desirable to me, since I do not want to lose information regarding the presence and abundance of the unclassified groups, even if it is simply reflecting all of the OTUs that are taxonomically unclassified at the genus level (but not at the family level) in the taxonomy_table.

However, when compiling and plotting the same tutorial data via trans_abund:

# create trans_abund object t1 <- trans_abund$new(dataset = dataset, taxrank = "Genus", ntaxa = 10)

t1$plot_bar(others_color = "grey70", facet = "Group", xtext_keep = FALSE, legend_text_italic = FALSE)

The following plot is produced:

A

In this plot, unlike in the Genus_abund table, the most abundant genus is no longer the unclassified Anaerolineaceae group ("kBacteria|pChloroflexi|cAnaerolineae|oAnaerolineales|fAnaerolineaceae|g"), but is instead the most abundant taxa in the Genus_abund table with a full taxonomic classification down to the Genus level. In the plot, the most abundant genus is now Sphingomonas (g__Sphingomonas). In the Genus_abund table, this clade is the 11th most fractionally abundant group, as groups unclassified at the genus level (i.e. "g__ "), are still counted).

I am wondering whether it is possible to get the trans_abund plots to match the taxa_abund tables, including unclassified clades? I would ideally like my relative abundance plots to match the taxa_abund tables.

I tried to get around this by using 'delete_full_prefix = TRUE', but this doesn't fix it:

B

I have now had this issue on multiple datasets, including on tutorial data following precisely the same data clean-up and curation commands as set out in the tutorial. I would really appreciate an insight into why the taxa_abund and trans_abund plots are not matching, and how the trans_abund plots can be made to match the taxa_abund data.

I am running the latest version of microeco and R.

Thank you for your help!

ChiLiubio commented 1 year ago

Hi @jblwt67 Good question! You are right. Currently, the function filters all the unidentified taxa, as those taxa are not clear and generally convery little information to the readers in the plot according to my experience (maybe not accurate that I say). I have thought to add the option to show those unknown taxa as you want to show. The final reason that I donot do is those taxonomic information can interfere many operations if user is not familar with the design of the package. For example, those inconsistent lineages can influence the lefse analysis. So I generally suggest users clean the taxonomic table with tidy_taxonomy function. To temporally resolve the plot issue, we can change the tax_table by replacing the 'g' to other thing as the function recognize all the 'g' as null taxonomy and delete those taxa. The following is my test.

library(microeco)
library(magrittr)
data(dataset)
# replace g__ with g__unclassified
dataset$tax_table$Genus[grepl("__$", dataset$tax_table$Genus)] %<>% paste0(., "unclassified")
# must use cal_abund to recalculate the abundance
dataset$cal_abund()
# delete_full_prefix = FALSE is necessary to show full taxonomy
t1 <- trans_abund$new(dataset = dataset, taxrank = "Genus", ntaxa = 10, delete_full_prefix = FALSE)
t1$plot_bar()

Best, Chi

jblwt67 commented 1 year ago

Hi Chi,

Thank you very much for your response - your test solution worked really well. Much appreciated!

For me, this is a really useful feature, since I work a lot with poorly taxonomically resolved archaeal groups that are often very abundant, but not yet taxonomically classified below even the Class level. Conveying the presence of these unclassified groups in plots is therefore often very important.

Just to clarify, would you not recommend using the adjusted tax_table in other downstream analyses as this may cause issues? I can definitely see it being a problem for lefse, but are there any analysis classes where the adjusted tax_table could still be used accurately?

Thanks again for your help and for this great package!

ChiLiubio commented 1 year ago

Hi,

Thanks for your appreciation! The classes depending on the result of cal_abund function are mainly trans_abund and trans_diff. It does not refer to the choice of customized change of tax_table. I just mean you should note that the operation may cause some unpredictable error or issue. I think it is fine to use the adjusted tax_table for you for most cases as you want. The reason of lefse issue is the cladogram require a strict and consistent lineage to correctly show the taxa in the plot. So just have a try.

Best, Chi