ChiLiubio / microeco

An R package for data analysis in microbial community ecology
GNU General Public License v3.0
202 stars 58 forks source link

maaslin2 as differential abundance method? #235

Closed marwa38 closed 1 year ago

marwa38 commented 1 year ago

Hi Chi

maaslin2 as differential abundance methods I was advised by a statistician to run GLM for differential abundance and a senior PhD student lately informed me about maaslin2 which is GLM based method for differential abundance analysis. Would you possibly add maaslin2 as a differential abundance method?

Thanks Marwa

ChiLiubio commented 1 year ago

Hi Marwa

Thanks. It is a good method. I have known it for a long time. Now it is in my to-do list.

Best, Chi

ChiLiubio commented 1 year ago

Hi

microeco pakcage has been updated in github. maaslin2 is added in the cal_cor function of trans_env class. As it is a method for finding associations between metadata and microbial abundance data, it is better to put the method in trans_env instead of trans_diff class in terms of the input data format. And plot_cor is also better to show the result. As you can see in the following example, plot_cor show the result of maaslin2 with the 'coef' column of its result, different from the correlation coefficient of pearson/spearman correlations. Note that 'coef' in maaslin2 may comes from different models (glm, lmer, cpglm ...) depending on the input options. The example uses the files inside Maaslin2 package in order to reproduce the example result of Maaslin2 package.

library(Maaslin2)
input_data <- as.data.frame(t(read.delim(system.file('extdata','HMP2_taxonomy.tsv', package="Maaslin2"), row.names = 1)))
input_metadata <- read.delim(system.file('extdata','HMP2_metadata.tsv', package="Maaslin2"), row.names = 1)
library(microeco)
d1 <- microtable$new(input_data, sample_table = input_metadata)
# here we donot need to use cal_abund, since no taxonomy can be used
# cal_abund is generally necessary to get the relative abundance at each taxonomic level
d1$taxa_abund$Species <- d1$otu_table
# reproduce the result in the example of Maaslin2 package
t1 <- trans_env$new(dataset = d1, env_cols = 1:ncol(d1$sample_table))
# use parameter passing to add parameters of Maaslin2 function
t1$cal_cor(use_data = "Species", use_taxa_num = nrow(d1$otu_table), cor_method = "maaslin2", standardize = FALSE, 
    transform = "AST",
    fixed_effects = c('diagnosis', 'dysbiosisnonIBD','dysbiosisUC','dysbiosisCD', 'antibiotics', 'age'), 
    random_effects = c('site', 'subject'),
    normalization = 'NONE',
    reference = 'diagnosis,nonIBD',
    plot_heatmap = FALSE, plot_scatter = FALSE)
# make a heatmap to show the result
t1$plot_cor(cluster_ggplot = "both", keep_prefix = FALSE, filter_feature = "")

The intermediate files is in tmp_output folder. You can compare it with the demo_output folder from the example running of Maaslin2 package. They are almost same. Thanks for your suggestion.

Best, Chi

marwa38 commented 1 year ago

going through it now but thanks so much Chi

marwa38 commented 1 year ago

which version of maaslin2 you kindly used and R version?

image image

marwa38 commented 1 year ago

same error with another R version image

ChiLiubio commented 1 year ago

Hi,

Please update microeco to v0.17.1 from github with this command.

devtools::install_github("ChiLiubio/microeco")

图片

This version has not been released in CRAN.

Chi

marwa38 commented 1 year ago

can't install getting this error/warnings image

image

ChiLiubio commented 1 year ago

Sorry. I am not sure how it happened.

marwa38 commented 1 year ago

Should it be R version 4.3?

marwa38 commented 1 year ago

many packages are not yet updated up to R4.3, I updated to r-base=4.3 but still getting errors, could you please apply the changes to R version 4.2.2 (2022-10-31)?

ChiLiubio commented 1 year ago

I found I forgot to update the NAMESPACE file in github, leading to the error 'undefined exports'. I have updated it in github. Please try again. I think the failure is not coming from R 4.3, because I have tested the package for both versions. Sorry for the delay.

marwa38 commented 1 year ago

Hi Chi it is working fine, thanks.
I tried to have more than one column for each of the taxonomic rank but I got an error that that it is non-numerical which means that maaslin2 don't accept more than one taxonomic level/rank in different columns. I am trying to make all different rank levels available so that I could do the same as in other differential abundance tables, do you think that is possible as a workaround in microeco? image

Thanks

marwa38 commented 1 year ago

as you can see in the above figure it is genus level and species at the same time.

any possibility to visualize as the case in other differential abundance methods? as in comparative studies, preferable to have the same visualization. image

marwa38 commented 1 year ago

How microeco have dealt with the same genera level in other differential abundance analysis methods? getting this error Error in .rowNamesDF<-(x, value = value) : duplicate 'row.names' are not allowed

# import data
ps.prev.intes <- readRDS("phyobjects/ps.prev.intes.f.rds")

# Replace Species with full name
tax_table(ps.prev.intes)[,"Genus"] <- paste(tax_table(ps.prev.intes)[,"Family"], 
                                                            tax_table(ps.prev.intes)[,"Genus"], 
                                                            tax_table(ps.prev.intes)[,"Species"])
# SPLIT data into pairwise 
# M vs V
ps.prev.intes.M.V <- subset_samples(ps.prev.intes, Regime == "M" | Regime == "V")
ps.prev.intes.M.V <- prune_taxa(taxa_sums(ps.prev.intes.M.V) > 0, ps.prev.intes.M.V)

# splitted

# M vs V
# merge first bet otu and tax table 
tax <- as.data.frame(tax_table(ps.prev.intes.M.V))
tax$features <- rownames(as.data.frame(tax_table(ps.prev.intes.M.V)))
otu <- as.data.frame(t(otu_table(ps.prev.intes.M.V)))
otu$features <- rownames(as.data.frame(t(otu_table(ps.prev.intes.M.V))))

input_data.M.V <- merge(tax, otu, by = "features")
# remove other non-numeric columns and leave the "Genus" column only 
input_data.M.V <- input_data.M.V[, c(-1:-6, -8)]

rownames(input_data.M.V) <- input_data.M.V$Genus
Error in `.rowNamesDF<-`(x, value = value) : duplicate 'row.names' are not allowed
input_metadata.M.V <- as.data.frame(meta(ps.prev.intes.M.V))

excerpt from the input_data.M.V dataframe image

ChiLiubio commented 1 year ago

I will also try to add maaslin2 method in the trans_diff class by invoking trans_env class in order to make the comparable steps fluent and easy. Thanks. Tell me if I miss something.

ChiLiubio commented 1 year ago

Do you mean how to avoid the dulplication of some unclassified or unstandard genera names? To be honest, it is a frustrated problem when doing analysis for all the 16S, ITS and others. In microeco, it is crude to directly filter those taxa with unstandard names as many as possible to reduce unexpected errors in all the analysis.

marwa38 commented 1 year ago

Do you mean how to avoid the dulplication of some unclassified or unstandard genera names? To be honest, it is a frustrated problem when doing analysis for all the 16S, ITS and others. In microeco, it is crude to directly filter those taxa with unstandard names as many as possible to reduce unexpected errors in all the analysis.

Yeah, that is what I mean. NAs means that they are unclassified at the species level but have the same genus name of the other feature with classified species (or unclassified at the genus level but have the same family name of the other feature with classified genus). The problem if something was removed, how truly will be that particular feature count at the end? could the counts be merged? So when you do the trans_diff option, you just remove repeated rows/taxa including the otu/asv counts?

ChiLiubio commented 1 year ago

In trans_diff, there are two types of input on the abundance at different levels, either the relative abundance or original abundance. Both the two inputs refer to the abundance merging when needed. For example, a genus means all the asvs/otus/species under this genus are summed. So the merged taxa names have no duplication (generally with a long prefix to make them unique). What I mean on removing taxa is to delete those have no clear taxonomic name in the following analysis, not the repeated taxa, as merging can not generate any repreated taxa.

ChiLiubio commented 1 year ago

Maaslin2 has also been added to trans_diff class by invoking that in trans_env. Here is my test on the package of github.

# maaslin2
rm(list = ls())
library(microeco)
data(dataset)
data(env_data_16S)
dataset$sample_table <- cbind.data.frame(dataset$sample_table, env_data_16S[rownames(dataset$sample_table), ])
tmp <- c("Temperature", "Precipitation", "TOC", "NH4", "NO3", "pH", "Conductivity", "TN")
t1 <- trans_diff$new(dataset = dataset, method = "maaslin2", taxa_level = "Genus", use_taxa_num = 20,
    standardize = FALSE, fixed_effects = tmp, transform = "AST")
View(t1$res_diff)
# we can also extract the original object to plot heatmap
t1$res_trans_env$plot_cor(cluster_ggplot = "both", keep_prefix = FALSE)

# Another example
library(Maaslin2)
input_data <- as.data.frame(t(read.delim(system.file('extdata','HMP2_taxonomy.tsv', package="Maaslin2"), row.names = 1)))
input_metadata <- read.delim(system.file('extdata','HMP2_metadata.tsv', package="Maaslin2"), row.names = 1)
library(microeco)
d1 <- microtable$new(input_data, sample_table = input_metadata)
d1$taxa_abund$Species <- d1$otu_table
t1 <- trans_diff$new(dataset = d1, method = "maaslin2", 
        fixed_effects = c('diagnosis', 'dysbiosisnonIBD','dysbiosisUC','dysbiosisCD', 'antibiotics', 'age'),
        random_effects = c('site', 'subject'),
        normalization = 'NONE',
        reference = 'diagnosis,nonIBD',
        standardize = FALSE,
        taxa_level = "Species", filter_thres = 0.001)
View(t1$res_diff)