FrederickHuangLin / ANCOMBC

Differential abundance (DA) and correlation analyses for microbial absolute abundance data
https://www.nature.com/articles/s41467-020-17041-7
108 stars 29 forks source link

Extract normalised counts for linear models outside of ANCOMBC? #260

Open charlesfoster opened 7 months ago

charlesfoster commented 7 months ago

Hi,

Thanks for producing this tool - it looks very useful. A major focus of an ongoing project I'm involved with revolves around properly normalising metagenomic count data for virus taxa after using capture probe-based sequencing. The study has a very complex nested case control (NCC) design to test whether any taxa are significantly associated with a disease outcome of interest. We are intending to run the association analysis using the clogit function from the 'survival' R package.

Am I correct that such an NCC design would not be able to be directly implemented in ANCOMBC? If so, would the best alternative to be to normalise the data using ANCOMBC then extract the normalised counts to use as an input to the clogit function? For example, based on the vignette:

# some params here are specific to the vignette's data set
out = ancombc(data = tse, assay_name = "counts", 
              tax_level = "Family", phyloseq = NULL, 
              formula = "age + region + bmi", 
              p_adj_method = "holm", prv_cut = 0.10, lib_cut = 1000, 
              group = "bmi", struc_zero = TRUE, neg_lb = TRUE, tol = 1e-5, 
              max_iter = 100, conserve = TRUE, alpha = 0.05, global = TRUE,
              n_cl = 1, verbose = TRUE)

samp_frac = out$samp_frac
# Replace NA with 0
samp_frac[is.na(samp_frac)] = 0 
# Add pseudo-count (1) to avoid taking the log of 0
log_obs_abn = log(out$feature_table + 1)
# Adjust the log observed abundances
log_corr_abn = t(t(log_obs_abn) - samp_frac)

library(survival)
fit <- clogit(case ~ virus_name + variable1  + variable2 + ... + variableN + strata(nest), data = log_corr_abn)

If this is on the right track, I'm just not sure what to put for the formula and group parameters in the ancombc/ancombc2 function since the 'true' formula for an NCC analysis couldn't be included.

Finally, on another note: would you recommend batch-correcting counts before ancombc (e.g. using ConQuR)?

Any advice would be greatly appreciated.

Charles