Unclear output with four conditions

matthiaswietz commented 2 years ago

Dear Frederick,

first of all many thanks for this package! Just tested to detect taxa enriched by season using

bac <- pseq %>% tax_glom("Genus") %>% filter_taxa(.,function(x) mean(x) > 10, T)

out = ancombc(bac, formula = "season", p_adj_method = "holm", zero_cut = 0.90, lib_cut = 1000, group = "season", struc_zero = T, neg_lb = F, tol = 1e-5, max_iter = 100, conserve = T, alpha = 0.05, global = T)

I do get an output, which is however confusing: there is "seasonspring", "seasonsummer" and "seasonwinter" but no "seasonfall".

Also, I'd interpret that TRUE in res$diff_abn means that a given taxon is enriched under the specific condition (for example, if TRUE under seasonspring it's enriched in spring). But the results do not match other DA tools I've been testing.

Many thanks for your help! Matthias

FrederickHuangLin commented 2 years ago

Hi Matthias,

Thanks for your interest in ANCOMBC.

I guess the season variable is a categorical variable with four levels: spring, summer, fall, and winter, right? In R, the categorical variable will be ordered alphabetically by default, so the levels are: fall < spring < summer < winter, which means fall is the reference level.

Therefore, for the outputs you saw, seasonspring means spring - fall, seasonsummer mean summer - fall, etc. You won't see seasonfall as it is essentially 0.

res$diff_abn will give you a hint whether a taxon is differentially abundant (TRUE) or not (FALSE). To see whether a taxon abundance is increasing or decreasing, you can go to res$beta. For example, seasonspring = 0.5 means the average abundance in spring is exp(0.5) times more than that of fall.

Best, Frederick

matthiaswietz commented 2 years ago

Thanks Frederick for the detailed explanation! Maybe I'm confused now, but how would I convert the output to identify the taxa enriched in fall, if this is "only" the reference level? Eventually, I was hoping to pinpoint the taxa significantly enriched per season, and not only in comparison to fall. Many thanks again!

FrederickHuangLin commented 2 years ago

Hi @matthiaswietz,

I understand your concern, but unfortunately, all current differential abundance methods can only handle the difference, not the individual effect size. Here is an illustrative example, consider a simple linear regression with a categorical variable sex as the only covariate

# Load the data
data("Salaries", package = "car")
# Linear regression model
model <- lm(salary ~ sex, data = Salaries)
summary(model)$coef

The output looks like

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   101002       4809   21.00 2.68e-66
## sexMale        14088       5065    2.78 5.67e-03

where 101002 is the average salary among females, 101002 + 14088 is the average salary among males, and 14088 is the average difference in salary between males and females. ANCOM-BC is able to estimate the difference (14088 in this example unbiasedly), but not able to estimate the intercept term (101002 in this example, or the effect size of fall in your case) unbiasedly like all other methods.

Hope it makes sense, Frederick

xvazquezc commented 2 years ago

Hi @FrederickHuangLin , I was looking for info in a related matter and I found this issue. I understand what you say, which is more or less what's in the tutorial, but what it threw me out was the Figure 6a in the ANCOM-BC paper as it shows all possible comparisons between the 3 countries, something that I'm interested, and pretty sure that is closer to what @matthiaswietz was enquiring about. How did you run that? This is also referred to in Supplementary Tables 2 and 3 for example where you refer to them as "Pairwise tests using ANCOM-BC". Is there a hidden pairwise test option or is it required something else to get there? Thank you in advance.

FrederickHuangLin commented 2 years ago

Hi @xvazquezc,

The figure 6a you referred to was generated by applying two-group comparison to different subsets of data, i.e., the subgroup of "MA" and "US", the subgroup of "VEN" and "US", and the subgroup of "MA" and "VEN", respectively.

We understand that this is not the perfect and statistically rigorous way to handle pairwise comparisons, and we have a prototype extension to ANCOM-BC but the updated algorithm has not been fully tested. We put all resources at this moment to develop a correlation methodology and it is about to wrap up. The formal ANCOM-BC extension will come next. Stay tuned!

Best, Frederick

papelypluma commented 2 years ago

Hi @FrederickHuangLin, thanks for sharing this tool. If I may just ask a question in this thread since it's somewhat related. I was wondering if there's a way to re-level the levels of a variable i.e. manually setting the reference level besides the default alphabetical order. Thanks a lot!

Update: One way I can re-level is to change the names of the covariate levels (in the metadata file before generating the phyloseq object) in such a way the first level in the alphabetical arrangement is the reference group.

FrederickHuangLin commented 2 years ago

Hi @papelypluma,

Yes, you can definitely relevel the groups in R. For example, you can refer to this post.

Best, Frederick

tsa4a12 commented 2 years ago

Hi @papelypluma,

Yes, you can definitely relevel the groups in R. For example, you can refer to this post.

Best, Frederick

I think the problem is that the formula argument only accepts strings instead of factor levels/designs as input (which is commonly used in similar tools), which makes it incredibly difficult to change reference using the relevel() command.

It would be great if we can input something like ancombc(.... , formula = ~condition+batch,..) instead of ancombc(.... , formula = "condition+batch",..)

of course we can do something like condition = str_replace_all(condition,pattern = "reference_of_interest",replacement = "A") as @papelypluma said

FrederickHuangLin / ANCOMBC

Unclear output with four conditions #51