biocore / songbird

Vanilla regression methods for microbiome differential abundance analysis
BSD 3-Clause "New" or "Revised" License

Interpretation of differentials #119

Open dadahan opened 4 years ago

dadahan commented 4 years ago

How do you decide which taxa are reported in the differential ranking? Are the reported differentials the coefficients that are changing the most, those that are significant, or some combination of both? Or is it all taxa, just reported as fewer than the input because of the --min-feature-count and --min-sample-count filters? The text from the paper reads:

"The coefficients from multinomial regression analysis can be ranked to determine which taxa are changing the most between samples. "

But how Songbird chooses which taxa to report in differentials.tsv is still unclear to me. If it reports those that are changing the most, what is the cutoff for the coefficient? Is it a sparse model (e.g., within one standard error of the accuracy reported by a non-sparse model)? Is it a hard cutoff of the top n taxa ranked by coefficient? Is there some kind of p-value used as a cutoff? If it reports all features, is there a good rule of thumb for which are considered most significant?

In any case, I am curious how, in a paper's text, one would explain the choice to use the reported differentials as those that distinguish treatment(s).

I saw #83 but this is still unclear to me.

Thank you!

fedarko commented 4 years ago

The end of your first paragraph is correct, I think -- all features should be reported in the differentials. The only reason a feature should be missing is if it failed the --min-feature-count check. However, it's worth noting that the --min-feature-count check is done after the --min-sample-count check (the relevant code is here), so you could conceivably get a silly cascade effect: a feature is present mostly in low-count samples, those samples get filtered out, and that in turn causes the feature to be filtered out. (I doubt that particular scenario would be a big problem for most datasets, but if you're working with low-biomass data it could be -- in that case it might be a good idea to look into adjusting the --min-sample-count / --min-feature-count cutoffs.)
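To make the cascade concrete, here's a toy sketch of that filtering order (samples first, then features). The table, sample names, and thresholds are all invented for illustration; this is not Songbird's actual code, just the same two-step logic:

```python
# Toy count table: rows = features, columns = samples.
# All names and thresholds here are made up for illustration.
table = {
    "featA": {"s1": 8, "s2": 7, "s3": 1},
    "featB": {"s1": 5, "s2": 3, "s3": 40},
}
min_sample_count = 20   # analogous to --min-sample-count
min_feature_count = 10  # analogous to --min-feature-count

# Step 1: drop low-count samples first.
samples = ["s1", "s2", "s3"]
sample_sums = {s: sum(table[f][s] for f in table) for s in samples}
kept_samples = [s for s in samples if sample_sums[s] >= min_sample_count]

# Step 2: then drop low-count features, counting only the kept samples.
kept_features = [
    f for f in table
    if sum(table[f][s] for s in kept_samples) >= min_feature_count
]

# featA totals 16 reads overall, so it would pass the feature cutoff
# on its own -- but it lives mostly in low-count samples. Once s1 and
# s2 are removed, its remaining total is 1, so it gets filtered too.
print(kept_samples)   # ['s3']
print(kept_features)  # ['featB']
```

So a feature that would survive feature filtering on the full table can still disappear once its host samples are removed.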

I'll defer to @mortonjt on "a good rule of thumb for those which are considered most significant".

Ivy-ops commented 4 years ago

Hi @fedarko and @dadahan Has this question been resolved? I am also wondering about a "good rule of thumb for those which are considered most significant". Would it be possible to have an answer? Thanks!!

fedarko commented 4 years ago

Oof, sorry for letting this slip. I'll take a stab at answering your question about significance, with the caveat that @mortonjt (and @lisa55asil) are much more authoritative sources on this than I am.

As far as I'm aware, there isn't really a way (...at least right now) to get something like a p-value describing the significance of a taxon's place in the differential rankings. Something worth considering is that we don't know (from relative data) how things are absolutely changing: it's possible for all (or at least most) taxa to be changing in a similar way across sample conditions -- an example of this is shown in the toothbrushing example from Fig. 2 of Songbird's paper, where basically everything is decreasing in absolute abundance from before to after brushing (albeit some taxa, e.g. Haemophilus, are decreasing more than others).

This ties into this section of the paper, emphasis mine:

It is important to note that the hypothesis tests that ANCOM and ALDEx2 use may not be consistent with the absolute differentials. Under perfect conditions when the absolute differentials are centered around zero (Supplementary Fig. 1a–d), both ANCOM and ALDEx2 correctly infer that microbes with a differential close to zero are likely not changing.

However, if the center of mass changes and the average microbe is now decreasing on average −2 log fold (Supplementary Fig. 1e–h), both ALDEx2 and ANCOM will incorrectly infer that microbes changing −2 log fold are not changing. In this example, the center of mass reference frame is inappropriate, and leads to predictions that microbes are not changing when they are actually changing on an absolute scale. This highlights difficulties when attempting to link information from relative data to absolute data using hypothesis tests. The hypothesis tests that ALDEx2 and ANCOM perform here are not necessarily incorrect, but could be misleading in situations where microbial load differs dramatically among conditions.

So, these sorts of cases demonstrate that attempting to classify certain features' rankings as "significant" or not can be tricky.

This comment is kind of rambling so I'll try to wrap this up. In practice, manually looking at the rankings can be useful: depending on the formula you used in Songbird, the differential rankings you get should indicate features' associations with some metadata field(s) of interest to you. Taking the log-ratio of some of the top-ranked to bottom-ranked features is often a useful way to compare groups of samples; from there, you can see which particular features (or types of features, e.g. features with similar taxonomic classifications) seem to be driving the separation between samples' log-ratios. You can run normal statistical tests on these log-ratios if you really want a measure of "significance" (personally I prefer just quoting effect sizes, but I'm sure other people will have different opinions :).
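A minimal sketch of that log-ratio idea, with entirely made-up counts and group labels (in practice a tool like Qurro handles the feature selection interactively, and real data usually needs a pseudocount or zero-handling before taking logs):

```python
import math

# Invented per-sample counts for features at the top and bottom of
# the differential rankings, plus invented group labels.
top = {"samp1": 30, "samp2": 25, "samp3": 4, "samp4": 3}
bottom = {"samp1": 5, "samp2": 4, "samp3": 28, "samp4": 31}
groups = {"samp1": "treatment", "samp2": "treatment",
          "samp3": "control", "samp4": "control"}

# Natural-log ratio per sample (assumes nonzero counts; real data
# often needs a pseudocount or dropping zero-count samples).
log_ratios = {s: math.log(top[s] / bottom[s]) for s in top}

def group_mean(label):
    vals = [v for s, v in log_ratios.items() if groups[s] == label]
    return sum(vals) / len(vals)

# The difference in mean log-ratios between groups is a simple
# effect size you could report instead of (or alongside) a p-value.
effect = group_mean("treatment") - group_mean("control")
print(round(effect, 2))  # → 3.95
```

From here you could feed the two groups of log-ratios into an ordinary two-sample test if a p-value is wanted.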

Qurro was designed to make this general approach easier: this part of a tutorial shows one basic "workflow" for this, albeit with ALDEx2 instead of Songbird outputs.

Anyway, hope this helps!

Ivy-ops commented 4 years ago

@fedarko , Sincerely appreciate your answer! Pretty comprehensive and useful to me. Thanks!!

sformel commented 3 years ago

@fedarko I think I understand what you're describing, but for clarification, when you say "personally I prefer just quoting effect sizes", you mean describing the log ratio as an effect size, i.e. fold change?

It also leads me to another question. Traditionally, exploratory studies have described the distribution of taxa between sites or treatments at a convenient rank, like Class, Order, or Family to give an overview of results. Is it possible to aggregate the differentials in the same way?

For example, would it be valid to use the mean/median differential or rank to convey that Orders A, B, and C were more abundant at Site 1 than Site 2? Or would a better approach be collapsing the biom table at some taxonomic rank (like tax_glom in phyloseq) and rerunning the songbird model to get the differentials at the Order level?

fedarko commented 3 years ago

when you say "personally I prefer just quoting effect sizes", you mean describing the log ratio as an effect size, i.e. fold change?

This was in reference to reporting differences in selected log-ratios (of ranked features) between samples -- an example of this is Figs. 1(D) and 2(D) from this paper, where we just reported the R^2 between log(Shewanella / reference taxa) and estimated fish age rather than a p-value. The realization of "hey it looks like this log-ratio decreases with age" was not a planned outcome of this study (as the paper mentions, it was just something interesting that we noticed while exploring the fish data), so rather than attempt to describe the significance of the effect we just described its magnitude and left it at that.

Of course, you don't have to avoid p-values with this stuff... this forum thread goes into more detail.


Traditionally, exploratory studies have described the distribution of taxa between sites or treatments at a convenient rank, like Class, Order, or Family to give an overview of results. Is it possible to aggregate the differentials in the same way?

For example, would it be valid to use the mean/median differential or rank to convey that Orders A, B, and C were more abundant at Site 1 than Site 2? Or would a better approach be collapsing the biom table at some taxonomic rank (like tax_glom in phyloseq) and rerunning the songbird model to get the differentials at the Order level?

My opinion is that the former approach (using the uncollapsed data, and then looking at where the members of Orders A, B, and C fall within the rankings) will be a better idea for most use cases, or at least the ones I can think of. There is often a lot of functional variation within a given taxonomic level (e.g. different E. coli strains doing wildly different stuff), so it might make sense that different features within Order A are present in multiple places in the rankings for a good reason.

Figs. 1(A) and 2(A) in the paper I linked above are a nice example of this -- most of the Shewanella (red)-classified features are highly ranked (i.e. are associated with gill samples), but there's that one Shewanella feature over on the left side of the rankings. This feature's ranking and/or taxonomic classification could be a fluke, but it could also be occupying a different environment / serving a meaningfully different function than the other Shewanella features in the plot; and whatever the true reason, this is the sort of pattern that collapsing your data will probably mask. (That being said, if you explicitly don't care about this close of a resolution and just want to look at things at a set level, then I guess collapsing wouldn't be too bad.)
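To sketch what "looking at where the members of an Order fall within the rankings" might look like in practice -- feature IDs, taxonomy assignments, and differential values below are all invented:

```python
# Invented differentials (feature -> value for some covariate) and
# invented Order-level taxonomy assignments.
differentials = {
    "feat1": 2.1, "feat2": 1.8, "feat3": -1.5, "feat4": -1.9,
}
taxonomy = {
    "feat1": "OrderA", "feat2": "OrderA",
    "feat3": "OrderA", "feat4": "OrderB",
}

# Rank features from lowest to highest differential.
ranked = sorted(differentials, key=differentials.get)
positions = {f: i for i, f in enumerate(ranked)}

# Where do OrderA's members land? Two near the top and one near the
# bottom -- a spread that averaging (or collapsing the table) would
# flatten into a single middling value.
order_a = sorted(positions[f] for f, o in taxonomy.items() if o == "OrderA")
print(order_a)  # → [1, 2, 3]
```

A single mean/median differential per Order would report one number for OrderA here and hide the within-Order disagreement entirely.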


These answers are getting kind of subjective, so... these are my perspectives on these problems as a non-statistician and non-biologist, at least. @mortonjt please feel free to yell at me if I messed something up ;)

sformel commented 3 years ago

Like always, thank you for the fast and thoughtful response! The compositional data world is still new to me, and every time I review the Songbird paper or Qurro paper I grasp a little more.

I agree that keeping the high resolution of the uncollapsed data is the better way to go, and I think I understand now how I can move forward with creating a general description. I'm not a statistician either, so it's nice to get some feedback. Your subjectivity is partly what I was interested in because I'm sure I will hear lots of opinions in the review process :-)