biocore / songbird

Vanilla regression methods for microbiome differential abundance analysis
BSD 3-Clause "New" or "Revised" License
54 stars 25 forks source link

Can songbird be used in a Jupyter notebook without command line wrappers? #155

Open jolespin opened 2 years ago

jolespin commented 2 years ago

I'd like to try out songbird but I have a lot of data to prototype.

Creating intermediate files really puts a bottleneck in my workflow and was wondering if I would be able to do the following:

  1. Use songbird functions and classes in my python interpreter
  2. Not install QIIME2
  3. Use pandas DataFrames instead of BIOM (though, this isn't a big deal if I have to use BIOM)

Also, I noticed you have some statsmodels tutorials. Does that mean there is a version of songbird using statsmodels as the backend???

mortonjt commented 2 years ago

Great - Songbird should be able to deal with large datasets with out a problem.

  1. See this jupyter notebook - feel free to follow up since its not well documented
  2. There is a standalone version that doesn't need q2 -- see the readme
  3. Pandas is used internally, so that shouldn't be a problem

Yes statsmodels is in the backend, so all of the formula formatting applies here.

jolespin commented 2 years ago

Thanks @mortonjt , I'll give this a try. Sorry, that was poor phrasing on my end. I didn't mean "large dataset" I meant a bunch of different versions of my features I wanted to run DGE analysis. It's a metatranscriptomics dataset so a lot of hierarchical features I can collapse.

I've installed the standalone version in a separate environment (I'm using Python 3.8 in my main environment). Will try out that version if I'm unable to get the pieces in the Jupyter notebook working properly.

All implementations require tensorflow and there are no purely statsmodels or scikit-learn implementations right?

mortonjt commented 2 years ago

right, everything requires tensorflow atm, but we are currently refactoring everything to use Stan in Birdman

jolespin commented 2 years ago

I've gotten songbird to run (had the ModuleNotFound error but realized I needed to install tensorflow-estimator==1.15.1). I have a few questions regarding running and results:

  1. If you have a disease (CariesPhenotype) you are testing for while controlling for both continuous (Age_zscore) and categorical data (Center, Sex), how should the patsy formula be specified? I was under the impression it would be something like: --formula "C(CariesPhenotype, Treatment('Caries-free')) ~ C(Center) + C(Sex) + Age_zscore" but your documentation suggest it would be more like --formula "C(CariesPhenotype, Treatment('Caries-free')) + C(Center) + C(Sex) + Age_zscore". Does this seem about right? How does the model know the primary target is CariesPhenotype?
  2. Do I need to run a null model to get statistical significance of differentials? If so, what is the recommended way to compare the model w/ the null?
  3. I noticed there is a Training and Testing option. Is it possible to specify custom cv pairs like in sklearn.model_selection.cross_val_score where cv can be `An iterable yielding (train, test) splits as arrays of indices.?
  4. Does this perform CLR normalization in the backend or should I transform the data prior?

For reference, here was my command:

songbird multinomial \
    --input-biom ../../counts/metaT/featurecounts.no_multimapping.formatted.biom \
    --metadata-file ../../metadata/caries.metaT.metadata.lite.tsv \
    --formula "C(CariesPhenotype, Treatment('Caries-free'))" \
    --epochs 10000 \
    --differential-prior 0.5 \
    --summary-interval 1 \
    --summary-dir output

Does this fit seem standard?

image

mortonjt commented 2 years ago
  1. Note that when you are fitting a differential abundance model, you are always predicting the abundances -- so the left side of the formula is never specified. Songbird doesn't predict outcomes based on abundances.
  2. There is a section on computing null models and comparing them with the q2 score. See the readme
  3. If you are willing to build a little bit of custom code, yes this is totally fine
  4. Yes it uses CLR coordinates in the backend (actually uses ALR coordinates, but CLR coordinates are outputted).
jolespin commented 2 years ago

Ok that makes sense. I appreciate the clarification, I must have been confusing it w/ logistic regression.

I've read through the README but I'm still a bit confused.

  1. My main confusion on which features to consider in the output. Typically in edgeR/DESeq2/ALDEx2 I make a volcano plot of log2FC vs. -log2(FDR) and choose a suitable FDR threshold based on this distribution. With songbird, it looks like the main output is differentials.

Here's the most relevant info I found regarding differentials:

differentials.tsv: This contains the ranks of the features for certain metadata categories. The higher the rank, the more associated it is with that category. The lower the rank, the more negatively associated it is with a category. The recommended way to view these files is to sort the features within a given column in this file and investigate the top/bottom features with the highest/lowest ranks. (Qurro makes this sort of analysis easier.)

The first column in differentials.tsv contains the IDs of your features (if you plugged in a QIIME 2 feature table and your "features" are ASVs/sOTUs/..., then you can look up the sequence or bacterial name by merging with rep-seqs or taxonomy, respectively). Once you have identified the features that change the most and least (have the highest and lowest differentials) you can plot the log ratio of these features across metadata categories or gradients! https://github.com/biocore/songbird#71-faqs-running-songbird-standalone-

From my understanding, the Q2 metric is for the entire model as a whole and not the individual features. Is there a way to assess statistical significance for features or am I making an incorrect assumption about songbird's objective?

  1. My second question is how the null and main models should be run together.
    2a. Should the parameters be the same (other than the formula) to ensure no parameter bias? 2b. How should you setup the commands to open tensorboard using both models without overwriting the output files like differentials?

In the example, you have your model:

songbird multinomial \
    --input-biom data/redsea/redsea.biom \
    --metadata-file data/redsea/redsea_metadata.txt \
    --formula "Depth+Temperature+Salinity+Oxygen+Fluorescence+Nitrate" \
    --epochs 10000 \
    --differential-prior 0.5 \
    --training-column Testing \
    --summary-interval 1 \
    --summary-dir results

and your null model:

songbird multinomial \
    --input-biom data/redsea/redsea.biom \
    --metadata-file data/redsea/redsea_metadata.txt \
    --formula "1" \
    --epochs 10000 \
    --differential-prior 0.5 \
    --training-column Testing \
    --summary-interval 1 \
    --summary-dir results

The only parameter change is the formula specifying the null model. However, when I ran this on mine it appears to have overwritten my differentials.tsv file. Should these be run in the same directory? If not, how can you view them together with tensorboard.

Thanks again for your help. A colleague has been talking about songbird for a few years so I'm finally making it a point to try it out on my dataset.

mortonjt commented 2 years ago

Right ... FDR correction is quite flawed if you don't have absolute abundances. If you don't have absolute abundances, it isn't possible to do any meaningful statistical significance testing on a per-feature level. See our paper on this : https://www.nature.com/articles/s41467-019-10656-5 We don't return any pvalues, with the intention of letting users choose reference frames after model fitting with qurro. Now in the next iteration, we will enable uncertainty to be estimated via Stan, so it'll enable more flexibility for constructing customized metrics.

I don't completely understand question 2a. What do you mean the parameters should be the same?

Regarding your last question, yes save them to separate directory.

jolespin commented 2 years ago

Right ... FDR correction is quite flawed if you don't have absolute abundances. If you don't have absolute abundances, it isn't possible to do any meaningful statistical significance testing on a per-feature level. See our paper on this : https://www.nature.com/articles/s41467-019-10656-5

That's a great point about abundances and differentials. Going to read your songbird and qurro papers to get more familiar with reference frames this week. I encounter differentials in a lot of my datasets so a method to properly analyze them would be really useful.

Now in the next iteration, we will enable uncertainty to be estimated via Stan, so it'll enable more flexibility for constructing customized metrics

Very much looking forward to BIRDMAn. It's always nice seeing some properly implemented bayesian methodologies towards problems I'm actually interested in instead of radon contamination. Helps me understand the underlying stats much better.

I don't completely understand question 2a. What do you mean the parameters should be the same?

I worded this poorly. I meant to ask if parameters such as --epochs or --differential-prior should be the same between our null model and experimental model.

mortonjt commented 2 years ago

Got it - yes it is preferable if epochs and --differential-prior are the same across the runs.

jolespin commented 2 years ago

I had a few questions after reading through the songbird and qurro papers:

  1. image

It mentions Na is the total abundance of sample A and NA is biomass bias. Is biomass bias in this context literal bias from biomass that can only be measured experimentally or is it analogous to total abundance?

  1. image

In your implementation of songbird, are you using this version in equation (3) during your ALR calculation? In the "Interpreting ranks" section, you mention Actinomyces and P. acnes are used as the reference. Is there a way to select the reference feature in the command line version or does this need to be done manually using the MultRegression object from songbird?

  1. I'm trying to understand the following: image

Is this model using ALR ratios to predict the total biomass (true abundance?) of each sample? In my understanding, the coefficients for each ALR transformed feature are differentials and they are converted into CLR (scale? space? transform? coordinates?) using equation (14). I'm still not entirely sure how the metadata is used. For example, in the simple case of diseased vs. healthy. If there are 3 microbes and you are trying to calculate the differentials in the diseased vs. healthy is the model something like this?

MicrobialMassi ~ B1*Taxa1/Taxa3 + B2*Taxa2/Taxa3 + C(Phenotypei)

then the model is fit while (B1, B2) are converted from ALR to CLR coordinates? (I may be misusing the term coordinates here)

  1. One bit I'm still unclear about after reading the paper in detail is how one chooses which features to investigate without any domain knowledge based on the differentials. For example, the logic in Actinomyces and Haemophilus in your pre/post brushing dataset makes a lot of sense but these aren't the most highly differential ranks according to Fig. 2b. If one was to run this on unlabeled data, would it be possible to pick out the same trends or this more or so used for confirming existing hypotheses?

  2. In qurro, it's not clear to me what the numerator and denominator are below in Fig. 2a. If these differentials are from songbird, aren't they the estimated differential values between covariates/conditions? image

  3. Again, this kind of goes back to [4] but after reading the qurro paper, I'm still not sure how to statistically choose which differentials to use from a data-driven perspective. I understand there are reasons for not using frequentist statistics but I'm thinking about automation and reproducibility. Maybe a clustering algorithm on 1D data? Do you have any insight on this? I suspect the null models generated by songbird could be really useful in achieving this but I didn't see these used in the qurro paper.

Apologies for all the questions. I just want to make sure I'm understanding the gist of the algorithm before I interpret the results in a paper I'm working on (and before I describe what's happening to my advisor). Many of my collaborators are still in the "just use edgeR because it works" mentality and do not consider compositionally (trying to raise awareness amongst the collabs).

fedarko commented 2 years ago

I'll let @mortonjt answer your first 3-4 questions, but I think I can help out with questions 5 and 6 (and maybe a bit with questions 2 and 4):

  1. In qurro, it's not clear to me what the numerator and denominator are below in Fig. 2a. If these differentials are from songbird, aren't they the estimated differential values between covariates/conditions?

The differentials are from Songbird, yes. The regression model run using Songbird was comparing each of the fish body sites with sea water samples (with the goal of identifying features that did a good job distinguishing samples from each body site from a sea water sample). Section 1 of the supplementary material (should be freely available online, let me know if you can't access it and I can send you a PDF) of the paper goes into more detail about this. If you're interested in seeing how we ran Songbird to generate these differentials, this notebook demonstrates that (please note that this analysis was done before we had the validation-against-a-null-model stuff added to the Songbird README, so this doesn't account for that).

So, to answer your question: I guess you could think of the y-axis label in this plot, instead of Gill Differentials, as maybe something like log(gill / seawater) + K, to adopt the terminology used in Figures 2(b), 3(a), 4(a), and 4(b) from Morton/Marotz 2019. Not sure if this is a mathematically kosher way of writing that out, though.

The Numerator and Denominator legends in this figure apply to another log-ratio. This log-ratio is computed using a few of the highest and lowest ranked features, going by the rankings of the differentials (this is the general workflow proposed in the Morton/Marotz paper, which the Qurro paper provides one method for doing). In Figure 2(a) of the paper, which you showed here, the numerator of this log-ratio was all features annotated as Shewanella (which, from manual inspection, we observed were mostly highly-ranked for the gill differentials), and the denominator was the 98 lowest-ranked features in the gill differentials. So, for each sample represented in this plot in Figure 2(b) of the Qurro paper:

image

... the y-axis could be rewritten as log(sum of all Shewanella counts within this sample / sum of all 98 lowest-ranked-feature counts within this sample). Different log-ratios of features can be tested and visualized, showing different results (e.g. as shown in Figure 1 of the paper). Notably, the Songbird results don't directly have anything to do with this log-ratio at this point -- we chose these features because they were ranked high / low in Songbird's results, but (as shown in Figure 2(e) of the Morton/Marotz paper) we could probably have gotten similar results by ranking the fold-changes output by edgeR, DESeq2, ALDEx2, etc.

The general takeaway is that, using Songbird and Qurro, we can narrow things down to a log-ratio of these two sets of features that are demonstrably good at separating gill samples from seawater samples. This is certainly not the only log-ratio that could do that, though!

(as part of question 2) [...] In the "Interpreting ranks" section, you mention Actinomyces and P. acnes are used as the reference. Is there a way to select the reference feature in the command line version or does this need to be done manually using the MultRegression object from songbird?

This is the same idea as the stuff described above -- these features were used as the "reference" after computing the differentials, in computing the log-ratios of features based on their rankings in the differentials. I don't think the paper's mention of these as a reference was in relation to the ALR procedure Songbird uses internally.

  1. Again, this kind of goes back to [4] but after reading the qurro paper, I'm still not sure how to statistically choose which differentials to use from a data-driven perspective. I understand there are reasons for not using frequentist statistics but I'm thinking about automation and reproducibility. Maybe a clustering algorithm on 1D data? Do you have any insight on this? I suspect the null models generated by songbird could be really useful in achieving this but I didn't see these used in the qurro paper.

  2. One bit I'm still unclear about after reading the paper in detail is how one chooses which features to investigate without any domain knowledge based on the differentials. For example, the logic in Actinomyces and Haemophilus in your pre/post brushing dataset makes a lot of sense but these aren't the most highly differential ranks according to Fig. 2b. If one was to run this on unlabeled data, would it be possible to pick out the same trends or this more or so used for confirming existing hypotheses?

One nice method we added to Qurro is called "autoselection", where you just take the log-ratio of the top N / bottom N (or bottom N / top N, if you prefer) ranked features based on the Songbird differentials. This is a similar idea as to the denominator in Figure 2 of the Qurro paper, which is just an aggregate of "all of the bottom-ranked features for these differentials". When we apply this to the tooth-brushing data used in the Morton/Marotz paper (taking the log-ratio of the bottom 5% to top 5% of features -- we say -5% in the Qurro interface in order to flip the numerator and denominator), we get:

Screenshot_2021-07-14_18-12-52

... Which very clearly distinguishes the samples based on before / after brushing. This is often a good place to get started -- if the resulting log-ratio gives you a clear distinction between metadata categories (and it often does if Songbird was able to find some sort of signal based on your formula), it can be worth investigating what specific features within these groups of features are causing this separation / if any of these features have a plausible biological reason for causing this separation / etc.

For example, one of the Haemophilus ASVs is included in the yellow group of features (i.e. it's one of the top 5% highest ranked features in Songbird's differentials). We could move on from autoselection to creating log-ratios of features using more targeted filtering -- for example, keeping the numerator as an aggregate of the lowest-ranked features (say, those with differentials less than -1), but keeping the denominator as just the features in Haemophilus. This looks something like:

Selected Features Resulting Log-Ratio
lr2 lr

We could stop here, or we could try to filter the numerator to a single taxonomic group of features as well. (Actinomyces isn't included in the numerator features shown here -- its differential is slightly too high for that -- but perhaps we could select it anyway based on prior knowledge about the oral microbiome for what we'd expect to be a particularly "stable" microbe.)

So, to sum up: to an extent it's possible to identify these trends just from the data, but there are of course lots and lots of "degrees of freedom" here. Ideally, the whole manual inspection thing roughly described here would be replaced with a formal procedure describing how exactly to test for discriminatory log-ratios (removing the obvious temptation of trying a hundred different log-ratios until something separates the data). It should be possible to describe an algorithm that looks at these differentials and the log-ratios of features in a more automated way that is less prone to human bias (the ideas you propose with clustering / looking at null models are interesting!), but to my knowledge that exact sort of method doesn't exist (yet).

That said, there have been some recent interesting papers that try to select useful log-ratios for distinguishing between groups of samples (albeit without running Songbird first): CoDaCoRe and selbal both aim to solve this general problem.

Hope this helps clarify things!