Compositionality in GLLVM approach

chassenr commented 4 years ago

Hi @JenniNiku , a colleague recommended your R package for GLLVM and I am very curious, how you applied this approach to microbial data, specifically sequencing data. As sequencing data is compositional, I was wondering how you accounted for this in your approach, as the typical models that are used for species count data (e.g. poisson, NB) are not necessarily suitable for sequencing data?

Thanks!

Cheers, Christiane

JenniNiku commented 3 years ago

Hi @chassenr ! Thanks for the question. I'm not quite sure what the sequencing data mean in practice. Could you explain?

chassenr commented 3 years ago

Hi @JenniNiku Thanks for getting back to me. With sequencing data, I mean sample-by-taxon tables (or similar), i.e. sequence count tables. As sequence counts are constrained by the total library size, they (1) are not independent from each other and (2) cannot be compared across samples. Rarefying such tables to an equal number of sequences per sample has also been shown to be misleading. Therefore I was wondering how you would use such data with GLLVM? How did you take compositionality into consideration in the analysis of the microbial dataset in the vignette?

Thanks!

Cheers, Christiane

BertvanderVeen commented 3 years ago

It sounds to me like you want to include an offset to account for the different number of sequences in samples. However, I'm not sure that answers your question. As long as the sample per species is a count, then applying one of the regular count distributions (Poisson, NB, ZIP) shouldn't be a problem.

chassenr commented 3 years ago

It is more about accounting for the dependence of counts on each other per sample, to avoid spurious correlations. Usually, I use a clr transformation (e.g. ALDEx2, SPIEC-EASI).

BertvanderVeen commented 3 years ago

Would you mind describing your data in detail? At the moment it's not very clear what you're after :).

chassenr commented 3 years ago

I assume my data would be very similar to the microbial dataset used in the vignette. If no internal standard was used for sequencing, the sequence counts will not reflect actual abundances. I will try to explain the compositionality problem with a simple example.

Sample 1: 3 species (absolute abundance: 10, 30, 60; sequence proportions: 0.1, 0.3, 0.6) Sample 2: 2 species (absolute abundance: 10, 30; sequence proportions: 0.25, 0.75)

The total number of sequences per sample (the library size or sequencing depth) has no ecological meaning (it is determine by how much DNA you put on the sequencer), but it constrains the sequence counts per OTU, i.e. all OTU counts always have to add up to the library size (100%). In the example this leads to an 'increase' in the proportions of the 2 species in sample 2 because they have to fill up this sequence space. However, this apparent increase is caused by the compositional character of the data and not a change in actual abundances. Therefore, the counts in an OTU table cannot be used as they are for statistical models and even taking the percentages is insufficient.

BertvanderVeen commented 3 years ago

The above explanation doesn't clarify much for me I'm afraid. I don't know much about sequencing or microbial biology (but maybe Jenni knows more? ).

If you want to model absolute abundances relative to their sequence proportions, an offset should take care of that as far as I can tell.

However, in your previous message you wrote that you want to account for dependence between counts of species in a sample. Are you implying that you expect them to be correlated somehow, i.e. in a pseudo replication kind of way?

Alternatively, you can include additional sample-specific intercepts so that the mean abundance of both samples and species is accounted for outside of the ordination, giving focus on composition.

Do I understand correctly that you have multiple sequences for each sample? If so, how is that reflected in your data?

kbitenieks commented 3 years ago

Hi @chassenr,

Have you tried the example analysis from the vignette? I tried to follow the example with my microbial data (ASV table of 32 samples and 15 000 taxa, 10 environmental variables) but calculations took me days and I ended up with errors or nothing in the plots. I assumed it was because of such a large number of ASVs.

BertvanderVeen commented 3 years ago

@kbitenieks, any specific errors you had trouble with? GLLVMs are complex statistical models, and with so many taxa calculations are bound to take a while. However, just having many taxa should not create any errors. Lack of data in some taxa might, though.

JenniNiku commented 3 years ago

To @chassenr , so is the interest more in the proportions of the absolute abundances than the abundance itself in a sequence? For proportion/coverage data, package has not suitable distribution at the moment. However, we are developing GLLVMs for beta distribution. I haven't considered the possibility for Dirichlet, at all yet, so I don't know if that would work with these models. I don't think the situation is quite the same with the microbial data we used in the vignette, and for the questions we had with that data.

To @kbitenieks , I would recommend to start with a smaller subset of the data where most sparse columns are left out, and see how it works. Also, with large datasets, try to fit a model without calculating standard errors at first, using argument sd.errors = FALSE, as that part takes a long time for huge datasets.

chassenr commented 3 years ago

@JenniNiku My point is that sequence counts do not reflect abundance, and are therefore unsuited for statistical ecological models without transformation. As far as I can tell, the data in the vignette are untransformed sequence counts, which are used directly in GLLVM. Furthermore, the total number of sequences per sample differs between samples. This difference has no ecological meaning. Given the peculiarities of sequence counts, I question whether it is a valid to use such counts in GLLVM. Or did I miss something in the GLLVM package? Introducing a distribution for percentages will also not solve the problem, as sequence percentages are not independent (see compositionality, https://www.frontiersin.org/articles/10.3389/fmicb.2017.02224/full). Therefore the more recent packages for sequence data analysis, use log ratio transformations.

@kbitenieks : Which kind of input data do you use: full OTU table, rarefied to equal library size, rare OTUs filtered?

BertvanderVeen commented 3 years ago

@chassenr, I would argue that statistical models should be formulated to account for the properties of data, also in the case of sequencing data, so that transformations can be avoided as much as possible (and as much information can be retained in the data as possible). As such, the question to answer is (from my perspective) "what GLLVM formulation accounts for the properties of sequencing data?".

trichterhb commented 3 years ago

As chassenr rightfully pointed out, the sample sizes in a NGS data set, i.e. the total observations in a sample, have no biological meaning, and are technical artifacts inherited by the sequencing device, a now generally accepted "feature" of NGS data. There are two consequences resulting from this feature: i) the need for normalizing for unequal sample sizes and ii) to check if the resulting data is compositional or not.

Traditionally, however, it was either scaling to sample sums (relative/proportional data), or subsampling to an equal sampling size ("rarefying"). Both have their problems, and are disregarded for a number of applications (differential abundance, abundance modelling, correlation analysis), but are still probably fine for beta diversity. Weiss et al, 2017 (Microbiome), is a very good start into this topic.

However, the problem does not end with just adjusting for biologically irrelevant sample sizes. There is a growing school of thought which by default treats any NGS dataset as intrinsically compositional, as the sequencer is not able to deliver infinite amounts of observations, and as chassenr has pointed out, is therefore unable to deliver independent data with the known consequences on frequentist statistics.

I think the gllvm package would greatly benefit from having recommendations how to deal best with the two problems.

BertvanderVeen commented 3 years ago

Does this reference: "Negative binomial mixed models for analyzing microbiome count data", give any leads? They account for library size by including an offset, as suggested above.

trichterhb commented 3 years ago

Thanks for the heads up.

I thing applying offsets is basically the same as using proportional data, isnt it? I remember using this approach for GAMMs with zero-inflated negative binomial distributions on NGS data. Does it deal with compositionality, though? I am not a statistician by training, but if i am correct, compositionality is an issue if you care about independency of your data.

BertvanderVeen commented 3 years ago

@trichterhb I would love to comment on that, but I'm still not completely clear on what you mean by "compositionality". Care to have another try to elaborate?

chassenr commented 3 years ago

Here is an evaluation of NB for sequence data: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0224909

trichterhb commented 3 years ago

I think the issue is best explained here: https://www.frontiersin.org/articles/10.3389/fmicb.2017.02224/full (dont be afraid of the Sith-style title :))

Imagine you are measuring soil texture as either loam or silt, in % per unit of soil. Such a data set only holds relative information, and by knowing loam, you also know silt percentage, giving you only one degree of freedom. Likewise, the issue with NGS data is that the last unknown sample size is determined by the instrument (as it e.g. can only yield 5 million observations per run).

the issues with NGS count data are thus two fold: first, unequal, but meaningless sample sizes for which various normalizations exist (some of which cause compositionality within the sample), second, count totals are fixed by the instrument, and are thus intrinsically delivering compositional data. The second point depends on your personal philosophy (there is no observer in existence which can deliver infinite data, and a sequencer delivers much more observations than an eye), but the field of microbial ecology is definitely moving into adopting this view point.

Since gllvm seem to be a very promising approach and since an NGS data is used as an example if i am not mistaken, it would be extremely helpful to have this method "fit" for recent trends in microbial ecology.

ch16S commented 3 years ago

I'm a bit late to the game here but, I believe, as @BertvanderVeen pointed out by using an offset and a random or fixed row effect for each sample is in effect modelling the data as a composition, as well as accounting for sequencing depth differences.

But I digress- the properties that make compositional data different is that fact they are a simplex. So another set of distributions that could be considered are multivariate categorical distributions. This includes the multinomial, the Dirichlet multinomial (DM), negative multinomial (NM) and the generalized Dirichlet multinomial (GDM) (Kim et al. 2018). It should be noted the the multinomial is the multivariate extension of the Poisson and the Dirichlet multinomial is the multivariate extension of negative binomial. The Dirichlet multinomial has outperformed CoDA methods at detecting true biological signals in compositional data (Harrison et al. 2020). It should be noted Harrison et al. show DESeq2 (a negative binomial based method) is also very good at detecting biological signals).

With this in mind @BertvanderVeen and @JenniNiku could I make request to have the NM, GDM and DM distributions considered for the GLLVM package? The R package MGLM has a regression framework for these distributions. As they are multivariate distribution it make be slight different approach to fit the GLLVMs. It would greatly benefit researchers working with NGS data. When I have these model classes in GRETA I have found the log as well softplus link function work well for these distribution classes.

Thanks Chris

Harrison, JG, Calder, WJ, Shastry, V, & Buerkle, CA 2020, ‘Dirichlet‐multinomial modelling outperforms alternatives for analysis of microbiome and other ecological count data’ Molecular Ecology Resources, vol. 20, no. 2, pp. 481–497, doi: 10.1111/1755-0998.13128.

Kim J, Zhang Y, Day J, Zhou H. MGLM: An R Package for Multivariate Categorical Data Analysis. R J. 2018 Jul;10(1):73-90. doi: 10.32614/rj-2018-015. PMID: 32523781; PMCID: PMC7286576.

BertvanderVeen commented 3 years ago

Thanks for that confirmation, @ch16S.

It's an interesting point that you mention, though I'm not too familiar with some of the distributions that you mention. Similarly to the binomial distribution, I'm not sure an analytically tractable solution is available for the multinomial distribution, though I might be wrong. Either how, the development of EVA might make that more straightforward.

Having said that, at first glance it seems to me that those distributions prevent the assumption of independence in the observations (being multivariate in nature, that would make sense!). Currently, that assumption is key in the implementation of GLLVMs in the R-package, since it allows us to provide (in a relatively straightforward way compared to the alternative) tractable solutions to the marginal log-likelihood of GLLVMs. Not assuming independence (e.g., due to spatially structured residuals) makes the maths a lot more complex (for (E)VA at least)!

That's not to say it's not a possibility, I just don't see it happening anytime soon. But @JenniNiku might have a different perspective on that of course.

JenniNiku commented 3 years ago

Interesting points here. I'm not that familiar with all of those distributions, so I would need to get to know them at first.

dwarton commented 2 years ago

Sorry I'm somewhat late to the party here... an appropriate procedure to account for compositionality in GLLVM is to analyse the raw counts but with a row effect in the model (adding the argument row.eff="fixed" or row.eff="random"). This controls for differences in library size so that all remaining terms in the model estimate compositional effects. Bert suggested an offset, which is a similar idea, but it assumes the row effect is known a priori.

There are some curly questions about whether the matrix of model coefficients should then be constrained, partitioned into a "main effect" and remaining compositional terms, but that consideration is more about parameterisation of the model than anything else, and is secondary to the concern of which model to actually fit!

BertvanderVeen commented 2 years ago

Indeed, and my second suggestion a row intercept. Thanks for contributing to the discussion David.

ch16S commented 2 years ago

Out of curiosity is there circumstances where you would use a row offset and a row effect? For example if you wanted to account for library size with an offset, but also use the row effect to account for library sizes' effects on species richness and composition? I'm thinking in the context of metabarcoding data where lower sequenced samples may have fewer taxa and can often be less even.

And in terms of the curly questions about the row effect is there an example of how to partition that code into a main effect and compositional terms?

dwarton commented 2 years ago

row offset and row effect - I guess so, the offset would be accounting for effects we know about, the row effect is picking up residual effects not captured by offset.

code to partition - well I guess I've done something relevant to this in mvabund::manyglm(..., comp=TRUE). Although the start of this discussion was about sequencing data for which this particular function would not be very practical (at least, not without recoding it from scratch, comp=TRUE co-opts code written for a different purpose and it does not come across gracefully).

JenniNiku / gllvm

Compositionality in GLLVM approach #23