JenniNiku / gllvm

Generalized Linear Latent Variable Models
https://jenniniku.github.io/gllvm/
48 stars 19 forks source link

Validity of gllvm (and copulas) on compositional data #125

Closed mike-kratz closed 1 year ago

mike-kratz commented 1 year ago

Hi there!

@dwarton @JenniNiku @BertvanderVeen I know people have asked this question in the past but I'm not sure if I ever saw a definitive answer. Does the fact that microbial 16S sequencing data is compositional (i.e., it can only be thought of in terms of relative abundance rather than absolute) and therefore non of the species/taxa have independence lead to invalid covariance/correlation structures from gllvms and copulas?

One way to look at this is if one species' (sp. A) actual abundance in the environment increases while all other taxa (spp. B-Z) stay at the same abundance, sequencing data would show a negative correlation between A and all B-Z even though this is just an artifact of the sequencing method; the relative abundance of A increases with the relative abundance of B-Z decrease even though their abundances haven't changed. Unfortunately, the amount of reads returned for a sample has little to due with the amount of microbial cells, so this causes a whole host of statistical issues and requires all methods to deal with the data being compositional rather than absolute counts.

Included are some plots I've generated for the same dataset using various ordination methods, they seem to be relatively consistent in the pattern they are showing. Therefore, I am curious if the dependence structure of this data type is actually leading to issues with model-based ordinations or if most of the species having spurious correlations "cancel each other out": I'm not entirely sure how this works out mathematically. This is just a single case example, but I've seen it in some of my other datasets too. Just for background information, this data if from 16S rRNA microbial data taken along two rivers, Spring Creek and the Imperial River on three dates in Florida. Orange and black points are from the "dry season" from both rivers and blue points are from the wet season in one river.

NMDS with relative abundance data with "Bray-Curtis" (really percentage difference): filt.bray.ra = vegdist(filt.ra, method = "bray") set.seed(1) NMDS.filt.ra = metaMDS(filt.bray.ra, k = 2, autotransform = FALSE)

Stress = 0.071 (obviously this is dependent on sample size, but sometimes it is useful))

image

image

PCoA with relative abundance data with "Bray-Curtis" (really percentage difference): image

Distance-based redundancy analysis (DB-RDA) with relative abundance data with "Bray-Curtis" (really percentage difference) and four constraining environmental variables: Nitrate+Nitrite, electrical conductivity, total organic nitrogen, and sucralose: final.db.rda = vegan::dbrda(filt.ra ~ EC+NOx+TON+Sucr, data = impute.sim.meta$ximp, distance = "bray") image

Unconstrained GLLVM with random intercepts using the negative binomial distribution and sequencing depth of each sample as an offset: Mean-variance plot image

set.seed(1) fit.gllvm.genera = gllvm(y = filt.counts, family = "negative.binomial", offset = log(filt.meta$library_size), row.eff = "random", num.lv = 2, sd.errors = FALSE) GLLVM ASV Uncon NB Random Effects

Copula using the negative binomial distribution and sequencing depth of each sample as an offset: set.seed(1) my.glms = manyglm(my.mvabund ~ 1 + offset(log(library_size)), data = sim.meta, family = "negative.binomial") my.copula = ecoCopula::cord(my.glms) image

BertvanderVeen commented 1 year ago

Thanks Mike! Since this is not an issue, but a discussion, I will convert it to that (see discussions tab). I will have a more thorough read of your post tomorrow!