Hi there!

@dwarton @JenniNiku @BertvanderVeen I know people have asked this question in the past but I'm not sure if I ever saw a definitive answer. Does the fact that microbial 16S sequencing data is compositional (i.e., it can only be thought of in terms of relative abundance rather than absolute) and therefore non of the species/taxa have independence lead to invalid covariance/correlation structures from gllvms and copulas?

One way to look at this is if one species' (sp. A) actual abundance in the environment increases while all other taxa (spp. B-Z) stay at the same abundance, sequencing data would show a negative correlation between A and all B-Z even though this is just an artifact of the sequencing method; the relative abundance of A increases with the relative abundance of B-Z decrease even though their abundances haven't changed. Unfortunately, the amount of reads returned for a sample has little to due with the amount of microbial cells, so this causes a whole host of statistical issues and requires all methods to deal with the data being compositional rather than absolute counts.

Included are some plots I've generated for the same dataset using various ordination methods, they seem to be relatively consistent in the pattern they are showing. Therefore, I am curious if the dependence structure of this data type is actually leading to issues with model-based ordinations or if most of the species having spurious correlations "cancel each other out": I'm not entirely sure how this works out mathematically. This is just a single case example, but I've seen it in some of my other datasets too. Just for background information, this data if from 16S rRNA microbial data taken along two rivers, Spring Creek and the Imperial River on three dates in Florida. Orange and black points are from the "dry season" from both rivers and blue points are from the wet season in one river.

NMDS with relative abundance data with "Bray-Curtis" (really percentage difference): filt.bray.ra = vegdist(filt.ra, method = "bray") set.seed(1) NMDS.filt.ra = metaMDS(filt.bray.ra, k = 2, autotransform = FALSE)

Stress = 0.071 (obviously this is dependent on sample size, but sometimes it is useful))

PCoA with relative abundance data with "Bray-Curtis" (really percentage difference):

Distance-based redundancy analysis (DB-RDA) with relative abundance data with "Bray-Curtis" (really percentage difference) and four constraining environmental variables: Nitrate+Nitrite, electrical conductivity, total organic nitrogen, and sucralose: final.db.rda = vegan::dbrda(filt.ra ~ EC+NOx+TON+Sucr, data = impute.sim.meta$ximp, distance = "bray")

Unconstrained GLLVM with random intercepts using the negative binomial distribution and sequencing depth of each sample as an offset: Mean-variance plot

set.seed(1) fit.gllvm.genera = gllvm(y = filt.counts, family = "negative.binomial", offset = log(filt.meta$library_size), row.eff = "random", num.lv = 2, sd.errors = FALSE) GLLVM ASV Uncon NB Random Effects

Copula using the negative binomial distribution and sequencing depth of each sample as an offset: set.seed(1) my.glms = manyglm(my.mvabund ~ 1 + offset(log(library_size)), data = sim.meta, family = "negative.binomial") my.copula = ecoCopula::cord(my.glms)

JenniNiku / gllvm

Validity of gllvm (and copulas) on compositional data #125

Stress = 0.071 (obviously this is dependent on sample size, but sometimes it is useful))