Bray-curtis statistical test

adriaaula commented 4 years ago

Hello there!

I wanted to know if there is a way to test with betta changes in the bray-curtis dissimilarity measures across an environmental parameter. I have tried this:

lee_phylum <- tax_glom(Lee, taxrank="Phylum")
divnet_phylum <- lee_phylum %>%
  divnet(ncores = 4)

estimates <- simplifyBeta(divnet_phylum,
                          lee_phylum,
                          "bray-curtis", "char") %>% pull(beta_est)
ses <- sqrt(simplifyBeta(divnet_phylum,
                         lee_phylum,
                         "bray-curtis", "char") %>% pull(beta_var))

bt.bray <- betta(estimates, ses, divnet_phylum$X)$table

I guess that beta diversity is a dissimilarity measurement between two samples and not a measurement of one sample and therefore you cannot predict a change with the betta approach. What would be the best way to statistically test that there is more differences between different conditions?

Additionally, when I compared the bray curtis dissimilarities in my data, I observe different values between the comparison winter.autumn and the comparison autumn.winter. Does this makes sense? I would expect an identical violin plot for these cases.

My intention is to know which are the most dissimilar seasons and obtain an estimate of the general dissimilarity for each case.

Thanks a ton for everything!

adw96 commented 4 years ago

Great question! I will use the Lee dataset as an example.

You can definitely test the null hypothesis that the tree Bray-Curtis distance between glassy and water samples is equal to the true Bray-Curtis distance between biofilm and water samples. (The alternative hypothesis is that they are not the same.)

Here's how you do it:

Estimate the true BC distance between all types of samples:

divnet_phylum_char <- lee_phylum %>%
  divnet(X = "char", ncores = 4)
sb <- simplifyBeta(divnet_phylum_char,
                   lee_phylum,
                   "bray-curtis", "char")

Plug the estimated BC distances into betta() along with their uncertainties, and specify that you want to test for a difference between them

# H0: BC(glassy-water) = BC(biofilm-water)
betta(sb %>% pull(beta_est) %>% head(2), 
      sb %>% pull(beta_var) %>% sqrt %>% head(2), 
      cbind("intercept" = c(1,1), "water" = c(0,1)))$table

(I'm using head since glassy-water is first and biofilm-water is second in sb)

With $p=0.58$, we do not reject the null hypothesis that Bray-Curtis distance between glassy and water samples is equal to the true Bray-Curtis distance between biofilm and water samples.

Hope that helps!

adw96 commented 4 years ago

Oh -- regarding your question about the violin plots. I'm honestly not sure what's causing this. What are the violin plots showing the distribution of? What settings did you use to run DivNet?

I would note that while the distribution of autumn-winter is not the same as the winter-autumn, the difference seems very marginal and would be very surprised if this changed your results. Any good statistical hypothesis test would acknowledge that there's large variance for both distributions.

adriaaula commented 4 years ago

That was a fast response, thank you!

The violin plots are showing the estimate of the Bray Curtis distance distributions without taking into account the error (is there a way of doing that? I don't think so). I ran DivNet at the Genus level with a time series dataset (default values).

As you say the difference is marginal, and both distributions present a large variance because I have many years in the data.

Thanks for everything

adriaaula commented 4 years ago

Ok now I see what happened 😄 I didn't specify the covariate in the divnet call. Here:

divnet_phylum <- lee_phylum %>%
  divnet(ncores = 4)

instead of

divnet_phylum <- lee_phylum %>%
  divnet(X = 'char', ncores = 4)

The sheer number of points in the violin plot is also due to that reason.

I hope this helps somebody 😅

adriaaula commented 4 years ago

Plug the estimated BC distances into betta() along with their uncertainties, and specify that you want to test for a difference between them

In your comparison you used both glassy-water and biofilm-water. We could perform in the same way a comparison between any pair with this (like for example BC(altered-glassy) = BC(biofilm-water)) ?

In the case we commented from Lee dataset it would we written like this:

betta(sb %>% pull(beta_est) %>% .[c(2,6)], 
      sb %>% pull(beta_var) %>% sqrt %>% .[c(2,6)], 
      cbind("intercept" = c(1,1), "covar" = c(0,1)))$table

Does this make sense?

Another question: when testing the beta diversity, I checked with betta the comparisons presenting the biggest difference. The output is the following:

           Estimates Standard Errors p-values
intercept  0.4843485       0.1316977    0.000
winter    -0.2596066       0.1855310    0.162

If I am interpreting the results correctly, this result means that the Bray curtis distance estimate difference is -0.25 (with a big standard error). The pvalue makes me think that such difference is not as big as i was expecting therefore? A plot of the bray curtis comparisons:

season_test.pdf (The betta test is between Summer.Winter and Autumn.Spring).

adw96 / DivNet

Bray-curtis statistical test #49