beta diversity calculations

MoBiodiv / mobr

Tools for analyzing changes in diversity across scales

Other

23 stars 18 forks source link

beta diversity calculations #255

Closed dmcglinn closed 5 months ago

dmcglinn commented 4 years ago

I've been working on updating the mobr code for computing the beta diversity metrics. I'll post a link to my prototype code in a bit. Here are the main changes I'm in the process of implementing let me know if you have suggestions or questions: The main function for carrying out diversity metric estimation is get_mob_stats. I want that function to:

estimate diversity at the sample (alpha), group (gamma), and uber (study) scales.
estimate beta diversity between gamma and alpha scales and study and gamma scales
extend the bootstrap procedure currently implemented only for group (gamma) estimates to also estimate beta diversity interval and study scale diversity interval.
remove default behavior or using the permutation test of randomly swapped sites for testing of significance

Some quick questions I have:

How does "study" scale sound in lieu of "uber" scale? An alternative that Harte uses is "anchor" scale or I could also see "pool" scale being appropriate.
@sablowes and @jmchase indicated that the permutation test implemented in get_mob_stats was often not appropriate or was inappropriately interpreted by users. Can you please elaborate so we can update the documentation with these notes and modify the default behavior appropriately. I still feel that there are certain contexts in which a random sample swap is appropriate to test the hypothesis of are the diversity indices different between treatments. It's a pretty standard non-parametric procedure for doing this.
@T-Engel is the plan that we're going to copy the beta_C package functions into mobr or are you wanting to keep that as a separate R package?

dmcglinn commented 4 years ago

Quick follow up additional question on beta_C. I'm working with the invasion dataset which has some quirks but is always my go to test dataset in part because it is a bit quirky. Anyways here is the target coverage for the two treatments after removing any sites that have zero individuals:

uninvaded   invaded 
0.8262747 0.2112928

So the recommendation would be to then take the smallest coverage value 0.21 as the target coverage to calculate beta_C with but unfortunately this coverage corresponds to less than one individual in the case of the uninvaded treatment so an NA is returned.

> betaC::beta_C(comm_tmp[groups_tmp == grp, ],
+                                          .7)
[1] 0.9967836
attr(,"C")
[1] 0.7
attr(,"N")
[1] 6
> betaC::beta_C(comm_tmp[groups_tmp == grp, ],
+                                          .5)
[1] 0.9957572
attr(,"C")
[1] 0.5
attr(,"N")
[1] 2
> betaC::beta_C(comm_tmp[groups_tmp == grp, ],
+                                          .3)
[1] NA
attr(,"C")
[1] 0.3
attr(,"N")
[1] 1

One thing that doesn't make sense is when I do the inverse N calculation from coverage:

sapply(group_levels, function(grp)
+                                betaC::invChat(comm_tmp[groups_tmp == grp, ],
+                                              target_coverage))
uninvaded   invaded 
 18.89982  52.04609

I see that at that coverage really I should be at a higher number of individuals. For BCI that gives the approriate results for the 55% coverage. @T-Engel any recommendations here for how to proceed? Thanks!

sablowes commented 4 years ago

This all sounds good to me, Dan. I like the 'study' scale, but that is perhaps due to my current 'synthesis' focus, though it seems applicable even when you are analysing data from a single study. One use case where the permutation is inappropriate is for observational studies, where there are potential confounding or other covariates that should be included in any statistical test of the 'treatment' effect, e.g., our Mediterranean MPA study.

Another thought here, following on from Thore and Jon's reminder that we really want this function to be calculating metrics in a flexible way, I wonder if a change of name to get_mob_metrics (or similar, but not 'stats') is worthwhile?

jmchase commented 4 years ago

Hey Dan, I just talked to Thore at length and I think he has an idea of what the current concerns/issues are. In essence, we want the function to be able to provide all of the 'metrics' of interest, even from a single matrix. Then, the driver can do whatever they want with those analyses. The permutation test (and much of the structure) is based on the supposition that there are two treatments to be compared, but this is often not the case.

jmchase commented 4 years ago

Regardless, I personally don't like 'study level', because this is agnostic to scale. One option is

plot scale=alpha site scale=gamma

jmchase commented 4 years ago

oops, sorry. I just read more carefully what you were asking

I like: Sample or plot for alpha site or group for gamma study or pool for 'uber'

T-Engel commented 4 years ago

Yes, the betaC code should be integrated into mobr. The the betaC package I wrote rather serves the purpose of a research compendium for the paper.

Concerning the invasion dataset, the problem is that removing empty samples is not enough. There are some samples that have only very low abundances (1,2,3,4 individuals). I would probably remove those.

library(mobr)
library(betaC)
library(tidyverse)
data("inv_comm")
data("inv_plot_attr")

# data as nested dataframe excluding samples with low abundances
inv= cbind(inv_comm, inv_plot_attr) %>% 
    group_by(group) %>% 
    nest(mat=starts_with("sp"),
         sample_coords= c(x,y)
         ) %>% 
    mutate(mat=map(mat,
                   function(x) { x[rowSums(x)>4,]}
                   ),
           mat=map(mat,
                   function(x) { x[,colSums(x)>0]}
           ),
           samples= map_dbl(mat, nrow),

    )

inv

> inv
# A tibble: 2 x 4
# Groups:   group [2]
  group     mat                sample_coords     samples
  <fct>     <list>             <list>              <dbl>
1 uninvaded <tibble [50 x 83]> <tibble [50 x 2]>      50
2 invaded   <tibble [39 x 65]> <tibble [50 x 2]>      39

As a resullt, the number of samples is unbalanced between groups. Therefore, I use the resampling procedure betaC::beta_stand() to calculate betaC based on 39 samples.

samples_stand = min(inv$samples)

#calculate target coverage
inv<- inv %>% mutate(C_target= map(mat, beta_stand, setsize= samples_stand,func = list("C_target") ),
                     C_target= map_dbl(C_target, function(x) x[1,1])
                     )
inv
> inv
# A tibble: 2 x 5
# Groups:   group [2]
  group     mat                sample_coords     samples C_target
  <fct>     <list>             <list>              <dbl>    <dbl>
1 uninvaded <tibble [50 x 83]> <tibble [50 x 2]>      50    0.831
2 invaded   <tibble [39 x 65]> <tibble [50 x 2]>      39    0.542

Now calculate betaC for a coverage of 0.542

C_target_min= min(inv$C_target)
inv<- inv %>% 
    mutate(betaC= map(mat, beta_stand, 
                      setsize= samples_stand,
                      func = list("beta_C"), 
                      args=list(C= C_target_min) , 
                      summarise=F))

Now we plot it:

inv %>%
    unnest(betaC) %>% 
    ggplot(aes(x=group,y=betaC))+
    geom_boxplot()

So the invaded site is more spatially aggregated than the uninvaded one.

The problem you're experiencing with invChat happens because you're supplying a matrix, I think. The function expects a vector of abundances. That means in your example you probably need to use colSums()on your matrix.

dmcglinn commented 4 years ago

Thanks everyone for the very helpful input!

I've just pushed an additional commit to the dev branch: https://github.com/MoBiodiv/mobr/commit/316776f4960cdb04b7229ae5749471d0a18e733c which I think takes us a step closer of where we want to be. Now you can write the following:

library(mobr)
library(betaC)
inv= cbind(inv_comm, inv_plot_attr) %>% 

    group_by(group) %>% 
    nest(mat=starts_with("sp"),
         sample_coords= c(x,y)
         ) %>% 
    mutate(mat=map(mat,
                   function(x) { x[rowSums(x)>4,]}
                   ),
           mat=map(mat,
                   function(x) { x[,colSums(x)>0]}
           ),
           samples= map_dbl(mat, nrow),

    )

# compute normal div stat
index = 'S'
div_groups <- inv %>% mutate(div = map(mat, calc_comm_div, index = index))
div_groups
head(div_groups$div[[1]])
  scale index sample_size effort coverage value
1 alpha     S           1     NA       NA    12
2 alpha     S           1     NA       NA     7
3 alpha     S           1     NA       NA    11
4 alpha     S           1     NA       NA    11
5 alpha     S           1     NA       NA     5
6 alpha     S           1     NA       NA     5
tail(div_groups$div[[1]])
   scale  index sample_size effort  coverage     value
48 alpha      S           1     NA        NA  9.000000
49 alpha      S           1     NA        NA 10.000000
50 alpha      S           1     NA        NA 22.000000
51 gamma      S          50     NA        NA 83.000000
52  beta beta_S          50     NA        NA  8.185404
53  beta beta_C          50     40 0.8262747  1.309023

abund_study = matrix(colSums(inv_comm), ncol = ncol(inv_comm))
div_study = calc_comm_div(abund_study, index = index)
div_study
  scale  index sample_size effort  coverage value
1 alpha      S           1     NA        NA   111
2 gamma      S           1     NA        NA   111
3  beta beta_S           1     NA        NA     1
4  beta beta_C           1   9668 0.9967475     1

The key changes are:

made a tidy little function for computing pt diversities called calc_div https://github.com/MoBiodiv/mobr/blob/316776f4960cdb04b7229ae5749471d0a18e733c/R/mobr_boxplots.R#L161
made it easier to work with the function calc_biodiv from the nested list data structure that @T-Engel has proposed and I like
uses betaC functions to compute beta coverage
computes metrics at the alpha, gramma, and study scales.

I think it would be a good idea to report for every diversity statistic three attributes:

samples size: how many plots
effort: how many individuals
coverage: the estimated coverage that the diversity estimate was taken at.

Right now you can see that sometimes these are reported correctly and there are placeholders but this needs improvement.

To do from here:

[ ] fix the wrapper function get_mob_stats to work with this improved work flow
- [ ] include subsampling as Thore does above when unbalanced design
[ ] fix downstream graphics work with new data products
[ ] update plot_rarefaction to include study scale curve

T-Engel commented 4 years ago

Good job, Dan! I like that you got rid of calc_biodiv() and created 2 other functions instead. So calc_div() works on a single vector and calc_comm_div() works on a matrix where a row is an alpha sample and the total matrix is the gamma of that group.

I also like that calc_comm_div() can give you all three scales. Maybe it's worth thinking about whether we want to add a "scales" argument to specify which ones should be returned. I'm not sure if it is a good idea for betaC to be returned as the default when index is "S". That can be misleading, I think. Especially, because now you coded it such that target coverage is directly determined from the matrix rather than supplying a user-defined target coverage that would be constant across all sites. Maybe "beta_C" should just become its own index?

I think it's a good idea to report, sample size, effort and coverage with each estimate. In beta_C I only spit out coverage and number of individuals as attributes.

dmcglinn commented 4 years ago

Hey Thore, yes calc_div and calc_comm_div are as you described.

I went ahead and quickly added a scales argument as you suggested - I think that's smart: https://github.com/MoBiodiv/mobr/commit/101eb15d4797412aeb32bb97b293d2b757735e3a

One potentially useful way to avoid having a separate function for beta_C is to add a diversity index called S_C that we encode in the function calc_div. This could either accept a target coverage or if none is supplied go ahead and compute a target coverage to use. Then beta_C would be spit out when someone ran the following:

calc_comm_div(inv_comm, index = 'S_C', scales = 'beta')

let me know what you think.

dmcglinn commented 3 years ago

hey folks, I just pushed a script to dev branch examining some grasshopper data from Konza prairie: https://github.com/MoBiodiv/mobr/commit/a82d5e057b36e99dbae8a57d9ceb5a32ba305265 Can you please take a look. In particular I need a tip on how to use the nested data structure that you like for generating graphics via ggplot2 (@T-Engel or @sablowes). I'm not sure I carried out the nesting in the appropriate manner here as well: https://github.com/MoBiodiv/mobr/blob/dev/scripts/case_studies.R#L110

Here is the prelim look at the patch-burn treatment vs control pastures for the study across 9 years of data where 4 replicates are sampled each year in each treatment group. Note this is a subset of a much bigger dataset down to just two pastures - the pasture id (i.e., Watershed) is a mess and I'm waiting to hear back about this but this is good for now. So the very simple look at the IBRs here suggest that the patch-burn treatment does not increase diversity as expected in fact it actually decreases it a bit.