coreytcallaghan / cs_sampling_effort

Preliminary thoughts and logic for comments/feedback #4

Closed coreytcallaghan closed 3 years ago

coreytcallaghan commented 3 years ago

So I started with the data for BCR 31 (peninsular Florida) and one year's worth of data (2016) for some initial exploration. See this script. I also started with a 20 km spatial grain, but created grids of 1, 5, 10, and 20 km to check too.

Here is the number of eBird checklists per grid cell:

[figure: map of the number of eBird checklists per 20 km grid cell]

Some super-sampled grids!
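In case it's useful, the gridding/counting step boils down to something like this with sf (the column names and the projection here are placeholders, not necessarily what the linked script uses):

library(sf)
library(dplyr)

# one point per checklist (assumed lon/lat columns)
checklists_sf <- bird_dat %>%
  distinct(SAMPLING_EVENT_IDENTIFIER, LONGITUDE, LATITUDE) %>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326) %>%
  st_transform(5070)  # an equal-area CRS for the US (assumed)

# 20 km grid over the study region; swap cellsize for the 1, 5, and 10 km grains
grid_20km <- st_make_grid(checklists_sf, cellsize = 20000) %>%
  st_sf(grid_id = seq_along(.), geometry = .)

# count checklists per grid cell
checklists_per_cell <- st_join(checklists_sf, grid_20km) %>%
  st_drop_geometry() %>%
  count(grid_id, name = "n_checklists")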

Here is a random grid, and the expected species accumulation curve generated from iNEXT:

[figure: species accumulation curve for a random grid cell, from iNEXT]

iNEXT also allows the sample coverage to be calculated for each grid (detailed in this paper):

[figure: excerpt on sample coverage from the cited paper]

This is cool, and I think it is essentially what I am after in answering the question 'how many eBird checklists does it take to sample a given region/site?'.

But here is where it gets conceptually interesting. If you calculate the observed sample coverage for each grid (N=289) for that year, you get the expected pattern that at some level of sampling the coverage levels off near 1:

[figure: observed sample coverage vs. number of checklists per grid cell]
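The per-grid observed coverage comes straight out of iNEXT's DataInfo(), which reports observed sample coverage in its SC column. A rough sketch, where make_incidence() is a hypothetical helper wrapping the species-by-checklist matrix construction shown in the code further down:

library(iNEXT)
library(dplyr)
library(purrr)
library(tibble)

observed_sc <- bird_dat %>%
  group_split(grid_id) %>%
  map_dfr(function(g) {
    # make_incidence() (hypothetical) returns an incidence-frequency vector
    info <- DataInfo(make_incidence(g), datatype = "incidence_freq")
    tibble(grid_id = unique(g$grid_id),
           n_checklists = info$T,
           sc_observed = info$SC)
  })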

And, of course, the number of eBird checklists in a grid cell positively correlates with the observed SR:

[figure: observed species richness vs. number of eBird checklists per grid cell]

But conceptually, if the objective is to make a map of the effort needed to sample biodiversity in a given temporal domain, so that SR can be compared in time and space, then you don't necessarily need a sample coverage of 1. Plus, sampling is never truly 'complete'. This is discussed further in the Chao and Jost 2012 paper:

[figure: excerpt from Chao and Jost 2012]

With this in mind, I calculated the number of eBird checklists necessary to meet 90% and 95% coverage, respectively (i.e., to make grids comparable).

[figure: violin plot of the number of checklists needed to reach 90% and 95% coverage]
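For anyone wanting to poke at this: iNEXT's estimateD() with base = "coverage" reports, in its t column, the number of sampling units corresponding to a target coverage, so per grid the calculation is roughly this (temp_inext being an incidence-frequency vector like the one built in the code further down):

library(iNEXT)

# estimated number of checklists needed to reach 90% and 95% coverage
est90 <- estimateD(temp_inext, datatype = "incidence_freq",
                   base = "coverage", level = 0.90)
est95 <- estimateD(temp_inext, datatype = "incidence_freq",
                   base = "coverage", level = 0.95)

# 't' is the sample size (number of sampling units) at that coverage
n_needed_90 <- unique(est90$t)
n_needed_95 <- unique(est95$t)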

I initially thought that the number of checklists required would still be strongly correlated with the underlying number of eBird checklists, simply as a sampling artefact. However, that wasn't the case:

[figure: checklists required for a given coverage vs. total checklists per grid cell]

And:

[figure: the same relationship at the other coverage level]

This is good, as it means that other things explain the variability in the number of samples required to meet a specified level of sample coverage. This is where the predictor data comes in (see #5)! Unsurprisingly, the number of samples corresponds reasonably well with the observed SR at both levels:

[figure: samples required vs. observed SR at 90% coverage]

[figure: samples required vs. observed SR at 95% coverage]

But we don't have observed SR in empty grid cells, so this is where the ideas of what predicts species richness and what predicts sampling converge in the storyline.

So, that's where I'm at and thoughts are appreciated as to whether this makes sense logically. I think the biggest conceptual question I have is: does it make sense to strive for 90/95% sample coverage to make regions/sites comparable?

Next step will be to test some predictors and see what predicts that variability in the violin plot above. I've started outlining the paper, and comments are welcome on that whenever you feel motivated.

bowlerbear commented 3 years ago

Hmm - it sounds totally reasonable to say that sampling coverage of 100% isn't necessary - but I think we have to think about which part of the sample we are ignoring by doing this. As I understand it, the first part of these rarefaction curves is determined by the common species, and the later part by the rare species. So by cutting the sample early, we are more likely cutting rare species. Depending on what you want to do with predicted richness, this may or may not be a problem.

In any case, it seems like the result might depend a lot on where we set this threshold. I might propose running the analysis twice - once for a complete sample, and once at 90/95% or similar. As your coloured violin plot above shows, there is more variation among sites in the necessary number of checklists when trying to obtain a higher sampling coverage. I guess the difference is mainly about sites differing more in these rare species.

sablowes commented 3 years ago

Nice start, Corey! This looks great, though I lost the thread through the plots at the bottom. I've one question and one suggestion (assuming I know the answer to my question already) to start...

First, am I right in thinking that the first species accumulation curve accumulates checklists? And so we are in the 'incidence' (cf. abundance) based world. I'm asking to confirm, because you've talked about converting checklists to abundances, but you're not proposing to do that here, right?

So, what follows assumes these are incidence data. Chao has extended the sample completeness framework since the 2012 paper with Jost to include the 'tuning' parameter q (see https://esj-journals.onlinelibrary.wiley.com/doi/full/10.1111/1440-1703.12102). q 'tunes' the sensitivity of the estimated completeness to species' frequencies: values of q > 1 are disproportionately sensitive to highly frequent species, values of q < 1 to infrequent species, and q = 1 weights all species by their detection probabilities (assuming detection is approximated by incidence). If we add this extension, we could get at some of Diana's questions regarding rarity in a quantitative way.
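If it helps to make that concrete, my shorthand for the idea (a paraphrase of their notation, so treat it as a sketch rather than their exact definition):

C_q = \frac{\sum_{i \in \text{detected}} p_i^q}{\sum_{i=1}^{S} p_i^q}

where p_i is species i's relative (incidence) frequency and S is the true richness. So q = 0 recovers the fraction of species detected, q = 1 the classic sample coverage, and q = 2 a completeness dominated by the most frequent species.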

coreytcallaghan commented 3 years ago

Thanks @bowlerbear and @sablowes. I agree with you Diana about common vs rare species. I guess part of my thinking is that from an applied/monitoring perspective it is generally the common species that are mostly of interest anyway - and that's part of the argument there. But I am running it at different levels anyhow to see how it shakes out.

@sablowes, yes that is correct! I am in incidence world here. Abundance is possible with the eBird data, but I am keeping it simple with incidences. Thanks for highlighting that paper - I wasn't aware of it! I agree that calculating completeness at different values of q would be informative. Do you know if they have implemented it in iNEXT or another package yet? Or if the source code is somewhere, so we can implement it?

sablowes commented 3 years ago

I had a quick look at the development version of iNEXT (downloaded from GitHub) yesterday after sending this. It does calculate completeness for different values of q. I'd need to take a look under the hood to confirm that it does what they propose in the new paper, but I suspect it does. The other place to look for code would be the supplement to the new paper. For presentation, I'd likely stick to q = 0 and q = 2 for simplicity. q = 2 gets at the most abundant or common species; q = 0 is more influenced by rare species.


coreytcallaghan commented 3 years ago

I'm with you @sablowes and agree it would be a nice addition/tuning parameter to build into the framework - so what we want is this figure for each grid, taken from their paper:

[figure: sample completeness profile across orders q, from their paper]

It doesn't appear to change much for completeness, though, aside from the SE of the completeness. When I run it at q=0, 1, 2 the completeness is the same, but it should change! However, the estimated species diversity (type=1) is quite different. So perhaps this isn't implemented yet in the devel version? Or I'm doing something wrong!

Some code to play with to reproduce what I'm doing:

# install the development version of iNEXT from GitHub
library(devtools)
install_github('AnneChao/iNEXT')

library(iNEXT)
library(ggplot2)
library(dplyr)
library(tidyr)
library(tibble)

# read in data and join with grids
bird_dat <- readRDS("Data/bcr31_2016_data.RDS") %>%
  left_join(., readRDS("Data/grid_lookups/20km_grid_lookup_bcr31_2016.RDS"))

# get data for a grid to test iNEXT out
ex_grid <- bird_dat %>%
  dplyr::filter(grid_id == 153)

# create a species x site (sampling event) matrix,
# using presence/absence data only and ignoring abundance
temp <- ex_grid %>%
  group_by(COMMON_NAME, SAMPLING_EVENT_IDENTIFIER) %>%
  summarize(present = n()) %>%
  mutate(present = 1) %>%
  pivot_wider(names_from = SAMPLING_EVENT_IDENTIFIER,
              values_from = present, values_fill = 0) %>%
  column_to_rownames(var = "COMMON_NAME") %>%
  as.data.frame()

# convert this data frame into the incidence-frequency format iNEXT expects
temp_inext <- as.incfreq(temp)

# now run iNEXT on the data, with one knot per sampling unit
k <- ncol(temp)
out1 <- iNEXT(temp_inext, q = 1, datatype = "incidence_freq", knots = k, nboot = 20)

ggiNEXT(out1, type = 1)  # sample-size-based rarefaction/extrapolation
ggiNEXT(out1, type = 2)  # sample coverage vs. number of sampling units
ggiNEXT(out1, type = 3)  # coverage-based rarefaction/extrapolation

out0 <- iNEXT(temp_inext, q = 0, datatype = "incidence_freq", knots = k, nboot = 20)

ggiNEXT(out0, type = 1)
ggiNEXT(out0, type = 2)
ggiNEXT(out0, type = 3)

out2 <- iNEXT(temp_inext, q = 2, datatype = "incidence_freq", knots = k, nboot = 20)

ggiNEXT(out2, type = 1)
ggiNEXT(out2, type = 2)
ggiNEXT(out2, type = 3)

# pull out the underlying coverage curves
data_from_inext1 <- fortify.iNEXT(out1, type = 2)
data_from_inext0 <- fortify.iNEXT(out0, type = 2)
data_from_inext2 <- fortify.iNEXT(out2, type = 2)

sablowes commented 3 years ago

Not sure you need that whole profile, but I think having the end points could prove very useful.

Figuring out whether something is doing what you think (or not) is easiest when you know the answer already. So your empirical data are not ideal for this task.

I was hoping that the incidence examples in the 2020 paper would be reproducible, but they are not (or at least not for me). The data in the supplement are incomplete (both of the coral incidence datasets have entries of '40+' in a vector k, where I think each unique category is needed to do the calculations). I might be wrong here, but I haven't yet been able to reproduce the inputs needed for the iNEXT functions from these data.

I also looked into the functions themselves in the package. But that is a rabbit hole I don't want to linger in right now. There are multiple versions of iNEXT available (CRAN and the one at JohnsonHsieh/iNEXT have the same version number, I think, but the one at AnneChao/iNEXT definitely looks to be a different, later version). I think the best path forward is to turn the analytic estimator equations in Table 1 of the 2020 Ecological Research paper into functions. I can help with that if needed, but won't get to it right away.
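To make that concrete, here is roughly what I have in mind for the incidence case - my transcription of the standard q = 0 (Chao2-based) and q = 1 (coverage) estimators, which I *think* correspond to the relevant rows of Table 1, but that needs verifying against the paper:

# x is an incidence-frequency vector as used by iNEXT:
# first element = number of sampling units (T), the rest are the
# per-species incidence frequencies
completeness_q0 <- function(x) {
  nT <- x[1]
  y <- x[-1][x[-1] > 0]
  S_obs <- length(y)
  Q1 <- sum(y == 1)  # uniques
  Q2 <- sum(y == 2)  # duplicates
  # Chao2 richness estimate (bias-corrected form when Q2 == 0)
  S_hat <- if (Q2 > 0) {
    S_obs + ((nT - 1) / nT) * Q1^2 / (2 * Q2)
  } else {
    S_obs + ((nT - 1) / nT) * Q1 * (Q1 - 1) / 2
  }
  S_obs / S_hat
}

completeness_q1 <- function(x) {
  nT <- x[1]
  y <- x[-1][x[-1] > 0]
  U <- sum(y)  # total number of incidences
  Q1 <- sum(y == 1)
  Q2 <- sum(y == 2)
  A <- if (Q2 > 0) {
    (nT - 1) * Q1 / ((nT - 1) * Q1 + 2 * Q2)
  } else {
    (nT - 1) * (Q1 - 1) / ((nT - 1) * (Q1 - 1) + 2)
  }
  # the classic incidence-based sample coverage estimator
  1 - (Q1 / U) * A
}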

jmchase commented 3 years ago

Hey all, I don't have much substantive to add re: the iNEXT code. I just wanted to say:

1) I agree completeness across different q seems useful, but I'd focus on q=0 and q=2, although Chao also favors q=1, so why not.

2) I also agree that I'd do this with known data, rather than empirical data. This is what we built mobsim for. We never included the Chao extrapolation stuff, but it'd be easy enough to take the results over to iNEXT to see how known variation in the parameters that influence richness and evenness affects the iNEXT outputs (see the sketch below): https://cran.r-project.org/web/packages/mobsim/index.html

3) Regarding the seeming problems with the code, etc.: I have had some extremely wonderful conversations with Anne Chao in the past, and I think she'd be delighted to help you bust through these problems and figure out if/where the code isn't working.
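Something along these lines, for example - a rough sketch from memory of the mobsim and iNEXT interfaces, so double-check the argument names:

library(mobsim)
library(iNEXT)

# simulate a community with known richness and evenness
set.seed(42)
comm <- sim_poisson_community(s_pool = 100, n_sim = 5000,
                              sad_type = "lnorm",
                              sad_coef = list(cv_abund = 1))

# sample quadrats to mimic checklists
samp <- sample_quadrats(comm, n_quadrats = 50, quadrat_area = 0.002,
                        method = "random", plot = FALSE)

# species x quadrat incidences into iNEXT
inc <- t(samp$spec_dat > 0) * 1
out <- iNEXT(as.incfreq(as.data.frame(inc)), q = 0,
             datatype = "incidence_freq")

# compare the estimated richness/coverage against the known s_pool
out$AsyEst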

coreytcallaghan commented 3 years ago

Thanks @sablowes for the investigative work - it is appreciated! I dove in as well, including Anne's GitHub repo and code. Agreed, her repo is ahead of CRAN/JohnsonHsieh. And I don't think the supplementary data are reproducible either, as the paper says: "The incidence-based species frequency data are summarized as incidence frequency counts and are tabulated in Table S4.3." So the raw data aren't available as far as I can tell.

I tried getting it to work with data(spider) as well.

So, all in all, I agree that either we write the functions from the formulas in Table 1 and the supplementary material (it gets a bit dense for me!), or we fire off an email to her and ask for a function/code to reproduce Figure 2a/Figure 3a (which we could obviously chop at q=0, 1, 2). I'll talk to Jon about a strategy for an email.

For now, I guess the framework I have in place is all okay and I can proceed to the next steps. We will just have completeness at q=0 for the moment, and can add in q=1 and q=2 as well, assuming we can get them working in the near future.

As @bowlerbear highlights, it could add an interesting part to the story about monitoring and whether monitoring is interested in the rare or common species, and how that changes the final number of checklists, etc.

coreytcallaghan commented 3 years ago

Hi all.

Anne Chao wrote back about the tuning parameter, q. So, just to keep all the information in one place: the code for her 2020 paper is here: https://github.com/AnneChao/iNEXT.4step. And I think what we want to pull out is sc_profile() and use that at q=0, 1, 2.

The objective would then be to make that violin plot above, but with three categories, q=0, 1, 2 (see the sketch below). And then this tuning parameter could be built into the framework.
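Something like this is what I have in mind once the functions are pulled out of her repo (the file name and the signature of sc_profile() are guesses until I've actually run it):

# source the completeness-profile functions from her iNEXT.4step code
source("R/sc_profile_functions.R")  # hypothetical file name

# completeness for one grid at the three orders we care about,
# assuming it takes an incidence-frequency vector and a vector of q
sc_grid <- sc_profile(temp_inext, q = c(0, 1, 2))

# then repeat per grid and make the violin plot with three categories:
# ggplot(all_grids, aes(x = factor(q), y = n_checklists_needed)) +
#   geom_violin()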

I'll have a go at getting her code working tomorrow so we can confirm it is all good.

More soon.

coreytcallaghan commented 3 years ago

Quick update - the code from her repo isn't actually reproducible, as the data aren't there from what I can find, but are somewhere else:

[screenshot: code referencing data files that aren't in the repo]

Nevertheless, I pulled out the necessary functions for sampling coverage and for calculating the completeness profile. These do reproduce q=1 from iNEXT with dummy data, so I suspect they are working fine. Working on it more now.
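A sketch of the kind of check I mean, using iNEXT's built-in ant incidence data as the dummy data (extracted_sc_q1() is a stand-in name for the coverage function pulled from her code):

library(iNEXT)
data(ant)  # built-in incidence-frequency data

x <- ant$h500m

# observed sample coverage from iNEXT itself
sc_inext <- DataInfo(x, datatype = "incidence_freq")$SC

# coverage from the function extracted from Anne Chao's code
sc_extracted <- extracted_sc_q1(x)

all.equal(sc_inext, sc_extracted, tolerance = 1e-4)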

sablowes commented 3 years ago

Check out the Data repo on Professor Chao’s GitHub. I think you’ll find what you need.

Equality between q = 1 from old iNEXT and these new estimators is reassuring to me…that was my expectation based on the 2020 paper.

Nice :)


coreytcallaghan commented 3 years ago

Ah, nice one! I wouldn't have thought there would be a different repo for the data, since they were hard-coded into this repo. But good detective work @sablowes