coreytcallaghan / cs_sampling_effort


prelim results #6

Closed coreytcallaghan closed 3 years ago

coreytcallaghan commented 3 years ago

Hi all. I'm pretty happy with #4 now. So, here are some prelim results, currently for the 20 km grid resolution only.

What I did was use the functions from Chao et al. 2020 to derive a completeness profile for each grid cell that had >25 unique eBird checklists, and then select the observed completeness at q = 0, 1, and 2. I then joined these with the predictor variables and ran a simple random forest model in which the response variable was the number of eBird checklists in a grid and the predictors were the completeness estimate for the chosen q, the habitat heterogeneity in that grid, and the fractional cover of tree, urban, and water within that grid. Using 20% of the data as training data, these random forests predicted the number of checklists relatively well, though at q=0 the performance was not as good as at q=1 or q=2:

[image] https://user-images.githubusercontent.com/28123686/102482956-514f7080-4064-11eb-90eb-682d7eaac0ed.png
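For context, the model-fitting step looks roughly like the sketch below. This is a schematic, not the actual analysis code: grid_data, completeness, and watercoverfraction are placeholder names for the grid-level data frame and its columns.

library(randomForest)

set.seed(42)

# keep grid cells with enough sampling to estimate completeness
dat <- subset(grid_data, number_checklists > 25)

# fit on 20% of the grid cells, as described above
train_idx <- sample(nrow(dat), size = round(0.2 * nrow(dat)))

rf <- randomForest(
  number_checklists ~ completeness + heterogeneity +
    treecoverfraction + urbancoverfraction + watercoverfraction,
  data = dat[train_idx, ]
)

# predictive performance on the held-out grid cells
pred <- predict(rf, newdata = dat[-train_idx, ])
cor(pred, dat$number_checklists[-train_idx])^2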

I think this means that the included predictor variables generally do a good job of explaining the number of checklists needed to estimate richness, but at q=0 the model isn't as good at explaining the rarer species (which makes general sense).

I then predicted the number of checklists necessary to sample species richness across all grids in the study extent (remember, the random forest was only fit on a subset of grids) using the observed values of the predictor variables, but held the estimate (i.e., completeness) constant at 0.85, 0.9, 0.95 and 1.
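Schematically, holding completeness constant just means predicting over all grid cells with the completeness column overwritten (reusing the placeholder rf and grid_data from the sketch above):

# predicted checklists needed at each target completeness level
targets <- c(0.85, 0.90, 0.95, 1.00)

needed <- lapply(targets, function(level) {
  newdat <- grid_data
  newdat$completeness <- level  # hold completeness constant
  predict(rf, newdata = newdat)
})
names(needed) <- paste0("coverage_", targets)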

These results can be summarized as follows:

[image] https://user-images.githubusercontent.com/28123686/102483238-ba36e880-4064-11eb-8ae8-9f8c4a2aca1c.png

Take-aways:

1. It always takes more checklists to sample 'full completeness', which makes sense.
2. q=1 and q=2 show virtually no difference, so for presentation I think I will stick with q=0 and q=2 only.
3. q=0 requires more sampling than q=1 and q=2, which makes sense given q=0 is the most sensitive to rare species.
4. There is very little difference between coverage of 85, 90, and 95%.
5. The pattern is constant across years, so I need to think about whether to collapse years or present them separately to demonstrate the relative robustness (the underlying sampling is increasing somewhat drastically across years, so the fact that the patterns are robust is pretty positive, I think).

More interesting, then, is our ability to predict this back out into space and create a map of the CS effort needed to sample richness in a given year. For this I just chose q=1 and coverage = 95%:

[image: prelim_20km_fig_maps] https://user-images.githubusercontent.com/28123686/102483634-4fd27800-4065-11eb-96d8-03bd2fa76a10.png

This pretty much matches my a priori expectations and my experience of birding in Florida for years. So I think this is pretty cool!

Next, I am churning through the analysis for resolutions of 10 and 5 km (and maybe 1 km) to see if these patterns are robust.

bowlerbear commented 3 years ago

A couple of papers are relevant to this idea of how to sample a community, and to the differences in optimal strategy between rare and common species: https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/1365-2664.12252 and https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12842. The Specht paper shows that you sample rare species best by sampling more where they are likely to be; for common species, the best strategy is just a random, go-everywhere approach.

I am really liking the common vs. rare species angle of the work you have done so far. Can we show the differences spatially between these two groups of species, and where/how much sampling is needed? Could we either (1) use the common vs. rare separation that Thore showed us (but maybe that is unpublished?) or (2) show spatially the discrepancy between the number of checklists needed for q=0 vs. q=2?
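For option (2), something like this terra sketch could work; pred_q0 and pred_q2 are placeholders for rasters of the predicted number of checklists needed at q = 0 and q = 2:

library(terra)

# per-cell discrepancy: extra checklists needed to also capture rare species
discrepancy <- pred_q0 - pred_q2
plot(discrepancy, main = "Checklists needed: q = 0 minus q = 2")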


sablowes commented 3 years ago

I like Diana's idea of producing maps showing the # of checklists required to meet a specific target completeness for rare (q = 0) and common (q = 2) species (option 2). Thore's methods, and MoB more generally, have been developed for abundance data only at this stage.

bowlerbear commented 3 years ago

sorry, I am going backwards here, but anyhow...

To understand this all a bit more, I have been looking into the sc_profile function (from the Chao functions) and the estimateD function (in iNEXT). This is just for the 2019 data at the 20 km resolution.

[image]

The "Estimate" (sampling coverage) is from sc_profile and the rest are from from the estimateD function. I am not quite sure what the t is?

Heterogeneity positively relates to both the observed and the iNEXT-predicted richness (qD for q = 0). Richness also positively relates to the number of checklists and to the estimated sampling coverage.

To be quick and dirty, I first did:

summary(glm(number_checklists ~ Estimate+qD,data=output0,family=poisson))
Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.1240887  0.0405191  -27.74   <2e-16 ***
Estimate     5.0825102  0.0468617  108.46   <2e-16 ***
qD           0.0290475  0.0001537  189.05   <2e-16 ***

I then predicted the number of needed checklists when Estimate = 0.95:

[image]
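Schematically, that step is just the following (mod and needed_checklists are placeholder names):

# refit the quick-and-dirty Poisson GLM from above
mod <- glm(number_checklists ~ Estimate + qD, data = output0, family = poisson)

# predict with sampling coverage held at 0.95
pred_dat <- transform(output0, Estimate = 0.95)
output0$needed_checklists <- predict(mod, newdata = pred_dat, type = "response")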

So this suggests that we just need to predict qD to get a prediction for the number of needed checklists.

Then I did:

summary(glm(qD ~ heterogeneity + urbancoverfraction + treecoverfraction,data=newdata))
Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        75.726139   3.012940  25.134  < 2e-16 ***
heterogeneity       0.004539   0.001126   4.029 7.33e-05 ***
urbancoverfraction  0.349252   0.076919   4.541 8.54e-06 ***
treecoverfraction   0.128547   0.069658   1.845   0.0661 .  
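Chaining the two models together would then look roughly like this (assuming mod is the Poisson GLM above and newdata is the grid-level data frame):

# 1) predict qD from the environmental covariates
qd_mod <- glm(qD ~ heterogeneity + urbancoverfraction + treecoverfraction,
              data = newdata)
pred_env <- newdata
pred_env$qD <- predict(qd_mod, newdata = pred_env)

# 2) feed predicted qD into the checklist model at a target coverage
pred_env$Estimate <- 0.95
pred_env$needed_checklists <- predict(mod, newdata = pred_env, type = "response")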

Not sure whether this helps you or just me - I am understanding the problem better for myself anyhow :)

I will play more later

sablowes commented 3 years ago

Bit too cryptic for me to offer much help, sorry.

I think t is sample size.

bowlerbear commented 3 years ago

yeah, sorry, this is more just parking some graphs here for my own reference than doing anything useful!

jmchase commented 3 years ago

Still, fun to see activity on this project! Do let us know if/when we can help....