Closed DomHenry closed 2 years ago
That would be great thanks @DomHenry! We would probably need one function for each package e.g. abapToUnmarked
, abapToSpOccupancy
, etc.? Or do you think we could come up with a common structure for all of them? I think unmarked's format is quite different from the others isn't it?
Sure thing @patchcervan! I'm not sure about a common structure that can be used for all of the packages. I suspect it may not be possible. Instead of multiple functions we may want to have a single function with an argument that takes the name of the occupancy modelling package and returns a package-specific object? In any case I'll start by looking at the input format across the various packages and see what level of agreement/disagreement we're working with.
That would be great thanks Dom. I am thinking, perhaps for tidiness we can meet in the middle, and create a single "user-facing" function and a bunch of helpers that take care of formatting the output for different packages.
Ah, I see what you're saying. Yes, I think that makes a lot of sense!
Hi @patchcervan,
I've done some preliminary work on three functions: 1) abapToUnmarked_single
which formats data used for a single-season OM in both unmarked
and ubms
because they share the same format; 2) abapToUnmarked_multi
which formats for an multi-season occupancy OM in unmarked
; and 3) abapToSpOccupancy
which formats for a single season spatially explicit OM in spOccupancy
(the spatial coordinates come from calculating the centroids of focal pentads). One thing that I added for all three was the ability to extract survey-level covariates of hours and julian day that could be used to inform the detection part of the model. We can decide at a later stage whether this should be an optional argument but it's there to show what can be done (in my opinion the hours covariate should be included in all models!).
The question of how to replicate this functionality to format the data for multi-species models will depend on what format the api returns (i.e., what were talking about in #3)
I think this is a useful start and now is probably a good time to get your feedback. There is definitely room for improvement in some parts of the code. I haven't really looked at occuR
yet because I couldn't easily find documentation on the arguments used in the package.
The code includes testing the outputs on the OM functions in simple form just to make sure the outputs look sensible. Let me know what you think!
For ease of reference here are details of the packages:
ubms: Unmarked Bayesian Models with Stan
https://github.com/kenkellner/ubms
https://kenkellner.com/blog/ubms-vignette.html
devtools::install_github("kenkellner/ubms")
unmarked: An R package for analyzing ecological data
https://github.com/rbchan/unmarked
install.packages("unmarked")
spOccupancy: Fits single-species, multi-species, and integrated spatial occupancy models
https://github.com/doserjef/spOccupancy
install.packages("spOccupancy")
abapToUnmarked_single
library(ubms)
library(ABAP)
library(tidyverse)
library(unmarked)
library(lubridate)
## Extract ABAP data
spp <- 89 # Egyptian Goose
# spp <- 212 # RK Coot
abap_single <- getAbapData(spp, .region_type = "province",
.region = "Eastern Cape", .years = 2012)
abapToUnmarked_single <- function(abap_data){
## Compile metadata
pentad_id <- unique(abap_data$Pentad)
n_sites <- length(pentad_id)
max_cards <- max((group_by(abap_data,Pentad) %>% tally())$n)
## Create empty arrays
Y <- array(NA, dim = c(n_sites, max_cards))
dimnames(Y) <- list(unique(abap_data$Pentad),NULL)
obs_hours <- array(NA, dim = c(n_sites, max_cards))
dimnames(obs_hours) <- list(unique(abap_data$Pentad),NULL)
obs_jday <- array(NA, dim = c(n_sites, max_cards))
dimnames(obs_jday) <- list(unique(abap_data$Pentad),NULL)
## Check
# dim(Y); dimnames(Y)
# dim(obs_hours); dimnames(obs_jday)
# dim(obs_hours); dimnames(obs_jday)
## Populate detection/non-detection array (Y) and survey covariate arrays
for(i in seq_along(1:n_sites)){
dnd_vec <- abap_data %>%
filter(Pentad == pentad_id[i]) %>%
pull(Spp)
dnd_vec <- case_when(dnd_vec == "-" ~ 0,
TRUE ~ 1)
Y[i, 1:length(dnd_vec)] <- dnd_vec
total_hours <- abap_data %>%
filter(Pentad == pentad_id[i]) %>%
pull(TotalHours)
obs_hours[i, 1:length(total_hours)] <- total_hours
jday <- abap_data %>%
filter(Pentad == pentad_id[i]) %>%
mutate(julian_day = yday(StartDate)) %>%
pull(julian_day)
obs_jday[i, 1:length(jday)] <- jday
}
## Create named list of observation covariates
obs_covs <- list(hours = as.data.frame(obs_hours),
jday = as.data.frame(obs_jday))
## Create unmarked data frame
umf_abap <- unmarkedFrameOccu(y = Y, obsCovs = obs_covs, mapInfo = NULL)
return(umf_abap)
}
um_df <- abapToUnmarked_single(abap_single)
## Checks
plot(um_df)
summary(um_df)
## Fit single season unmarked with detection covariates
fm_um <- occu(~hours + jday ~1, um_df) # fit a model
fm_um
## Fit single season ubms with detection covariates
fm_ubms <- stan_occu(~hours + jday ~1, um_df, chains=3, cores=3, iter = 1000)
fm_ubms
## Checks
cbind(unmarked=coef(fm_um), stan=coef(fm_ubms))
## Extract ABAP data
# spp <- 89 # Egyptian Goose
spp <- 212 # RK Coot
abap_multi <- getAbapData(spp, .region_type = "province", .region = "Eastern Cape",
.years = c(2009,2010,2011,2012))
abapToUnmarked_multi <- function(abap_data){
## Compile metadata
# Note that this will process will only keep pentads
# sampled across all years. Need to check whether it is a
# valid approach to include sites that have incomplete data.
n_occasions <- length(unique(year(abap_data$StartDate)))
pentads_keep <- abap_data %>%
group_by(Pentad) %>%
summarise(occasions = length(unique(year(StartDate)))) %>%
filter(occasions == n_occasions) %>%
pull(Pentad)
abap_data <- abap_data %>%
filter(Pentad %in% pentads_keep)
pentad_id <- unique(abap_data$Pentad)
n_sites <- length(pentad_id)
max_visits <- max((group_by(abap_data, Pentad, year(StartDate)) %>% tally())$n)
occasion_vec <- paste0(" 0", 1:n_occasions)
years_vec <- sort(unique(year(abap_data$StartDate)))
## Create empty arrays
Y <- array(NA, dim = c(n_sites, max_visits, n_occasions),
dimnames = list(pentad_id, NULL, occasion_vec))
obs_hours <- array(NA, dim = c(n_sites, max_visits, n_occasions),
dimnames = list(pentad_id, NULL, occasion_vec))
obs_jday <- array(NA, dim = c(n_sites, max_visits, n_occasions),
dimnames = list(pentad_id, NULL, occasion_vec))
## Checks
# dim(Y)
# dimnames(Y)
## Populate detection/non-detection array (Y) and survey covariate arrays
# Y[sites, visits, years]
for(k in seq_along(years_vec)){
year_data <- abap_data %>%
filter(year(StartDate) == years_vec[k])
for(i in seq_along(1:n_sites)){
dnd_vec <- year_data %>%
filter(Pentad == pentad_id[i]) %>%
pull(Spp)
dnd_vec <- case_when(dnd_vec == "-" ~ 0,
TRUE ~ 1)
Y[i, 1:length(dnd_vec), k] <- dnd_vec
total_hours <- year_data %>%
filter(Pentad == pentad_id[i]) %>%
pull(TotalHours)
obs_hours[i, 1:length(total_hours), k] <- total_hours
jday <- year_data %>%
filter(Pentad == pentad_id[i]) %>%
mutate(julian_day = yday(StartDate)) %>%
pull(julian_day)
obs_jday[i, 1:length(jday), k] <- jday
}
}
## Checks
# Y[pentad_id[20],,]
# Y[pentad_id[5],,]
# obs_hours[pentad_id[20],,]
# obs_jday[pentad_id[20],,]
## Convert 3 dimensional array to matrix format
Y <- matrix(Y, n_sites, max_visits*n_occasions)
obs_hours <- matrix(obs_hours, n_sites, max_visits*n_occasions)
obs_jday <- matrix(obs_jday, n_sites, max_visits*n_occasions)
## Create year occasion variable
year <- matrix(occasion_vec, nrow(Y), n_occasions, byrow=TRUE)
## Checks
# head(Y)
# dim(Y)
# head(year)
## Create multi-season unmarked data frame
umf_abap <- unmarkedMultFrame(y = Y,
obsCovs = list(hours = obs_hours,
jday = obs_jday),
yearlySiteCovs = list(year = year),
numPrimary = n_occasions)
return(umf_abap)
}
# Generate multi-season occupancy model data frame
um_df <- abapToUnmarked_multi(abap_multi)
## Checks
summary(um_df)
## Fit constant parameter model
um_fm1 <- colext(psiformula= ~1,
gammaformula = ~ 1,
epsilonformula = ~ 1,
pformula = ~ 1,
data = um_df, method="BFGS")
summary(um_fm1)
# Model with gamma, epsilon and p year specific
um_fm2 <- colext(psiformula= ~1,
gammaformula = ~ year,
epsilonformula = ~ year,
pformula = ~ year,
data = um_df, method="BFGS")
summary(um_fm2)
# Model with gamma, epsilon year specific, and p year, hours and jday specific
um_fm3 <- colext(psiformula= ~1,
gammaformula = ~ year,
epsilonformula = ~ year,
pformula = ~ year + hours + jday,
data = um_df, method="BFGS")
summary(um_fm3)
## Extract ABAP data
spp <- 212 # RK Coot
abap_single <- getAbapData(spp, .region_type = "province",
.region = "Eastern Cape", .years = 2012)
abap_pentads <- getRegionPentads(.region_type = "province",
.region = "Eastern Cape")
abapToSpOccupancy <- function(abap_data, pentads){
## Compile metadata
pentad_id <- unique(abap_data$Pentad)
n_sites <- length(pentad_id)
max_cards <- max((group_by(abap_data,Pentad) %>% tally())$n)
## Extract pentad spatial data
pentads <- pentads %>%
filter(pentad %in% pentad_id)
pentad_xy <- st_coordinates(st_centroid(pentads))
## Checks
# head(pentad_xy)
# plot(pentads$geometry)
## Create empty arrays
Y <- array(NA, dim = c(n_sites, max_cards))
dimnames(Y) <- list(unique(abap_data$Pentad),NULL)
obs_hours <- array(NA, dim = c(n_sites, max_cards))
dimnames(obs_hours) <- list(unique(abap_data$Pentad),NULL)
obs_jday <- array(NA, dim = c(n_sites, max_cards))
dimnames(obs_jday) <- list(unique(abap_data$Pentad),NULL)
## Check
# dim(Y); dimnames(Y)
# dim(obs_hours); dimnames(obs_jday)
# dim(obs_hours); dimnames(obs_jday)
## Populate detection/non-detection array (Y) and survey covariate arrays
for(i in seq_along(1:n_sites)){
dnd_vec <- abap_data %>%
filter(Pentad == pentad_id[i]) %>%
pull(Spp)
dnd_vec <- case_when(dnd_vec == "-" ~ 0,
TRUE ~ 1)
Y[i, 1:length(dnd_vec)] <- dnd_vec
total_hours <- abap_data %>%
filter(Pentad == pentad_id[i]) %>%
pull(TotalHours)
obs_hours[i, 1:length(total_hours)] <- total_hours
jday <- abap_data %>%
filter(Pentad == pentad_id[i]) %>%
mutate(julian_day = yday(StartDate)) %>%
pull(julian_day)
obs_jday[i, 1:length(jday)] <- jday
}
spocc_df <- list(Y, pentad_xy, list(hours = obs_hours, jday = obs_jday))
names(spocc_df) <- c("y", "coords", "det.covs")
return(spocc_df)
}
spo_data <- abapToSpOccupancy(abap_single, abap_pentads)
## Run the spatial model
spocc_fm <- spPGOcc(occ.formula = ~1,
det.formula = ~ hours + jday,
data = spo_data,
n.batch = 400, batch.length = 25,
accept.rate = 0.43, cov.model = "exponential",
NNGP = TRUE, n.neighbors = 5, n.burn = 2000,
n.thin = 4, n.chains = 3, verbose = FALSE, k.fold = 2)
summary(spocc_fm)
Thanks a lot Dom! Looking great at a first impression. I need to test the code a bit more slowly and I am currently a bit chocked. Also still waiting for @MichaelBrooks-uct feedback. I will hopefully be able to incorporate this sometime next week...
Hey @patchcervan, an update from my side. Coincidentally had some correspondence with Greg about the multi-season models and realised that it's fine to have missing site data across years (just how much can be missing before the model goes awry is another question though). So I've updated the abapToUnmarked_multi
to remove the pentad filtering and include a control flow that jumps to the next iteration if pentad data in a particular year are missing. See below.
## Extract ABAP data
# spp <- 89 # Egyptian Goose
spp <- 212 # RK Coot
abap_multi <- getAbapData(spp, .region_type = "province", .region = "Eastern Cape",
.years = c(2009,2010,2011,2012))
abapToUnmarked_multi <- function(abap_data){
n_occasions <- length(unique(year(abap_data$StartDate)))
pentad_id <- unique(abap_data$Pentad)
n_sites <- length(pentad_id)
max_visits <- max((group_by(abap_data, Pentad, year(StartDate)) %>% tally())$n)
occasion_vec <- paste0(" 0", 1:n_occasions)
years_vec <- sort(unique(year(abap_data$StartDate)))
## Create empty arrays
Y <- array(NA, dim = c(n_sites, max_visits, n_occasions),
dimnames = list(pentad_id, NULL, occasion_vec))
obs_hours <- array(NA, dim = c(n_sites, max_visits, n_occasions),
dimnames = list(pentad_id, NULL, occasion_vec))
obs_jday <- array(NA, dim = c(n_sites, max_visits, n_occasions),
dimnames = list(pentad_id, NULL, occasion_vec))
## Populate detection/non-detection array (Y) and survey covariate arrays
# Y[sites, visits, years]
for(k in seq_along(years_vec)){
year_data <- abap_data %>%
filter(year(StartDate) == years_vec[k])
for(i in seq_along(1:n_sites)){
dnd_vec <- year_data %>%
filter(Pentad == pentad_id[i]) %>%
pull(Spp)
if(length(dnd_vec) == 0){
next
}
dnd_vec <- case_when(dnd_vec == "-" ~ 0,
TRUE ~ 1)
Y[i, 1:length(dnd_vec), k] <- dnd_vec
total_hours <- year_data %>%
filter(Pentad == pentad_id[i]) %>%
pull(TotalHours)
obs_hours[i, 1:length(total_hours), k] <- total_hours
jday <- year_data %>%
filter(Pentad == pentad_id[i]) %>%
mutate(julian_day = yday(StartDate)) %>%
pull(julian_day)
obs_jday[i, 1:length(jday), k] <- jday
}
}
## Convert 3 dimensional array to matrix format
Y <- matrix(Y, n_sites, max_visits*n_occasions)
obs_hours <- matrix(obs_hours, n_sites, max_visits*n_occasions)
obs_jday <- matrix(obs_jday, n_sites, max_visits*n_occasions)
## Create year occasion variable
year <- matrix(occasion_vec, nrow(Y), n_occasions, byrow=TRUE)
## Create multi-season unmarked data frame
umf_abap <- unmarkedMultFrame(y = Y,
obsCovs = list(hours = obs_hours,
jday = obs_jday),
yearlySiteCovs = list(year = year),
numPrimary = n_occasions)
return(umf_abap)
}
# Generate multi-season occupancy model data frame
um_df <- abapToUnmarked_multi(abap_multi)
## Checks
summary(um_df)
## Fit constant parameter model
um_fm1 <- colext(psiformula= ~1,
gammaformula = ~ 1,
epsilonformula = ~ 1,
pformula = ~ 1,
data = um_df, method="BFGS")
summary(um_fm1)
# Model with gamma, epsilon and p year specific
um_fm2 <- colext(psiformula= ~1,
gammaformula = ~ year,
epsilonformula = ~ year,
pformula = ~ year,
data = um_df, method="BFGS")
summary(um_fm2)
# Model with gamma, epsilon year specific, and p year, hours and jday specific
um_fm3 <- colext(psiformula= ~1,
gammaformula = ~ year,
epsilonformula = ~ year,
pformula = ~ year + hours + jday,
data = um_df, method="BFGS")
summary(um_fm3)
I've opened a branch to work on these functions. I just committed abapToUnmarked_multi
. I've modified a couple of things to bring it to the general format of the ABAP package (good or bad! xD). Would you mind double checking that I didn't break anything? I've made a couple of tests and it seems to work. One thing I've noticed is that I haven't used the package ubms
for anything, which makes me a bit suspicious. Is there any specific function that comes from that package? It is one of the reasons why I am explicit about package calls, although it is a bit of a pain, it helps keep track of things.
Everything looks good with abapToUnmarked_multi
from my side! The modifications have definitely been for the better. I can't believe I've never used dplyr::count()
before!
Re ubms
: functions from the package aren't used in the data formatting we're doing here. The occupancy modelling functions in ubms
take the same objects as those generated by unmarked::unmarkedMultFrame()
but I've included the ubms::stan_occu()
function in the testing code above just to make sure that our function outputs data that works across both packages.
What are your thoughts in regards to the other two functions? Would you like me to write them up within the branch and submit via PRs?
Great, happy that I didn't break anything, thanks @DomHenry! If you could check the format I gave to abapToUnmarked_multi
, adapt the other functions and create a PR, it would be great, yes. Many thanks!
Hi @DomHenry, thanks for the PR, I've merged them to the formats branch for now. I'll work on writing that main formatting function that calls all the other package-specific ones before merging with the main branch. I just want to avoid having users starting to use the individual functions and then breaking their code when we switch to the single function.
Hey @patchcervan, sorry for the misunderstanding - I should have directed the PRs at the formats branch. Hope it wasn't too much of a headache to merge them. I kinda forgot that we were going to call the helper functions and have one overarching user-facing function.
No problem, I didn't know how to change the base PR are directed to, but it turned out to be pretty simple! I quite like the idea of a user-facing function, because we can always rearrange things inside this function without breaking anybody's code and probably it would make people lives easier.
Hey @DomHenry, I am actually starting to rethink the user-facing function thing. It is mainly because of the documentation. You have created nice documentation for the individual package functions and putting all these under a user-facing function would make us combine all documentation on a single file, which might be a bit too much? Also it seems like individual package functions are quite elaborated to stand on their own? Thoughts?
Hi @DomHenry, please let me know when you are finished with these first data structures. We can then merge with the main branch and close this issue :)
Hey @patchcervan, just completed the final PR dealing with multi-season data (#23). So once you're happy with that then there shouldn't be anything else to add at this stage.
Thanks @DomHenry! The new functions are available on the new minor version update (#26)
The ABAP data are obviously very useful for developing occupancy models. One potentially tricky step is converting the output of
getAbapData
to a detection/non-detection matrix that can be used as an input into functions from the various occupancy modelling R packages (e.g.,unmarked
,spOccupancy
,ubms
,occuR
).It would be useful to have a function that takes the output from the
getAbapData
call and then converts it to a specified format based on one of the packages mentioned above. There may also be a need to have the formats customised for some of the functions within each package (e.g, when one is using single or multi-species data).If you think a function like this would be a useful addition to
ABAP
then I would be happy to work on this and put something together?