Create detection/non-detection data structures

DomHenry commented 2 years ago

The ABAP data are obviously very useful for developing occupancy models. One potentially tricky step is converting the output of getAbapData to a detection/non-detection matrix that can be used as an input into functions from the various occupancy modelling R packages (e.g., unmarked, spOccupancy, ubms, occuR).

It would be useful to have a function that takes the output from the getAbapData call and then converts it to a specified format based on one of the packages mentioned above. There may also be a need to have the formats customised for some of the functions within each package (e.g, when one is using single or multi-species data).

If you think a function like this would be a useful addition to ABAP then I would be happy to work on this and put something together?

patchcervan commented 2 years ago

That would be great thanks @DomHenry! We would probably need one function for each package e.g. abapToUnmarked, abapToSpOccupancy, etc.? Or do you think we could come up with a common structure for all of them? I think unmarked's format is quite different from the others isn't it?

DomHenry commented 2 years ago

Sure thing @patchcervan! I'm not sure about a common structure that can be used for all of the packages. I suspect it may not be possible. Instead of multiple functions we may want to have a single function with an argument that takes the name of the occupancy modelling package and returns a package-specific object? In any case I'll start by looking at the input format across the various packages and see what level of agreement/disagreement we're working with.

patchcervan commented 2 years ago

That would be great thanks Dom. I am thinking, perhaps for tidiness we can meet in the middle, and create a single "user-facing" function and a bunch of helpers that take care of formatting the output for different packages.

DomHenry commented 2 years ago

Ah, I see what you're saying. Yes, I think that makes a lot of sense!

DomHenry commented 2 years ago

Hi @patchcervan,

I've done some preliminary work on three functions: 1) abapToUnmarked_single which formats data used for a single-season OM in both unmarked and ubms because they share the same format; 2) abapToUnmarked_multi which formats for an multi-season occupancy OM in unmarked; and 3) abapToSpOccupancy which formats for a single season spatially explicit OM in spOccupancy (the spatial coordinates come from calculating the centroids of focal pentads). One thing that I added for all three was the ability to extract survey-level covariates of hours and julian day that could be used to inform the detection part of the model. We can decide at a later stage whether this should be an optional argument but it's there to show what can be done (in my opinion the hours covariate should be included in all models!).

The question of how to replicate this functionality to format the data for multi-species models will depend on what format the api returns (i.e., what were talking about in #3)

I think this is a useful start and now is probably a good time to get your feedback. There is definitely room for improvement in some parts of the code. I haven't really looked at occuR yet because I couldn't easily find documentation on the arguments used in the package.

The code includes testing the outputs on the OM functions in simple form just to make sure the outputs look sensible. Let me know what you think!

For ease of reference here are details of the packages:

ubms: Unmarked Bayesian Models with Stan https://github.com/kenkellner/ubms https://kenkellner.com/blog/ubms-vignette.html devtools::install_github("kenkellner/ubms")

unmarked: An R package for analyzing ecological data https://github.com/rbchan/unmarked install.packages("unmarked")

spOccupancy: Fits single-species, multi-species, and integrated spatial occupancy models https://github.com/doserjef/spOccupancy install.packages("spOccupancy")

`abapToUnmarked_single`

library(ubms)
library(ABAP)
library(tidyverse)
library(unmarked)
library(lubridate)

## Extract ABAP data
spp <- 89  # Egyptian Goose
# spp <- 212 # RK Coot

abap_single <- getAbapData(spp, .region_type = "province",
                         .region = "Eastern Cape", .years = 2012)

abapToUnmarked_single <- function(abap_data){

    ## Compile metadata
    pentad_id <- unique(abap_data$Pentad)
    n_sites <- length(pentad_id)
    max_cards <- max((group_by(abap_data,Pentad) %>% tally())$n)

    ## Create empty arrays
    Y <-  array(NA, dim = c(n_sites, max_cards))
    dimnames(Y) <- list(unique(abap_data$Pentad),NULL)

    obs_hours <-  array(NA, dim = c(n_sites, max_cards))
    dimnames(obs_hours) <- list(unique(abap_data$Pentad),NULL)

    obs_jday <-  array(NA, dim = c(n_sites, max_cards))
    dimnames(obs_jday) <- list(unique(abap_data$Pentad),NULL)

    ## Check
    # dim(Y); dimnames(Y)
    # dim(obs_hours); dimnames(obs_jday)
    # dim(obs_hours); dimnames(obs_jday)

    ## Populate detection/non-detection array (Y) and survey covariate arrays
    for(i in seq_along(1:n_sites)){

        dnd_vec <- abap_data %>%
            filter(Pentad == pentad_id[i]) %>%
            pull(Spp)

        dnd_vec <- case_when(dnd_vec == "-" ~ 0,
                             TRUE ~ 1)

        Y[i, 1:length(dnd_vec)] <- dnd_vec

        total_hours <- abap_data %>%
            filter(Pentad == pentad_id[i]) %>%
            pull(TotalHours)

        obs_hours[i, 1:length(total_hours)] <- total_hours

        jday <- abap_data %>%
            filter(Pentad == pentad_id[i]) %>%
            mutate(julian_day = yday(StartDate)) %>%
            pull(julian_day)

        obs_jday[i, 1:length(jday)] <- jday

    }

    ## Create named list of observation covariates
    obs_covs <- list(hours = as.data.frame(obs_hours),
                     jday = as.data.frame(obs_jday))

    ## Create unmarked data frame
    umf_abap <- unmarkedFrameOccu(y = Y, obsCovs = obs_covs, mapInfo = NULL)

    return(umf_abap)

}

um_df <- abapToUnmarked_single(abap_single)

## Checks
plot(um_df)
summary(um_df)

## Fit single season unmarked with detection covariates
fm_um <- occu(~hours + jday ~1, um_df)  # fit a model
fm_um

## Fit single season ubms with detection covariates
fm_ubms <- stan_occu(~hours + jday ~1, um_df, chains=3, cores=3, iter = 1000)
fm_ubms

## Checks
cbind(unmarked=coef(fm_um), stan=coef(fm_ubms))

abapToUnmarked_multi

## Extract ABAP data

# spp <- 89  # Egyptian Goose
spp <- 212 # RK Coot

abap_multi <- getAbapData(spp, .region_type = "province", .region = "Eastern Cape",
                         .years = c(2009,2010,2011,2012))

abapToUnmarked_multi <- function(abap_data){

    ## Compile metadata
    # Note that this will process will only keep pentads
    # sampled across all years. Need to check whether it is a
    # valid approach to include sites that have incomplete data.

    n_occasions <- length(unique(year(abap_data$StartDate)))

    pentads_keep <- abap_data %>%
        group_by(Pentad) %>%
        summarise(occasions = length(unique(year(StartDate)))) %>%
        filter(occasions == n_occasions) %>%
        pull(Pentad)

    abap_data <- abap_data %>%
        filter(Pentad %in% pentads_keep)

    pentad_id <- unique(abap_data$Pentad)
    n_sites <- length(pentad_id)
    max_visits <- max((group_by(abap_data, Pentad, year(StartDate)) %>% tally())$n)
    occasion_vec <- paste0(" 0", 1:n_occasions)
    years_vec <- sort(unique(year(abap_data$StartDate)))

    ## Create empty arrays
    Y <- array(NA, dim = c(n_sites, max_visits, n_occasions),
               dimnames = list(pentad_id, NULL, occasion_vec))

    obs_hours <- array(NA, dim = c(n_sites, max_visits, n_occasions),
                       dimnames = list(pentad_id, NULL, occasion_vec))

    obs_jday <-  array(NA, dim = c(n_sites, max_visits, n_occasions),
                       dimnames = list(pentad_id, NULL, occasion_vec))

    ## Checks
    # dim(Y)
    # dimnames(Y)

    ## Populate detection/non-detection array (Y) and survey covariate arrays
    # Y[sites, visits, years]

    for(k in seq_along(years_vec)){

        year_data <- abap_data %>%
            filter(year(StartDate) == years_vec[k])

        for(i in seq_along(1:n_sites)){

            dnd_vec <- year_data %>%
                filter(Pentad == pentad_id[i]) %>%
                pull(Spp)

            dnd_vec <- case_when(dnd_vec == "-" ~ 0,
                                 TRUE ~ 1)

            Y[i, 1:length(dnd_vec), k] <- dnd_vec

            total_hours <- year_data %>%
                filter(Pentad == pentad_id[i]) %>%
                pull(TotalHours)

            obs_hours[i, 1:length(total_hours), k] <- total_hours

            jday <- year_data %>%
                filter(Pentad == pentad_id[i]) %>%
                mutate(julian_day = yday(StartDate)) %>%
                pull(julian_day)

            obs_jday[i, 1:length(jday), k] <- jday

        }
    }

    ## Checks
    # Y[pentad_id[20],,]
    # Y[pentad_id[5],,]
    # obs_hours[pentad_id[20],,]
    # obs_jday[pentad_id[20],,]

    ## Convert 3 dimensional array to matrix format
    Y <- matrix(Y, n_sites, max_visits*n_occasions)
    obs_hours <-  matrix(obs_hours, n_sites, max_visits*n_occasions)
    obs_jday <-  matrix(obs_jday, n_sites, max_visits*n_occasions)

    ## Create year occasion variable
    year <- matrix(occasion_vec, nrow(Y), n_occasions, byrow=TRUE)

    ## Checks
    # head(Y)
    # dim(Y)
    # head(year)

    ## Create multi-season unmarked data frame
    umf_abap <- unmarkedMultFrame(y = Y,
                                  obsCovs = list(hours = obs_hours,
                                                 jday = obs_jday),
                                  yearlySiteCovs = list(year = year),
                                  numPrimary = n_occasions)

    return(umf_abap)
}

# Generate multi-season occupancy model data frame
um_df <- abapToUnmarked_multi(abap_multi)

## Checks
summary(um_df)

## Fit constant parameter model
um_fm1 <- colext(psiformula= ~1,
                   gammaformula = ~ 1,
                   epsilonformula = ~ 1,
                   pformula = ~ 1,
                   data = um_df, method="BFGS")

summary(um_fm1)

# Model with gamma, epsilon and p year specific
um_fm2 <- colext(psiformula= ~1,
                 gammaformula = ~ year,
                 epsilonformula = ~ year,
                 pformula = ~ year,
                 data = um_df, method="BFGS")
summary(um_fm2)

# Model with gamma, epsilon year specific,  and p year, hours and jday specific
um_fm3 <- colext(psiformula= ~1,
                 gammaformula = ~ year,
                 epsilonformula = ~ year,
                 pformula = ~ year + hours + jday,
                 data = um_df, method="BFGS")
summary(um_fm3)

abapToSpOccupancy

## Extract ABAP data
spp <- 212 # RK Coot
abap_single <- getAbapData(spp, .region_type = "province",
                           .region = "Eastern Cape", .years = 2012)

abap_pentads <- getRegionPentads(.region_type = "province",
                                 .region = "Eastern Cape")

abapToSpOccupancy <- function(abap_data, pentads){

    ## Compile metadata
    pentad_id <- unique(abap_data$Pentad)
    n_sites <- length(pentad_id)
    max_cards <- max((group_by(abap_data,Pentad) %>% tally())$n)

    ## Extract pentad spatial data
    pentads <- pentads %>%
        filter(pentad %in% pentad_id)

    pentad_xy <- st_coordinates(st_centroid(pentads))

    ## Checks
    # head(pentad_xy)
    # plot(pentads$geometry)

    ## Create empty arrays
    Y <-  array(NA, dim = c(n_sites, max_cards))
    dimnames(Y) <- list(unique(abap_data$Pentad),NULL)

    obs_hours <-  array(NA, dim = c(n_sites, max_cards))
    dimnames(obs_hours) <- list(unique(abap_data$Pentad),NULL)

    obs_jday <-  array(NA, dim = c(n_sites, max_cards))
    dimnames(obs_jday) <- list(unique(abap_data$Pentad),NULL)

    ## Check
    # dim(Y); dimnames(Y)
    # dim(obs_hours); dimnames(obs_jday)
    # dim(obs_hours); dimnames(obs_jday)

    ## Populate detection/non-detection array (Y) and survey covariate arrays
    for(i in seq_along(1:n_sites)){

        dnd_vec <- abap_data %>%
            filter(Pentad == pentad_id[i]) %>%
            pull(Spp)

        dnd_vec <- case_when(dnd_vec == "-" ~ 0,
                             TRUE ~ 1)

        Y[i, 1:length(dnd_vec)] <- dnd_vec

        total_hours <- abap_data %>%
            filter(Pentad == pentad_id[i]) %>%
            pull(TotalHours)

        obs_hours[i, 1:length(total_hours)] <- total_hours

        jday <- abap_data %>%
            filter(Pentad == pentad_id[i]) %>%
            mutate(julian_day = yday(StartDate)) %>%
            pull(julian_day)

        obs_jday[i, 1:length(jday)] <- jday

    }

    spocc_df <- list(Y, pentad_xy, list(hours = obs_hours, jday = obs_jday))
    names(spocc_df) <- c("y", "coords", "det.covs")
    return(spocc_df)

}

spo_data <- abapToSpOccupancy(abap_single, abap_pentads)

## Run the spatial model
spocc_fm <- spPGOcc(occ.formula = ~1,
                 det.formula = ~ hours + jday,
                 data = spo_data,
                 n.batch = 400, batch.length = 25,
                 accept.rate = 0.43, cov.model = "exponential",
                 NNGP = TRUE, n.neighbors = 5, n.burn = 2000,
                 n.thin = 4, n.chains = 3, verbose = FALSE, k.fold = 2)

summary(spocc_fm)

patchcervan commented 2 years ago

Thanks a lot Dom! Looking great at a first impression. I need to test the code a bit more slowly and I am currently a bit chocked. Also still waiting for @MichaelBrooks-uct feedback. I will hopefully be able to incorporate this sometime next week...

DomHenry commented 2 years ago

Hey @patchcervan, an update from my side. Coincidentally had some correspondence with Greg about the multi-season models and realised that it's fine to have missing site data across years (just how much can be missing before the model goes awry is another question though). So I've updated the abapToUnmarked_multi to remove the pentad filtering and include a control flow that jumps to the next iteration if pentad data in a particular year are missing. See below.

abapToUnmarked_multi

## Extract ABAP data
# spp <- 89  # Egyptian Goose
spp <- 212 # RK Coot

abap_multi <- getAbapData(spp, .region_type = "province", .region = "Eastern Cape",
                         .years = c(2009,2010,2011,2012))

abapToUnmarked_multi <- function(abap_data){

    n_occasions <- length(unique(year(abap_data$StartDate)))
    pentad_id <- unique(abap_data$Pentad)
    n_sites <- length(pentad_id)
    max_visits <- max((group_by(abap_data, Pentad, year(StartDate)) %>% tally())$n)
    occasion_vec <- paste0(" 0", 1:n_occasions)
    years_vec <- sort(unique(year(abap_data$StartDate)))

    ## Create empty arrays
    Y <- array(NA, dim = c(n_sites, max_visits, n_occasions),
               dimnames = list(pentad_id, NULL, occasion_vec))

    obs_hours <- array(NA, dim = c(n_sites, max_visits, n_occasions),
                       dimnames = list(pentad_id, NULL, occasion_vec))

    obs_jday <-  array(NA, dim = c(n_sites, max_visits, n_occasions),
                       dimnames = list(pentad_id, NULL, occasion_vec))

    ## Populate detection/non-detection array (Y) and survey covariate arrays
    # Y[sites, visits, years]

    for(k in seq_along(years_vec)){

        year_data <- abap_data %>%
            filter(year(StartDate) == years_vec[k])

        for(i in seq_along(1:n_sites)){

            dnd_vec <- year_data %>%
                filter(Pentad == pentad_id[i]) %>%
                pull(Spp)

            if(length(dnd_vec) == 0){
                next
            }

            dnd_vec <- case_when(dnd_vec == "-" ~ 0,
                                 TRUE ~ 1)

            Y[i, 1:length(dnd_vec), k] <- dnd_vec

            total_hours <- year_data %>%
                filter(Pentad == pentad_id[i]) %>%
                pull(TotalHours)

            obs_hours[i, 1:length(total_hours), k] <- total_hours

            jday <- year_data %>%
                filter(Pentad == pentad_id[i]) %>%
                mutate(julian_day = yday(StartDate)) %>%
                pull(julian_day)

            obs_jday[i, 1:length(jday), k] <- jday

        }
    }

    ## Convert 3 dimensional array to matrix format
    Y <- matrix(Y, n_sites, max_visits*n_occasions)
    obs_hours <-  matrix(obs_hours, n_sites, max_visits*n_occasions)
    obs_jday <-  matrix(obs_jday, n_sites, max_visits*n_occasions)

    ## Create year occasion variable
    year <- matrix(occasion_vec, nrow(Y), n_occasions, byrow=TRUE)

    ## Create multi-season unmarked data frame
    umf_abap <- unmarkedMultFrame(y = Y,
                                  obsCovs = list(hours = obs_hours,
                                                 jday = obs_jday),
                                  yearlySiteCovs = list(year = year),
                                  numPrimary = n_occasions)

    return(umf_abap)
}

# Generate multi-season occupancy model data frame
um_df <- abapToUnmarked_multi(abap_multi)

## Checks
summary(um_df)

## Fit constant parameter model
um_fm1 <- colext(psiformula= ~1,
                   gammaformula = ~ 1,
                   epsilonformula = ~ 1,
                   pformula = ~ 1,
                   data = um_df, method="BFGS")

summary(um_fm1)

# Model with gamma, epsilon and p year specific
um_fm2 <- colext(psiformula= ~1,
                 gammaformula = ~ year,
                 epsilonformula = ~ year,
                 pformula = ~ year,
                 data = um_df, method="BFGS")
summary(um_fm2)

# Model with gamma, epsilon year specific,  and p year, hours and jday specific
um_fm3 <- colext(psiformula= ~1,
                 gammaformula = ~ year,
                 epsilonformula = ~ year,
                 pformula = ~ year + hours + jday,
                 data = um_df, method="BFGS")
summary(um_fm3)

patchcervan commented 2 years ago

I've opened a branch to work on these functions. I just committed abapToUnmarked_multi. I've modified a couple of things to bring it to the general format of the ABAP package (good or bad! xD). Would you mind double checking that I didn't break anything? I've made a couple of tests and it seems to work. One thing I've noticed is that I haven't used the package ubms for anything, which makes me a bit suspicious. Is there any specific function that comes from that package? It is one of the reasons why I am explicit about package calls, although it is a bit of a pain, it helps keep track of things.

DomHenry commented 2 years ago

Everything looks good with abapToUnmarked_multi from my side! The modifications have definitely been for the better. I can't believe I've never used dplyr::count() before!

Re ubms: functions from the package aren't used in the data formatting we're doing here. The occupancy modelling functions in ubms take the same objects as those generated by unmarked::unmarkedMultFrame() but I've included the ubms::stan_occu() function in the testing code above just to make sure that our function outputs data that works across both packages.

What are your thoughts in regards to the other two functions? Would you like me to write them up within the branch and submit via PRs?

patchcervan commented 2 years ago

Great, happy that I didn't break anything, thanks @DomHenry! If you could check the format I gave to abapToUnmarked_multi, adapt the other functions and create a PR, it would be great, yes. Many thanks!

patchcervan commented 2 years ago

Hi @DomHenry, thanks for the PR, I've merged them to the formats branch for now. I'll work on writing that main formatting function that calls all the other package-specific ones before merging with the main branch. I just want to avoid having users starting to use the individual functions and then breaking their code when we switch to the single function.

DomHenry commented 2 years ago

Hey @patchcervan, sorry for the misunderstanding - I should have directed the PRs at the formats branch. Hope it wasn't too much of a headache to merge them. I kinda forgot that we were going to call the helper functions and have one overarching user-facing function.

patchcervan commented 2 years ago

No problem, I didn't know how to change the base PR are directed to, but it turned out to be pretty simple! I quite like the idea of a user-facing function, because we can always rearrange things inside this function without breaking anybody's code and probably it would make people lives easier.

patchcervan commented 2 years ago

Hey @DomHenry, I am actually starting to rethink the user-facing function thing. It is mainly because of the documentation. You have created nice documentation for the individual package functions and putting all these under a user-facing function would make us combine all documentation on a single file, which might be a bit too much? Also it seems like individual package functions are quite elaborated to stand on their own? Thoughts?

patchcervan commented 2 years ago

Hi @DomHenry, please let me know when you are finished with these first data structures. We can then merge with the main branch and close this issue :)

DomHenry commented 2 years ago

Hey @patchcervan, just completed the final PR dealing with multi-season data (#23). So once you're happy with that then there shouldn't be anything else to add at this stage.

patchcervan commented 2 years ago

Thanks @DomHenry! The new functions are available on the new minor version update (#26)

AfricaBirdData / ABAP