CornellLabofOrnithology / auk

Working with eBird data in R
https://CornellLabofOrnithology.github.io/auk/
GNU General Public License v3.0
137 stars 22 forks source link

Question about format_unmarked_occu #53

Closed mickayla32 closed 3 years ago

mickayla32 commented 3 years ago

Hi! I'm trying to run occupancy modeling following the eBird best practices methods, but I am using data split up by ecoregion. I realize that this is not the typical way to conduct occupancy modeling while following these eBird best practices methods, but I was hoping to use eBird data from 2015-2019, and land cover data from just 2019 to conduct a single season occupancy model by ecoregion (though I realize that typically a single-season occupancy model uses data from a single year). All of my code is essentially the same up to section 5.2 in the eBird best practices--

#load data
ebird_habitat <- read_csv("data/ebird_habitat_2015-2020.csv")

# Filter prior to creating occupancy model data-- going to keep this the same
# Getting rid of last line, in order to keep all ebird data
ebird_filtered <- filter(ebird_habitat, 
                         number_observers <= 5)
                         #year == max(year))

ebird_filtered <- ebird_filtered %>%
  filter(observation_date >= "2019-04-01" & observation_date <= "2019-08-01" |
           observation_date >= "2018-04-01" & observation_date <= "2018-08-01" |
           observation_date >= "2017-04-01" & observation_date <= "2017-08-01" |
           observation_date >= "2016-04-01" & observation_date <= "2016-08-01" |
           observation_date >= "2015-04-01" & observation_date <= "2015-08-01")

# CHANGED ANNUAL_CLOSURE = TRUE TO FALSE
# And needed to add n_days = 610, because I want to do a single season occupancy model using data from all the years 
occ <- filter_repeat_visits(ebird_filtered, 
                            min_obs = 2, max_obs = 10,
                            annual_closure = FALSE,
                            n_days = 610,
                            date_var = "observation_date",
                            site_vars = c("locality_id", "observer_id"))

I then filtered my data by ecoregion type, starting with the first ecoregion:

#### GREAT PLAINS ECOREGION MODEL ####
occ_great_plains <- occ %>%
  filter(ecoregion_broad == "GREAT PLAINS")

When I get to the format_unmarked_occu() function:

occ_wide_gp <- format_unmarked_occu(occ_great_plains, 
                                 site_id = "site", 
                                 response = "species_observed",
                                 # ecological processes / occupancy covs:
                                 site_covs = c("n_observations", 
                                               "latitude", "longitude", 
                                               "pland_04_deciduous_broadleaf", 
                                               "pland_05_mixed_forest",
                                               "pland_10_grassland",
                                               "pland_11_wetland",
                                               "pland_12_cropland",
                                               "pland_07_open_shrubland",
                                               "pland_13_urban",
                                               "pland_15_barren",
                                               "pland_09_savanna",
                                               "pland_14_mosiac",
                                               "pland_08_woody_savanna"),
                                 # observational processes / obs covs:
                                 obs_covs = c("time_observations_started", 
                                              "duration_minutes", 
                                              "effort_distance_km", 
                                              "number_observers", 
                                              "protocol_type",
                                              "pland_04_deciduous_broadleaf", 
                                              "pland_05_mixed_forest",
                                              "pland_11_wetland"))

I get the error:

Error in format_unmarked_occu(occ_great_plains, site_id = "site", response = "species_observed",  :
  Site-level covariates must be constant across sites

I tried looking at the suggestions in this post, but have not had any luck. So, I tried using the code from the function itself to remove any rows that contain landcover types that are not constant across sites:

x <- occ_great_plains

# assign observation ids within sites
x <- dplyr::group_by_at(x, "site")
x <- dplyr::mutate(x, .obs_id = dplyr::row_number())
x <- dplyr::ungroup(x)

# response to wide
x_resp <- dplyr::select(x, !!rlang::sym("site"), .data$.obs_id, 
                        !!rlang::sym("species_observed"))
x_resp <- tidyr::spread(x_resp, .data$.obs_id, !!rlang::sym("species_observed"))
names(x_resp)[-1] <- paste("y", names(x_resp)[-1], sep = ".")

site_covs <- c("n_observations", 
              "latitude", "longitude", 
              "pland_00_water",
              "pland_01_evergreen_needleleaf", 
              "pland_02_evergreen_broadleaf",
              "pland_03_deciduous_needleleaf",
              "pland_04_deciduous_broadleaf", 
              "pland_05_mixed_forest",
              "pland_06_closed_shrubland",
              "pland_07_open_shrubland",
              "pland_08_woody_savanna",
              "pland_09_savanna",
              "pland_10_grassland",
              "pland_11_wetland",
              "pland_12_cropland",
              "pland_13_urban",
              "pland_14_mosiac",
              "pland_15_barren")

# site-level covariates
x_site <- dplyr::select(x, !!rlang::sym("site"), !!!rlang::syms(site_covs))
# collapse to one row per site
x_site <- dplyr::group_by_at(x_site, "site")
x_site <- dplyr::distinct(x_site)
# check covariates are constant across site
n_unique <- dplyr::count(dplyr::distinct(x_site))$n

# remove each of these rows that are in which(n_unique != 1)
rows_to_remove <- which(n_unique != 1)
test <- occ_great_plains[-rows_to_remove,]

# check--
nrow(occ_great_plains)
nrow(test)
301425 - 290772
# = 10653, which is how long the vector rows_to_remove is, so I think it worked... trying with format_unmarked_occu--

occ_great_plains_2 <- test

# format for unmarked
occ_wide_gp <- format_unmarked_occu(occ_great_plains_2, 
                                 site_id = "site", 
                                 response = "species_observed",
                                 # ecological processes / occupancy covs:
                                 site_covs = c("n_observations", 
                                               "latitude", "longitude", 
                                               "pland_04_deciduous_broadleaf", 
                                               "pland_05_mixed_forest",
                                               "pland_10_grassland",
                                               "pland_11_wetland",
                                               "pland_12_cropland",
                                               "pland_07_open_shrubland",
                                               "pland_13_urban",
                                               "pland_15_barren",
                                               "pland_09_savanna",
                                               "pland_14_mosiac",
                                               "pland_08_woody_savanna"),
                                 # observational processes / obs covs:
                                 obs_covs = c("time_observations_started", 
                                              "duration_minutes", 
                                              "effort_distance_km", 
                                              "number_observers", 
                                              "protocol_type",
                                              "pland_04_deciduous_broadleaf", 
                                              "pland_05_mixed_forest",
                                              "pland_11_wetland"))

But this produces the same error... Is there any way to remove the rows (sites) that are causing this error before running the format_unmarked_occu function?

I can provide my ebird_habitat .csv file if needed!

mstrimas commented 3 years ago

It's been a long time since I've wrote these functions, so I'm not sure how much help I can be, but if you send me the ebird_habitat.csv to mes335@cornell.edu I can take a look

lime-n commented 3 years ago

I remember having this error when I was doing something similar, I think the issue is with the .csv file, it most likely has duplicates. If you were following part 3 of the ebird best practices, then I can probably tell you where this error is coming from.

mickayla32 commented 3 years ago

@lime-n I actually figured it out-- it had to do with using more than 1 year of land cover data. Thank you though!

vjjan91 commented 3 years ago

@lime-n @mstrimas @mickayla32 Still facing a similar issue. I am unsure of how the dataframe could potentially have different values, given the process of covariate preparation. Any suggestions?