CornellLabofOrnithology / auk

Working with eBird data in R
https://CornellLabofOrnithology.github.io/auk/
GNU General Public License v3.0

Error: Some checklists in EBD are missing from sampling event data. #46

Open sofbol94 opened 3 years ago

sofbol94 commented 3 years ago

Hello,

I'm new to auk and am working with Great Green Macaw data to estimate presence/absence in different seasons. I've filtered my EBD and sampling event data to Costa Rica and then attempted to zero-fill them. Since I'm working with a sensitive species, I'm using a customized EBD. However, I am getting an error that some checklists in the EBD are missing from the sampling event data. I tried filtering on the last edited date to exclude checklists that were added after Mar 2020. Here is my code:

library(auk)
library(tidyverse)

f_ebd <- "~/data/ebd_GGM.txt"
f_sed <- "~/data/sed_GGM.txt"
ebd_2020_GGMA <- auk_ebd("ebd_sensitive_relMar-2020.txt", 
                      file_sampling = "ebd_sampling_relAug-2020.txt") %>% 
  auk_species("Great Green Macaw") %>% 
  auk_country("Costa Rica") %>%
  auk_date(c("2019-01-01", "2019-03-31")) %>% 
  auk_last_edited(date = c("2019-01-01", "2020-02-29")) %>%
  auk_complete() %>%
  auk_filter(f_ebd, file_sampling = f_sed, overwrite=TRUE)

ebd_only <- read_ebd(f_ebd)
sed_only <- read_sampling(f_sed)

nrow(ebd_only)
#[1] 525
nrow(sed_only)
#[1] 18423

ebd_zf <- auk_zerofill(f_ebd, sampling_events = f_sed)
ebd_zf

Error in auk_zerofill.data.frame(x = ebd, sampling_events = sed, species = species,  : 
  Some checklists in EBD are missing from sampling event data.

Wondering if anyone has any insight into why this may be the case, and how I could solve this, considering that I can't download a more recent customized EBD file.

thanks, Sofia

mstrimas commented 3 years ago

You can't use different versions of the EBD and sampling event data. You have a Mar-2020 EBD and Aug-2020 sampling event data. I understand that since you have sensitive data you probably can't get an Aug-2020 version. It is possible to combine these, but you'll have to do it manually. I'd start by using auk to subset the sampling event data:

library(auk)
library(tidyverse)

f_sed <- "~/data/sed_GGM.txt"
sed_filter <- auk_sampling("ebd_sampling_relAug-2020.txt") %>% 
  auk_country("Costa Rica") %>%
  auk_date(c("2019-01-01", "2019-03-31")) %>% 
  auk_complete() %>%
  auk_filter(f_sed, overwrite=TRUE)

Then read in the EBD directly, no need to subset it first since it's a small file, and subset both the EBD and SED to have the same set of checklists.

sed <- read_sampling(f_sed, unique = FALSE)
ebd <- read_ebd("ebd_sensitive_relMar-2020.txt", unique = FALSE)
ids <- intersect(sed$checklist_id, ebd$checklist_id)
sed <- filter(sed, checklist_id %in% ids)
ebd <- filter(ebd, checklist_id %in% ids)
zf <- auk_zerofill(ebd, sed, collapse = TRUE)

I don't have time to actually test any of this, so you may need to try it out and adjust the code, but this should get you started.

sofbol94 commented 3 years ago

Thanks, that was helpful, i'm having some issues though with the second part.

sed <- read_sampling(f_sed)
ebd <- read_ebd("ebd_sensitive_relMar-2020.txt")
ids <- intersect(sed$checklist_id, ebd$checklist_id)
sed <- filter(sed, checklist_id %in% ids)
ebd <- filter(ebd, checklist_id %in% ids)
zf <- auk_zerofill(ebd, sed, collapse = TRUE)

I took away unique = FALSE because otherwise I had no column called checklist_id, but when I run the intersect command I get no absences: zf contains only checklists where the species was recorded. Any suggestions?

thanks again, sofia

mstrimas commented 3 years ago

Hmmm, as I think about this more, I don't think you can correctly zero fill the data without the matching sampling event data. I think you'll need to request the most recent version of the Great Green Macaw data so it will match the sampling event data.
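A toy illustration of why the intersect approach leaves no absences, using made-up checklist IDs and base R only (a sketch, not real eBird data). It assumes a single-species EBD with one row per checklist where the species was detected:

```r
# Toy data with made-up checklist IDs (not real eBird records)
sed <- data.frame(checklist_id = c("S1", "S2", "S3", "S4"),
                  stringsAsFactors = FALSE)
# Single-species EBD: one row per checklist where the species was detected
ebd <- data.frame(checklist_id = c("S2", "S4", "S9"),
                  observation_count = c(2, 1, 3),
                  stringsAsFactors = FALSE)

# Intersecting keeps only checklists that have a detection
ids <- intersect(sed$checklist_id, ebd$checklist_id)
sed_intersect <- sed[sed$checklist_id %in% ids, , drop = FALSE]
n_absences_intersect <- sum(!(sed_intersect$checklist_id %in% ebd$checklist_id))

# Filtering only the EBD leaves the non-detection checklists in place
ebd_matched <- ebd[ebd$checklist_id %in% sed$checklist_id, , drop = FALSE]
n_absences_full <- sum(!(sed$checklist_id %in% ebd_matched$checklist_id))

# Detections with no matching checklist are lost either way
n_dropped <- nrow(ebd) - nrow(ebd_matched)
```

In this toy case the intersected sampling data has no absences left, while filtering only the EBD keeps two. The detection on S9 is dropped in both cases, which is the part that mismatched versions cannot recover.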

gking-aug commented 3 years ago

I wanted to follow up on this issue as I'm having a similar problem with auk_zerofill giving the error: "Some checklists in EBD are missing from sampling event data."

In my case I have ensured that the versions of the EBD and sampling event data match (both are Jan-2021). However, I am using a custom downloaded EBD dataset (all observations in Canada) and the full sampling event data. Based on a previous issue (now closed -- see here) I'm wondering if a mismatch between the custom dataset and the full sampling event data is the underlying issue? Unfortunately it seems the only way to check this would be to download the complete EBD, and at 90GB I'll admit to being a bit reluctant.

I read in both of the successfully filtered EBD and sampling event files (via read_ebd and read_sampling, respectively) and they definitely contain a different number of records (2864 vs. 2052 for my particular filters -- a bounding box in Alberta). So that is probably the issue. But when I try out the suggestion from @mstrimas to manually subset, I end up with 454 common checklist_id observations.

This is my first project looking at the eBird data, so maybe I'm missing something here, but it seems there is something strange and maybe zero-filled data REQUIRES the full datasets?
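One way to see how many checklists are behind the error is to compare the IDs directly. A minimal sketch on toy data with made-up IDs; in practice the two data frames would come from read_ebd() and read_sampling(), and this assumes both have a checklist_id column:

```r
# Toy stand-ins for the filtered EBD and sampling event data (made-up IDs)
ebd <- data.frame(checklist_id = c("S1", "S2", "S7", "S8"),
                  stringsAsFactors = FALSE)
sed <- data.frame(checklist_id = c("S1", "S2", "S3"),
                  stringsAsFactors = FALSE)

# Checklists that appear in the EBD but not in the sampling event data;
# any entries here are what auk_zerofill() is complaining about
missing_ids <- setdiff(ebd$checklist_id, sed$checklist_id)

# How many are missing, and a peek at which ones
n_missing <- length(missing_ids)
head(missing_ids)
```

A handful of missing IDs points to a small inconsistency between the files, while a large share suggests the two files were filtered differently.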

BrittanyHBrown commented 2 years ago

Hi @gking-aug just wondering if you ever found a solution for your problem?

I am having an almost identical issue to yours, and am having trouble troubleshooting it myself.

Did you end up needing to download the full EBD dataset? Or did you find a way to match up the custom download EBD and sampling event files for zero-filling?

Thanks!

gking-aug commented 2 years ago

Hi @BrittanyHBrown. This is a really good question -- the project was a directed reading and I haven't touched it in a while. Let me quickly investigate what I ended up doing and I will follow-up and post here.

nikkiregimbal commented 2 months ago

Building off the initial question in this thread, I am also new to auk and getting the same error. In my case, I am trying to use auk_zerofill on multiple datasets independently. My code is working for all except one dataset, even though from what I can tell it's exactly the same. I have ensured that the months the data cover are consistent and that all species are reported. Here is my code:

# My code works for 2019 (in addition to 5 other years of data)
US2019sed <- "Acadian Flycatcher/US_2019/ebd_US_acafly_201905_201908_smp_relMay-2024_sampling.txt"
US2019check <- read_sampling(US2019sed)
US2019ebd <- "Acadian Flycatcher/US_2019/ebd_US_acafly_201905_201908_smp_relMay-2024.txt"
US2019obs <- read_sampling(US2019ebd)

US2019checksub <- subset(US2019check, all_species_reported == TRUE)
US2019obssub <- subset(US2019obs, all_species_reported == TRUE)

zfUS19 <- auk_zerofill(US2019obssub, US2019checksub, collapse = TRUE)

When I replicate this for 2020 data, I get the error that some checklists in the EBD are missing from the sampling event data:

US2020sed <- "Acadian Flycatcher/US_2020/ebd_US_acafly_202005_202008_smp_relMay-2024_sampling.txt"
US2020check <- read_sampling(US2020sed)
US2020ebd <- "Acadian Flycatcher/US_2020/ebd_US_acafly_202005_202008_smp_relMay-2024.txt"
US2020obs <- read_sampling(US2020ebd)

US2020checksub <- subset(US2020check, all_species_reported == TRUE)
US2020obssub <- subset(US2020obs, all_species_reported == TRUE)

zfUS20 <- auk_zerofill(US2020obssub, US2020checksub, collapse = TRUE)

If anyone has any ideas of what might be going on, I'd really appreciate some feedback! I tried re-downloading the 2020 dataset a couple times now in case there was something wrong with the download, but get the same error.

mstrimas commented 2 months ago

First, you should be using read_ebd() to read in the observation data, so these lines:

US2019obs <- read_sampling(US2019ebd)
US2020obs <- read_sampling(US2020ebd)

Should be changed to

US2019obs <- read_ebd(US2019ebd)
US2020obs <- read_ebd(US2020ebd)

If you're still having problems after making that change, please post the error and we can try to troubleshoot it. Thanks!

nikkiregimbal commented 2 months ago

Thanks for the catch on the read_ebd @mstrimas. I updated that portion of my code and am still getting the same error.

US2020sed <- "Acadian Flycatcher/US_2020/ebd_US_acafly_202005_202008_smp_relMay-2024_sampling.txt"
US2020check <- read_sampling(US2020sed)

US2020ebd <- "Acadian Flycatcher/US_2020/ebd_US_acafly_202005_202008_smp_relMay-2024.txt"
US2020obs <- read_ebd(US2020ebd)

US2020checksub <- subset(US2020check, all_species_reported == TRUE)
US2020obssub <- subset(US2020obs, all_species_reported == TRUE)

zfUS20 <- auk_zerofill(US2020obssub, US2020checksub, collapse = TRUE)

Error in auk_zerofill.data.frame(US2020obssub, US2020checksub, collapse = TRUE) : 
  Some checklists in EBD are missing from sampling event data.

I am stumped because the same code is working on other datasets. Thanks!

mstrimas commented 2 months ago

This is a rare bug that I've described here: https://github.com/CornellLabofOrnithology/auk/issues/79#issuecomment-1934555208

In your case, right before you call auk_zerofill(), add something like the following:

US2020obssub <- US2020obssub[US2020obssub$checklist_id %in% US2020checksub$checklist_id, ]
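If it helps to know how many rows this removes, the same filter can be wrapped in a small function. This is only a sketch on toy data: drop_unmatched() is a hypothetical helper, not part of auk, and the checklist IDs are made up.

```r
# Hypothetical helper (not part of auk): drop observations whose checklist
# is absent from the sampling event data and report how many were removed
drop_unmatched <- function(obs, checklists) {
  keep <- obs$checklist_id %in% checklists$checklist_id
  message(sum(!keep), " observation row(s) dropped")
  obs[keep, , drop = FALSE]
}

# Toy data with made-up IDs standing in for US2020obssub and US2020checksub
obs <- data.frame(checklist_id = c("S1", "S2", "S9"),
                  stringsAsFactors = FALSE)
checklists <- data.frame(checklist_id = c("S1", "S2", "S3"),
                         stringsAsFactors = FALSE)

obs_clean <- drop_unmatched(obs, checklists)
```

With the real data, the call would be something like US2020obssub <- drop_unmatched(US2020obssub, US2020checksub) right before auk_zerofill(). A large dropped count would be worth investigating rather than silently discarding.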