CornellLabofOrnithology / auk

Working with eBird data in R
GNU General Public License v3.0

Issues with duplicate rows following zero filling #83

Open mayamonk opened 1 month ago

mayamonk commented 1 month ago

I'm running into what looks like a duplication error while producing zero-filled data from the complete eBird dataset to create a smaller presence-absence dataset. In both tutorials I followed, "Best Practices for Using eBird Data" and "Introduction to auk", the step that collapses the zero-filled data results in each entry being duplicated, in most cases 322 times.

For instance, here is the code I used while following the "Introduction to auk" tutorial:

library(auk)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(lubridate)
library(readr)
library(sf)

states <- c("US-GA", "US-IL", "US-CO", "US-IN", "US-WI", "US-FL", "US-AZ", "US-NY", "US-MO", "US-WA", "US-DE")

input_file <- "/Volumes/UES_LAB/UWIN_acad_perf_analysis/ebird/ebd_US_relJun-2023.txt/ebd_US_relJun-2023.txt"
output_file <- "ebd-filtered-states.txt"
ebird_data <- input_file %>%
  auk_ebd() %>%
  auk_date(date = c("2011-01-01", "2012-12-31")) %>%
  auk_country(country = "United States") %>%
  auk_state(states) %>%
  auk_complete() %>%
  auk_filter(file = output_file) %>%
  read_ebd()

ebird_data %>% glimpse()

f_ebd <- output_file
f_smp <- output_file
filters <- auk_ebd(f_ebd, file_sampling = f_smp) %>%
  auk_state(states) %>%
  auk_complete()
filters

ebd_sed_filtered <- auk_filter(filters, file = "ebd_filteredPA.txt", file_sampling = "sampling_filteredPA.txt")
ebd_sed_filtered


# A tibble: 1,070 × 48

# A tibble: 1,070 × 48

read_ebd(f_smp)
# A tibble: 1,070 × 48

At this point the data show 1,070 entries, and everything has worked so far.

ebd_zf <- auk_zerofill(ebd_sed_filtered)
ebd_zf

Zero-filled EBD: 1,096 unique checklists, for 322 species.

ebd_zf_df <- collapse_zerofill(ebd_zf)
class(ebd_zf_df)
ebd_zf_df

# A tibble: 352,912 × 57

After collapse_zerofill(), each entry appears to be duplicated about 322 times. The other tutorial, from "Best Practices for Using eBird Data", behaves the same way: the entries are duplicated after this line:

zerofill <- auk_zerofill(observations, checklists, collapse = TRUE)

It also produces the same total number of entries: 352,912. Code to remove the duplicates has no effect, for example:

unique.array(zerofill)
unique.matrix(zerofill)

("zerofill" is the name of the zero-filled dataset; neither call results in any change.)

Has anyone run into this issue, or does anyone know a possible solution? Thanks!

mstrimas commented 1 month ago

I'm confused about what's happening here:

f_ebd <- output_file
f_smp <- output_file
filters <- auk_ebd(f_ebd, file_sampling = f_smp)

You seem to be using the same file for both the observation dataset and the checklist dataset. I also don't understand why you have this second round of filtering. The idea is to filter the observations and checklists at the same time, i.e.

observations_input <- "ebd_US_relJun-2023.txt"
checklists_input <- "ebd_US_relJun-2023_sampling.txt"
observations_output <- "ebd-filtered-states_observations.txt"
checklists_output <- "ebd-filtered-states_checklists.txt"
ebird_data <- auk_ebd(observations_input, file_sampling = checklists_input) %>%
  auk_date(date = c("2011-01-01", "2012-12-31")) %>%
  auk_country(country = "United States") %>%
  auk_state(states) %>%
  auk_complete() %>%
  auk_filter(file = observations_output, file_sampling = checklists_output)

The checklist file (ending in _sampling) is provided when you select the "Include sampling event data" check box when downloading the data.
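
A note on the row counts reported in this thread, offered as a sketch and not as a definitive diagnosis: 1,096 checklists times 322 species is exactly 352,912, which matches the collapsed tibble. That is what one row per checklist-species combination would give, so the repeated rows may be one row per species for each checklist and not true duplicates. The toy example below uses base R only; the checklist IDs, the species codes, and the `species_observed` column are hypothetical illustration data, and the `expand.grid()`/`merge()` approach is a stand-in for the idea of zero-filling, not auk's implementation.

```r
# Counts reported in the thread by auk_zerofill() and collapse_zerofill().
n_checklists <- 1096
n_species <- 322
n_rows_reported <- 352912

# TRUE if the collapsed data has one row per checklist-species combination.
n_checklists * n_species == n_rows_reported

# Toy zero-fill in base R (hypothetical data, not the auk implementation).
checklists <- data.frame(checklist_id = c("S1", "S2", "S3"),
                         stringsAsFactors = FALSE)
observations <- data.frame(
  checklist_id = c("S1", "S1", "S3"),
  species = c("amerob", "norcar", "amerob"),
  stringsAsFactors = FALSE
)

# Pair every checklist with every species seen anywhere in the data.
grid <- expand.grid(
  checklist_id = checklists$checklist_id,
  species = unique(observations$species),
  stringsAsFactors = FALSE
)

# Flag the combinations that were actually reported; the rest become FALSE.
observations$species_observed <- TRUE
zf <- merge(grid, observations, by = c("checklist_id", "species"), all.x = TRUE)
zf$species_observed[is.na(zf$species_observed)] <- FALSE

# 3 checklists x 2 species gives 6 rows.
nrow(zf)

# Rows repeat the checklist fields but differ in species,
# so removing exact duplicates drops nothing.
nrow(unique(zf)) == nrow(zf)

# Keeping a single species leaves one row per checklist.
robin <- zf[zf$species == "amerob", ]
robin
```

If that reading is right, filtering the collapsed data to the species of interest, as in the last step of the sketch, would leave one presence-absence row per checklist.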