Open mayamonk opened 1 month ago
I'm confused about what's happening here:
f_ebd <- output_file
f_smp <- output_file
filters <- auk_ebd(f_ebd, file_sampling = f_smp)
You seem to be using the same file for both the observation dataset and the checklist dataset. I also don't understand why you have a second round of filtering. The idea is to filter the observations and checklists at the same time, i.e.
observations_input <- "ebd_US_relJun-2023.txt"
checklists_input <- "ebd_US_relJun-2023_sampling.txt"
observations_output <- "ebd-filtered-states_observations.txt"
checklists_output <- "ebd-filtered-states_checklists.txt"
ebird_data <- auk_ebd(observations_input, file_sampling = checklists_input) %>%
  auk_date(date = c("2011-01-01", "2012-12-31")) %>%
  auk_country(country = "United States") %>%
  auk_state(states) %>%
  auk_complete() %>%
  auk_filter(file = observations_output, file_sampling = checklists_output)
The checklist file (ending in _sampling) is provided when you select the "Include sampling event data" check box when downloading the data.
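Since the original code pointed both arguments at the same file, a quick check before calling auk_ebd() can catch that mix-up. This is only a minimal sketch: it assumes the download follows the naming in the example above (the checklist file is the observation file name with `_sampling` added before `.txt`), and `sampling_path()` is a hypothetical helper, not part of auk.

```r
# Hypothetical helper (not an auk function): derive the checklist (sampling)
# file name from the observation file name, assuming the "_sampling" suffix.
sampling_path <- function(observations_path) {
  sub("\\.txt$", "_sampling.txt", observations_path)
}

observations_input <- "ebd_US_relJun-2023.txt"
checklists_input <- sampling_path(observations_input)
checklists_input  # "ebd_US_relJun-2023_sampling.txt"

# The observation and checklist inputs must be two different files.
stopifnot(observations_input != checklists_input)
```

If the downloaded files are named differently, set `checklists_input` by hand and keep only the final check.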
I'm running into what looks like a duplication error while trying to produce zero-filled data, using the complete eBird dataset to create a smaller presence-absence dataset. I followed both tutorials, "Best Practices for Using eBird Data" (https://ebird.github.io/ebird-best-practices/) and "Introduction to auk" (https://cornelllabofornithology.github.io/auk/articles/auk.html#quick-start), and in both the step that collapses the zero-filled data results in each entry being duplicated, in most cases about 322 times.
For instance, here is the code I used while following the "Introduction to auk" tutorial:
library(auk)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(lubridate)
library(readr)
library(sf)
states <- c("US-GA", "US-IL", "US-CO", "US-IN", "US-WI", "US-FL", "US-AZ", "US-NY", "US-MO", "US-WA", "US-DE")
input_file <- "/Volumes/UES_LAB/UWIN_acad_perf_analysis/ebird/ebd_US_relJun-2023.txt/ebd_US_relJun-2023.txt"
output_file <- "ebd-filtered-states.txt"
ebird_data <- input_file %>%
  auk_ebd() %>%
  auk_date(date = c("2011-01-01", "2012-12-31")) %>%
  auk_country(country = "United States") %>%
  auk_state(states) %>%
  auk_complete() %>%
  auk_filter(file = output_file) %>%
  read_ebd()
ebird_data %>% glimpse()
f_ebd <- output_file
f_smp <- output_file
filters <- auk_ebd(f_ebd, file_sampling = f_smp) %>%
  auk_state(states) %>%
  auk_complete()
filters
ebd_sed_filtered <- auk_filter(filters, file = "ebd_filteredPA.txt", file_sampling = "sampling_filteredPA.txt")
ebd_sed_filtered
read_ebd(ebd_sed_filtered)
A tibble: 1,070 × 48
read_ebd(f_ebd)
A tibble: 1,070 × 48
read_ebd(f_smp)
A tibble: 1,070 × 48
Here the data show 1,070 entries, and everything had worked up to this point.
ebd_zf <- auk_zerofill(ebd_sed_filtered)
ebd_zf
Zero-filled EBD: 1,096 unique checklists, for 322 species.
ebd_zf_df <- collapse_zerofill(ebd_zf)
class(ebd_zf_df)
ebd_zf_df
A tibble: 352,912 × 57
After collapse_zerofill(), each entry appears around 322 times. Following the other tutorial, "Best Practices for Using eBird Data", gives the same result: the entries are duplicated after this line:
zerofill <- auk_zerofill(observations, checklists, collapse = TRUE)
It also results in the same total number of entries: 352,912. Attempts to remove the duplicates in code are unsuccessful, for example:
unique.data.frame(zerofill)
unique.array(zerofill)
unique.matrix(zerofill)
("zerofill" is the name of the zero-filled dataset; none of these calls changes the number of rows.)
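For what it's worth, the counts reported above are consistent with zero-filling working as intended rather than with duplication: 1,096 checklists × 322 species = 352,912, which is one row per checklist-species pair. That would also explain why unique() changes nothing, since rows that share a checklist still differ in species. Below is a minimal base-R sketch of the idea on made-up toy data. It is not auk's implementation, and the column names `checklist_id`, `species`, and `species_observed` are assumptions chosen for illustration.

```r
# Toy data: 3 checklists and 2 species; only detections are recorded.
checklists <- data.frame(checklist_id = c("C1", "C2", "C3"))
detections <- data.frame(
  checklist_id = c("C1", "C1", "C3"),
  species = c("robin", "wren", "robin"),
  species_observed = TRUE
)

# Zero-fill: build every checklist x species pair, then mark the pairs
# that have no matching detection as not observed.
grid <- expand.grid(
  checklist_id = checklists$checklist_id,
  species = unique(detections$species),
  stringsAsFactors = FALSE
)
zf <- merge(grid, detections, by = c("checklist_id", "species"), all.x = TRUE)
zf$species_observed[is.na(zf$species_observed)] <- FALSE

nrow(zf)          # 6 = 3 checklists * 2 species
nrow(unique(zf))  # 6: no row is an exact duplicate of another

# One row per checklist only appears after restricting to a single species.
robin <- zf[zf$species == "robin", ]
nrow(robin)       # 3

# The same arithmetic for the counts reported above.
1096 * 322        # 352912
```

If a presence-absence table for one species is the goal, the usual route is to restrict to that species before or during zero-filling; as far as I recall from the auk documentation, auk_zerofill() takes a `species` argument for this, but it is worth checking against the installed version.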
Has anyone run into this issue, or does anyone know of a possible solution? Thanks!