Add ability to zero-fill data

mstrimas commented 2 years ago

Moving this discussion here since #15 is unrelated as initially reported.

Zero-filling is based on the concept on a "complete checklist", which means it only applies to checklists where all_species_reported == 1. In addition, only species can be zero-filled. For example, Yellow-rumped Warbler can be zero-filled, but taxa below species (e.g. "Yellow-rumped Warbler (Myrtle)") and taxa reported above species (e.g. "Warbler sp.") cannot be zero-filled. Scientific names corresponding to species can be identified as those in the table auk::ebird_taxonomy with category == "species"; this category column also appears in the EBD.

Rolling up taxa below species: eBirders can report taxa below the species level and, in many cases, a checklist can have multiple taxa all corresponding to the same species, e.g.

library(auk)
library(dplyr)
system.file("extdata/ebd-rollup-ex.txt", package = "auk") %>% 
  read_ebd(rollup = FALSE) %>% 
  filter(common_name == "Yellow-rumped Warbler") %>% 
  select(checklist_id, common_name, subspecies_common_name, observation_count)
#   checklist_id common_name           subspecies_common_name            observation_count
# 1 S8949407     Yellow-rumped Warbler NA                                1                
# 2 S8949407     Yellow-rumped Warbler Yellow-rumped Warbler (Myrtle)    13               
# 3 S8949407     Yellow-rumped Warbler Yellow-rumped Warbler (Audubon's) 17

All of these observations need to be rollup up to the species level either prior to or immediately after zero-filling since the complete checklist concept only applies to species. The typical way to do this is to group by scientific_name or common_name and sum observation_count. However, one catch is that observation_count can be "X" indicating the bird was detected, but no count was recorded. Typically we address this by breaking observation_count into two variables indicating detection and count, respectively.

library(auk)
library(dplyr)
system.file("extdata/ebd-rollup-ex.txt", package = "auk") %>% 
  read_ebd(rollup = FALSE) %>% 
  mutate(observation_count = recode(observation_count, "X" = NA_character_),
         observation_count = as.integer(observation_count)) %>% 
  group_by(checklist_id, common_name) %>% 
  summarize(observation_count = sum(observation_count),
            species_detected = any(is.na(observation_count) | observation_count > 0), 
            .groups = "drop") %>% 
  filter(common_name == "Yellow-rumped Warbler")
#   checklist_id common_name           observation_count species_detected
# 1 S8949407     Yellow-rumped Warbler                31 TRUE

However, we could also just retain a single column and use NA to indicate detection but no count, I've just found sometimes that's confusing for users.

Once taxa below species have been dealt with, you can zero fill a single species by left-joining to the checklist data and replacing NAs with 0s. For multiple species, unless there's some fancy way to do this in SQL you either need to loop through the species and left join one at a time or left join then use something like tidyr::complete() to fill in the missing cases.

A few additional notes that come to mind:

We need to find some way to ensure that checklists have been filtered to only complete checklists prior to zero-filling.
It typically only makes sense to zero-fill one or a set of target species for a specific region. If you try to zero fill all species you'll end up with a massive dataset telling you, for example, that Kiwis don't occur in the Bahamas. Many users forget to subset to a geographic region or to a set of species before trying to zero fill, which results in a massive dataset and very long run times; this is rarely, if ever, what they want. With the exception of global species, like Rock Pigeon or Osprey, users almost always want to subset geographically to roughly the range of their target species.
It's absolutely critical when zero-filling that the exact same population of checklists occurs in both the checklist and observation data. This means that the same set of temporal and geographic filters need to be applied to both datasets and that zero-filing should only every occur when the two datasets are exactly the same version, e.g. ebd_relDec-2021.txt and ebd_sampling_relDec-2021.txt.

cboettig commented 2 years ago

Thanks @mstrimas , this is excellent as always!

Rolling up to species makes sense as you describe, as does handling the "X" observed only case. I gather both incomplete checklists and genus-level-only identifications are just filtered/dropped from the data in this context? (Arguably some users may want to 'roll up' to a higher-taxonomy level than species, which I suppose could be supported as well, at least to Genus).

I think I follow the process for the actual join...

A few notes as I digest this:

dbplyr can translate some tidyr verbs to SQL, including complete() : https://dbplyr.tidyverse.org/reference/index.html (though I need to confirm that is available in duckdb -- it should be, but the dbplyr team may not have written the binding yet)...
With the dbplyr backend, the SQL queries are executed "lazily". I think this has important ramifications for the spatial sub-setting issue you mention -- i.e. it should be possible to generate the View, which is just (if I understand correctly), the SQL code for the fill, but not execute it. At least in theory, if a user adds some spatial filter to the data first with additional dplyr commands, the SQL engine (i.e. duckdb) is supposed to be smart enough to execute the filters first, and then fill on the fly. (In any event we could design a function call to enforce that order, but it does leave open the risk that a user tries a very large spatial area and things will presumably go pear-shaped very very fast!

I haven't really wrapped my head around what this looks like in a "real use case". (i.e. I definitely see why I need to join in the effort data from sampling_event_id's found in the checklists table that saw no birds at all, and thus aren't in the observations table. But I'm trying to wrap my head around if I really need to complete() for explicit zeros, somehow I feel like I should be able to capture that aspect in the modeling, but I'm probably confused).

Good call on versioning, we can probably enforce that.

Minor thing I noticed in your example code: I think you want an na.rm = TRUE in summarize(observation_count = sum(observation_count), though when dplyr is using a SQL backend that is implied. You don't have any "X" counts in the yellow-rumped warblers, hence the total isn't just NA for that case.

mstrimas commented 2 years ago

The missing na.rm = TRUE is intentional here, if there's an X for any taxa rolling up to species we give an X to the whole species since we don't know what the actual count is. Of course, that is a choice, we could also include the na.rm = TRUE with the understanding that count is then more of a lower limit, but that's not typically what we've suggested as the best practice. This isn't the best example since there are no NA counts for this species as you point out....

I didn't realize some of the tidyr verbs had been implemented as well, that's useful! The SQL generated by complete is pretty convoluted, I'll be curious to see how it performs in practice on a large dataset:

library(dbplyr)
library(tidyr)
df <- memdb_frame(
    group = c(1:2, 1),
    item_id = c(1:2, 2),
    item_name = c("a", "b", "b"),
    value1 = 1:3,
    value2 = 4:6
)
df %>% complete(group, nesting(item_id, item_name)) %>% show_query()

Assuming users properly subset the data spatially and taxonomically, hopefully this will be feasible...

If you just have a single species you don't need complete() but as soon as you have multiple the join won't generate all the combinations, e.g. (putting this together quick, hopefully I didn't screw this up):

library(dplyr)
library(tidyr)

checklists <- tibble(checklist_id = c("a", "b", "c"))
observations <- tibble(checklist_id = c("a", "b", "b"),
                       species = c("x", "x", "y"),
                       count = c(1, 2, 1))
# one species works, just need to replace NAs
left_join(checklists, observations %>% 
            filter(species == "y"))
# but including two leads to missing cases
left_join(checklists, observations)
# completing the missing cases
inner_join(checklists, observations) %>% 
  complete(checklist_id = unique(checklists$checklist_id), 
           species, 
           fill = list(count = 0))

cboettig / birddb

Add ability to zero-fill data #16