Open mstrimas opened 2 years ago
Thanks @mstrimas , this is excellent as always!
Rolling up to species makes sense as you describe, as does handling the "X" observed only case. I gather both incomplete checklists and genus-level-only identifications are just filtered/dropped from the data in this context? (Arguably some users may want to 'roll up' to a higher-taxonomy level than species, which I suppose could be supported as well, at least to Genus).
I think I follow the process for the actual join...
A few notes as I digest this:
dbplyr
can translate some tidyr verbs to SQL, including complete()
: https://dbplyr.tidyverse.org/reference/index.html (though I need to confirm that is available in duckdb -- it should be, but the dbplyr team may not have written the binding yet)...
With the dbplyr backend, the SQL queries are executed "lazily". I think this has important ramifications for the spatial sub-setting issue you mention -- i.e. it should be possible to generate the View, which is just (if I understand correctly), the SQL code for the fill, but not execute it. At least in theory, if a user adds some spatial filter to the data first with additional dplyr commands, the SQL engine (i.e. duckdb) is supposed to be smart enough to execute the filters first, and then fill on the fly. (In any event we could design a function call to enforce that order, but it does leave open the risk that a user tries a very large spatial area and things will presumably go pear-shaped very very fast!
I haven't really wrapped my head around what this looks like in a "real use case". (i.e. I definitely see why I need to join in the effort data from sampling_event_id's found in the checklists table that saw no birds at all, and thus aren't in the observations table. But I'm trying to wrap my head around if I really need to complete()
for explicit zeros, somehow I feel like I should be able to capture that aspect in the modeling, but I'm probably confused).
Good call on versioning, we can probably enforce that.
Minor thing I noticed in your example code: I think you want an na.rm = TRUE
in summarize(observation_count = sum(observation_count)
, though when dplyr is using a SQL backend that is implied. You don't have any "X" counts in the yellow-rumped warblers, hence the total isn't just NA for that case.
The missing na.rm = TRUE
is intentional here, if there's an X for any taxa rolling up to species we give an X to the whole species since we don't know what the actual count is. Of course, that is a choice, we could also include the na.rm = TRUE
with the understanding that count is then more of a lower limit, but that's not typically what we've suggested as the best practice. This isn't the best example since there are no NA counts for this species as you point out....
I didn't realize some of the tidyr
verbs had been implemented as well, that's useful! The SQL generated by complete is pretty convoluted, I'll be curious to see how it performs in practice on a large dataset:
library(dbplyr)
library(tidyr)
df <- memdb_frame(
group = c(1:2, 1),
item_id = c(1:2, 2),
item_name = c("a", "b", "b"),
value1 = 1:3,
value2 = 4:6
)
df %>% complete(group, nesting(item_id, item_name)) %>% show_query()
Assuming users properly subset the data spatially and taxonomically, hopefully this will be feasible...
If you just have a single species you don't need complete()
but as soon as you have multiple the join won't generate all the combinations, e.g. (putting this together quick, hopefully I didn't screw this up):
library(dplyr)
library(tidyr)
checklists <- tibble(checklist_id = c("a", "b", "c"))
observations <- tibble(checklist_id = c("a", "b", "b"),
species = c("x", "x", "y"),
count = c(1, 2, 1))
# one species works, just need to replace NAs
left_join(checklists, observations %>%
filter(species == "y"))
# but including two leads to missing cases
left_join(checklists, observations)
# completing the missing cases
inner_join(checklists, observations) %>%
complete(checklist_id = unique(checklists$checklist_id),
species,
fill = list(count = 0))
Moving this discussion here since #15 is unrelated as initially reported.
Zero-filling is based on the concept on a "complete checklist", which means it only applies to checklists where
all_species_reported == 1
. In addition, only species can be zero-filled. For example, Yellow-rumped Warbler can be zero-filled, but taxa below species (e.g. "Yellow-rumped Warbler (Myrtle)") and taxa reported above species (e.g. "Warbler sp.") cannot be zero-filled. Scientific names corresponding to species can be identified as those in the tableauk::ebird_taxonomy
withcategory == "species"
; thiscategory
column also appears in the EBD.Rolling up taxa below species: eBirders can report taxa below the species level and, in many cases, a checklist can have multiple taxa all corresponding to the same species, e.g.
All of these observations need to be rollup up to the species level either prior to or immediately after zero-filling since the complete checklist concept only applies to species. The typical way to do this is to group by
scientific_name
orcommon_name
and sumobservation_count
. However, one catch is thatobservation_count
can be "X" indicating the bird was detected, but no count was recorded. Typically we address this by breakingobservation_count
into two variables indicating detection and count, respectively.However, we could also just retain a single column and use NA to indicate detection but no count, I've just found sometimes that's confusing for users.
Once taxa below species have been dealt with, you can zero fill a single species by left-joining to the checklist data and replacing NAs with 0s. For multiple species, unless there's some fancy way to do this in SQL you either need to loop through the species and left join one at a time or left join then use something like
tidyr::complete()
to fill in the missing cases.A few additional notes that come to mind:
ebd_relDec-2021.txt
andebd_sampling_relDec-2021.txt
.