cboettig / birddb

Import and query all of eBird locally
MIT License

Remove CRAN download instructions from README #15

Open trashbirdecology opened 2 years ago

trashbirdecology commented 2 years ago

Package not currently available via CRAN. Suggest removing from README.md.

cboettig commented 2 years ago

@trashbirdecology thanks! Should probably add a big 'experimental' badge too.... we have a few more kinks to work out I think, but if you do give this a whirl, feedback would be so helpful!

I also have a few even more experimental ideas that may speed the process up further, possibly skipping the download/import process....

trashbirdecology commented 2 years ago

I've been toying around with how to most efficiently import and munge the sampling and EBD data for.....longer than I care to admit. I've settled on vroom and dplyr, mostly because I don't want to learn data.table. I also tried the dtplyr package, which provides a data.table backend for dplyr, but it often crashes my sessions when trying to convert the entire sampling data frame with dtplyr::lazy_dt().
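
For reference, the dtplyr attempt looked roughly like this (file name is a placeholder) -- the lazy_dt() conversion of the full sampling file is where my sessions tend to die:

library(vroom)
library(dtplyr)
library(dplyr)

# read the full sampling events file, then hand it to the data.table backend
sampling <- vroom::vroom("ebd_sampling_relSep-2021.txt")

sampling %>%
  lazy_dt() %>%                         # conversion step that crashes on the full file
  filter(`ALL SPECIES REPORTED` == 1) %>%
  count(`PROTOCOL TYPE`) %>%
  as_tibble()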

I gave birddb a shot this morning but gave up when I couldn't figure out why import_ebird wouldn't accept my file paths. Looking into the function code didn't provide any insight. No reprex right now because I was frustrated and quit xD. I THINK it might be because the TAR was unpacked some time ago; I saw a note suggesting that's a no-no. eBird nerfed the download rates, so there's no way I'm going to download everything again just to find the error here xD
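
For context, this is roughly the call I was attempting (path hypothetical, and I may well be holding it wrong -- I'm guessing import_ebird() wants the path to the original, un-extracted tar archive):

library(birddb)

# hypothetical path; assumes the function takes the original .tar, not the unpacked files
import_ebird("ebd/ebd_relSep-2021.tar")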

trashbirdecology commented 2 years ago

FWIW, which may be nothing, here's what I've resorted to for importing and filtering the EBD (either the entire EBD or the species-state files). Still pretty inefficient, but it's the best/most feasible process I've found for my use case.

Very much looking forward to seeing what you two come up with, because this data handling process is something else...

#' @title Create and Write or Load In the Filtered eBird Data
#' @description Filter the eBird observations (EBD) and sampling events data using vroom and dplyr.
#' @param fns.ebird Character vector of file paths for the original EBD and sampling events files to import.
#' @param dir.ebird.out Directory in which to save and look for the filtered/subsetted data.
#'   Output file names are appended with paste0(), so include a trailing slash.
#' @param countries Optional character vector of countries to keep (default NULL keeps all).
#' @param states Optional character vector of states/provinces to keep (default NULL keeps all).
#' @param complete.only Logical. If TRUE, keep only complete checklists (all species reported).
#' @param protocol Character vector of protocol types to keep.
#' @param species Common name(s) of the species to keep in the observations data.
#' @param overwrite Logical. If TRUE, overwrite existing filtered data objects in dir.ebird.out.
#' @param remove.bbs.obs Logical. If TRUE, drop BBS-style checklists (3-minute stationary counts).
#' @import dplyr vroom
#' @importFrom stringr str_detect
filter_ebird_data <-
  function(fns.ebird,
           dir.ebird.out,
           countries = NULL,
           states = NULL,
           complete.only = TRUE,
           protocol = c("Traveling", "Stationary"),
           species = "Double-crested Cormorant",
           overwrite = FALSE,
           remove.bbs.obs = TRUE
           ) {

    f_samp_in <- fns.ebird[str_detect(fns.ebird, "sampling_rel")]
    f_obs_in  <- setdiff(fns.ebird, f_samp_in)
    if (length(f_obs_in) == 0)
      stop("No ebd file identified.")
    if (length(f_samp_in) == 0)
      stop("No sampling file identified.")

    # Make filenames for output (not using RDS for sampling data because too large...)
    f_obs_out <- paste0(dir.ebird.out, 'ebird_obs_filtered.rds')
    f_samp_out  <- paste0(dir.ebird.out, 'ebird_samp_filtered.txt')

    # Specifying the column types helps vroom::vroom(f_samp_in), which takes a couple of minutes;
    # names must match the raw EBD sampling file header (all uppercase)
    cols_samp <- list(
        `LAST EDITED DATE` = col_datetime(),
        COUNTRY = col_character(),
        `COUNTRY CODE` = col_character(),
        STATE = col_character(),
        `STATE CODE` = col_character(),
        COUNTY = col_character(),
        `COUNTY CODE` = col_character(),
        `IBA CODE` = col_character(),
        `BCR CODE` = col_double(),
        `USFWS CODE` = col_character(),
        `ATLAS BLOCK` = col_character(),
        LOCALITY = col_character(),
        `LOCALITY ID` = col_character(),
        `LOCALITY TYPE` = col_character(),
        LATITUDE = col_double(),
        LONGITUDE = col_double(),
        `OBSERVATION DATE` = col_date(),
        `TIME OBSERVATIONS STARTED` = col_time(),
        `OBSERVER ID` = col_character(),
        `SAMPLING EVENT IDENTIFIER` = col_character(),
        `PROTOCOL TYPE` = col_character(),
        `PROTOCOL CODE` = col_character(),
        `PROJECT CODE` = col_character(),
        `DURATION MINUTES` = col_double(),
        `EFFORT DISTANCE KM` = col_double(),
        `EFFORT AREA HA` = col_double(),
        `NUMBER OBSERVERS` = col_double(),
        `ALL SPECIES REPORTED` = col_double(),
        `GROUP IDENTIFIER` = col_character(),
        `TRIP COMMENTS` = col_character()
        )

    ## Read in / filter sampling data frame
    if (file.exists(f_samp_out) & !overwrite) {
      sampling <-
        vroom::vroom(f_samp_out)
    } else {
      cat("Importing the eBird sampling events data. This may take a minute.\n")
      # sampling <- data.table::fread(f_samp_in)
      sampling <- vroom::vroom(f_samp_in, col_types = cols_samp)
      # force column names to lowercase so the filters below match
      names(sampling) <- tolower(names(sampling))

      cat("Filtering sampling events. This takes a minute.\n")
      # each condition falls back to TRUE (keep all rows) when the corresponding argument is unused
      sampling <- sampling %>%
        filter(if (complete.only) `all species reported` %in% c("TRUE", "True", 1) else TRUE) %>%
        filter(if (!is.null(protocol)) `protocol type` %in% protocol else TRUE)
      gc()
      sampling <- sampling %>% # breaking this up to try to help with memory issues
        filter(if (!is.null(countries)) country %in% countries else TRUE) %>%
        filter(if (!is.null(states)) state %in% states else TRUE)
      gc()
      # remove BBS-style observations (3-minute stationary counts) if specified
      if (remove.bbs.obs) {
        sampling <- sampling %>%
          filter(!(`protocol type` == "Stationary" & `duration minutes` == 3))
      }

      ## write the filtered sampling data
      vroom::vroom_write(sampling, f_samp_out)
      }

    ## Read in / filter observations data frame
    if (file.exists(f_obs_out) & !overwrite) {
      observations <- readRDS(f_obs_out)
    } else {
      observations <- vroom::vroom(f_obs_in)
      # force column names to lowercase so the filters below match
      names(observations) <- tolower(names(observations))
      observations <- observations %>%
        filter(if (!is.null(species)) `common name` %in% species else TRUE) %>%
        filter(if (!is.null(countries)) country %in% countries else TRUE) %>%
        filter(if (!is.null(states)) state %in% states else TRUE) %>%
        filter(if (!is.null(protocol)) `protocol type` %in% protocol else TRUE)
      observations <- observations %>%
        filter(if (complete.only) `sampling event identifier` %in% sampling$`sampling event identifier` else TRUE) # keep only checklists retained in the filtered sampling data
      saveRDS(observations, f_obs_out)
    }
  ebird_filtered <- list("observations" = observations,
                         "sampling" = sampling)

  rm(observations, sampling)

  return(ebird_filtered)

  } # END FUNCTION
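
A hypothetical call, with placeholder file names for a local EBD release (note dir.ebird.out should end in a slash, since the output paths are built with paste0()):

fns <- c("ebd_relSep-2021.txt", "ebd_sampling_relSep-2021.txt")
ebird_filtered <- filter_ebird_data(
  fns.ebird     = fns,
  dir.ebird.out = "data/ebird/",
  states        = "Nebraska",
  species       = "Double-crested Cormorant"
)
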
trashbirdecology commented 2 years ago

Although the filtering using auk isn't the real bottleneck; it's the zero-filling process.

Here's my simple attempt at zero-filling. Still untested, as I'm still writing these.


#' Create a Zero-filled Data Object for eBird Observations
#'
#' Creates a zero-filled data object from the eBird observations and sampling events data supplied.
#'
#' @param myList A list containing two named data frames, c("observations", "sampling"); the result of \code{filter_ebird_data()}.
#' @param keep.orig Logical. If FALSE, the function's local copy of myList is removed before returning.
#' @param cols.remove A vector of column names to exclude from the output (capitalization does not matter; names are lowercased internally).
#' @export
zerofill_ebird <-
  function(myList,
           keep.orig = FALSE,
           cols.remove = c(
             "SUBSPECIES COMMON NAME",
             "TAXONOMIC ORDER",
             "LAST EDITED DATE",
             "CATEGORY",
             "APPROVED",
             "REVIEWED",
             "SPECIES COMMENTS",
             "...48",
             "HAS MEDIA",
             "REASON",
             "TRIP COMMENTS"
           )) {
    # Force columns to lowercase
      colnames(myList$sampling) <- tolower(colnames(myList$sampling))
      colnames(myList$observations) <- tolower(colnames(myList$observations))
      message("column names forced to lowercase.")
      cols.remove <- tolower(cols.remove)

    # First remove the unwanted columns
    myList$observations <-
      myList$observations[!names(myList$observations) %in% cols.remove]
    myList$sampling     <-
      myList$sampling[!names(myList$sampling) %in% cols.remove]

    # Add species and count columns so the sampling events can supply the zeroes
    # (assumes a single focal species, so unique() returns a length-1 value)
    myList$sampling$`common name` <- unique(myList$observations$`common name`)
    myList$sampling$`observation count` <- 0L
    myList$observations$`observation count` <- as.integer(myList$observations$`observation count`)

    # Stack the non-detection sampling events under the observations to zero-fill.
    # (A plain full_join() on all shared columns would leave a spurious zero row for
    # checklists where the species was detected, because `observation count` differs.)
    ebird_zf <-
      bind_rows(myList$observations,
                anti_join(myList$sampling, myList$observations,
                          by = "sampling event identifier"))

    # Create julian day and year variables
    ebird_zf <-
      ebird_zf %>% mutate(
        julian = lubridate::yday(`observation date`),
        year = lubridate::year(`observation date`)
      )

    # Remove original data object
    if(!keep.orig){rm(myList)}

    return(ebird_zf)
  }
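
Chaining the two together (continuing the hypothetical example above):

ebird_zf <- zerofill_ebird(ebird_filtered)

# quick sanity check: non-detection checklists should carry a zero count
table(ebird_zf$`observation count` == 0, useNA = "ifany")
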
cboettig commented 2 years ago

thanks for sharing, this is great. Yeah, doing the full join for the zero-fill is definitely top of our list; something we want to pre-compute maybe during the import process and/or even upstream of the package. I'll take a closer look at this workflow and let you know how it goes. @mstrimas is a collaborator here and a huge help on this so far.

trashbirdecology commented 2 years ago

Yeah, they pointed me here and I've been bugging the shit out of them over at auk xD

mstrimas commented 2 years ago

@trashbirdecology I should have clarified when I pointed you here: at the moment this is mostly for filtering the data and importing it into R; birddb doesn't really address your zero-filling issue, which I'm sure you've noticed already...

@cboettig regarding pre-computing these joins, an important point to remember is that in almost all cases users will not want the simple left join here. What's really needed is to roll subspecies up to the species level, then perform the left join. There's also another potential issue with shared checklists where one user saw the species and another didn't. These corrections could also happen after zero-filling, I think, but it's critical that, for example, users don't get a 0 for Yellow-rumped Warbler when there's a Myrtle Warbler on the list.
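
In dplyr terms, the order of operations might look roughly like this for a single focal species (a sketch only: it assumes counts are already numeric, i.e. "X" records have been handled, and that subspecies rows carry the species-level name in `common name`, as in the EBD):

library(dplyr)

# roll subspecies/forms up to the species level within each checklist
obs_species <- observations %>%
  group_by(`sampling event identifier`, `common name`) %>%
  summarise(`observation count` = sum(as.integer(`observation count`)),
            .groups = "drop")

# zero-fill only after the roll-up, so a Myrtle Warbler record still yields a
# non-zero Yellow-rumped Warbler count
zf <- sampling %>%
  left_join(obs_species, by = "sampling event identifier") %>%
  mutate(`observation count` = coalesce(`observation count`, 0L))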

trashbirdecology commented 2 years ago

@mstrimas no worries. What is the best practice over at CLO for handling partial-group non-detection of a species? Do you default to saying that if one person saw it, it existed?

trashbirdecology commented 2 years ago

Also, @cboettig -- you are probably already aware of this, but the parquet/arrow approach is probably preferred; vroom often crashes when trying to import the entire sampling data object. You've likely already seen this, but for posterity: https://stackoverflow.com/questions/68628271/partially-read-really-large-csv-gz-in-r-using-vroom
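
For example, something along these lines instead of vroom (paths hypothetical): convert the sampling file to parquet once, then filter lazily without pulling it all into RAM.

library(arrow)
library(dplyr)

# scan the tab-delimited sampling file as a dataset and write it out as parquet
samp <- open_dataset("ebd_sampling_relSep-2021.txt", format = "tsv")
write_dataset(samp, "parquet/sampling", format = "parquet")

# subsequent queries run against the parquet copy and only read what they need
open_dataset("parquet/sampling") %>%
  filter(STATE == "Nebraska", `ALL SPECIES REPORTED` == 1) %>%
  collect()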

mstrimas commented 2 years ago

Correct, the best practice for group checklists is to collapse them down to a single checklist consisting of all the species seen by anyone in the group. This is the approach taken in auk::auk_unique().
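
Outside of auk, a rough dplyr approximation of that collapse would be something like this (a sketch only; auk_unique() does more bookkeeping, e.g. recording which checklists were merged):

library(dplyr)

observations %>%
  # treat non-shared checklists as their own "group"
  mutate(group_id = if_else(is.na(`group identifier`) | `group identifier` == "",
                            `sampling event identifier`, `group identifier`)) %>%
  # keep one record per species per group, taken from the lowest checklist id
  group_by(group_id, `common name`) %>%
  arrange(`sampling event identifier`, .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()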

cboettig commented 2 years ago

Thanks @mstrimas, I was wondering about these same issues; you must have read my mind. Right, totally agree these data pre-processing steps are critical and also subtle, though I think it's natural to express each of the pre-processing algorithms in SQL (or equivalently in dplyr verbs that translate themselves to SQL) and have them run on the duckdb backend, e.g. at import. We can certainly summarise over the sub-species before doing the left join. I think you mentioned you do this in SQL internally; do you have the SQL commands for each operation you can share? (no worries, I can probably figure it out anyway ... just waiting for my October snapshot to finish importing....)

@trashbirdecology definitely arrow / duckdb / parquet all the way here, no vroom involved. Currently import_ebird() extracts the csv from the tar and uses arrow to write parquet, which we then read with duckdb. In theory at least this should keep all the data 'on disk' with a minimal RAM footprint, avoiding crashes. According to benchmarks, duckdb should be considerably faster even than native dplyr (at least with good SSD hard drives), which is saying something, since disk-based operations are inherently slower than RAM. Also, it opens the door for us to do other forms of remote filesystem access (https://arrow.apache.org/docs/r/articles/fs.html) -- more on that to come.
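
Once the parquet files are on disk, querying them from R looks something like this (a sketch only -- paths are hypothetical and the column names here follow the raw EBD header, which may not match what birddb actually writes):

library(DBI)
library(duckdb)
library(dplyr)
library(dbplyr)

con <- dbConnect(duckdb::duckdb())

# lazily scan the parquet files written at import; dplyr verbs translate to SQL
obs <- tbl(con, sql("SELECT * FROM read_parquet('parquet/observations/*.parquet')"))

obs %>%
  filter(`COMMON NAME` == "Double-crested Cormorant") %>%
  count(STATE) %>%
  collect()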

mstrimas commented 2 years ago

Moving this to #16; I think this issue can be closed now.

trashbirdecology commented 2 years ago

@cboettig I've made public the repo I'm working in to import and munge the eBird data (and integrate it with BBS).

It's not sophisticated and doesn't treat eBird as a database -- it imports directly into R -- but I figured I'd share now that it's public.

https://github.com/trashbirdecology/dubcorms