cboettig / birddb

Import and query all of eBird locally
MIT License
10 stars 2 forks source link

Edits from user feedback #9

Closed mstrimas closed 3 years ago

mstrimas commented 3 years ago

Benjamin Van Doren, one of the Lab of O postdocs, used the package and offered some feedback, which I've tried to incorporate here. Changes are:

  1. README.md needed updating to reflect the new function naming.
  2. sha256 hashing was extremely slow for the full EBD, he suggested a switch to crc32 hashing, which is apparently much faster and used for this purpose by TensorFlow.
  3. Ben was working with limited disk space so stored the EBD tar file and the output parquet file on an external drive, however, he wasn't able to use import_ebird() because the temp untarred file needed 100GB+ on his local drive. To address this, I've allow the temp directory to be specified as a parameter in import_ebird(). Another approach would be to use BIRDDB_HOME as the temp directory since that directory has presumably been chosen to have lots of space.
mstrimas commented 3 years ago

I'm seeing now in GitHub actions that arrow v6.0.0 breaks our tests, will investigate.

cboettig commented 3 years ago

@mstrimas I saw the issues on arrow too, I have a proposed patch on https://github.com/cboettig/birddb/tree/arrow-devel branch I should open as a PR. It simplifies the parsing logic but would like your feedback before merging

cboettig commented 3 years ago
mstrimas commented 3 years ago

Great, thanks! I'll take a look at your PR tomorrow, then revisit this PR once yours is merged.

mstrimas commented 3 years ago

I'll test these changes on the full dataset this morning.

I can't make sense of the macOS error in the GitHub actions check, I'm hoping maybe I'll go away on it's own next time I push 🤞

mstrimas commented 3 years ago

Re-imported the full August 2021 EBD, both checklists and observations, and everything looks good. Timings for a simple extraction:

library(birddb)
library(dplyr)
obs <- observations()
#> 60 seconds
obs_woothr <- obs %>% 
  filter(state_code == "US-NY", common_name == "Wood Thrush") %>% 
  collect()
#> 34 minutes

I think everything looks good here, apart from the macOS failure in the GitHub actions checks, which remains mysterious to me, especially since I'm testing locally on macOS. Any idea what's going on there?

cboettig commented 3 years ago

nice!

yeah, definitely a weird error on the Mac test -- you're using the same mechanism to install the nightly version of arrow? Might try installing the nightly using arrow::install_arrow(TRUE, FALSE, minimal = FALSE), to force a source install instead of a binary install? Otherwise yeah sounds like a not-our-problem kinda bug that should go away.

Alternately, is it necessary to test on the nightly version of arrow? Now that 6.0 is released, maybe we should stick with the CRAN test. (really it might make sense to cover both arrow-devel and arrow-release as options in the config matrix...)

Anyway, those CI issues I see as orthogonal, so I'm happy to merge this as is unless you want to take a crack at a quick fix like binary arrow install or CRAN arrow install?