CornellLabofOrnithology / ebird-best-practices

Best Practices for Using eBird Data
https://CornellLabOfOrnithology.github.io/ebird-best-practices/

Consider distributing ebird data in parquet format as well? #12

Open cboettig opened 3 years ago

cboettig commented 3 years ago

Dear eBird team,

Thanks for all you do for the research community! I greatly appreciate the good common sense in distributing eBird's complete monthly snapshots as a direct download using compressed plain text. This is an efficient and highly interoperable format. Nevertheless, the file's size poses challenges for some users, as illustrated by the somewhat awkward workaround in auk of shelling out to the system's awk to get around the intrinsic memory limitations of R. While this approach is immensely clever, with modern tools it is neither very convenient nor really necessary. If you provided the download as a parquet file, modern toolsets in R or any other major language could query the local file directly without ever reading it into memory, while also providing the much greater speed and flexibility of modern databases.

Consider the following minimal example in R:

For this demo I just read in the sample tsv file; this wouldn't be possible with the full record.

## Read the tab-separated sample into memory (quote = "" disables quote handling)
sample <- readr::read_tsv("https://minio.thelio.carlboettiger.info/shared-data/ebird/ebd_sample.txt", quote = "")
sample <- sample[-47] # drop what appears to be an extraneous trailing column
## Write the data back out in parquet format
file <- "ebd_sample.parquet"
arrow::write_parquet(sample, file)

Now we can use the modern toolchain of DBI + duckdb + dplyr to query the data without ever reading it all into memory:

library(DBI)
library(duckdb)
library(dplyr)

## Register the parquet file as a view in an in-memory duckdb database
tblname <- "ebd_sample"
conn <- DBI::dbConnect(duckdb::duckdb())
query <- paste0("CREATE VIEW '", tblname,
                "' AS SELECT * FROM parquet_scan('",
                file, "');")
DBI::dbExecute(conn, query)

## User creates a remote connection using `dplyr::tbl`
ebd <- tbl(conn, tblname)

## this object looks and feels like a native data.frame, and users can take advantage of any of the fast dplyr functions to work with the data, like so: 
ebd %>% count(`SCIENTIFIC NAME`)

Under the hood, dplyr simply translates R commands into SQL, and duckdb interprets the SQL to query the parquet file from disk using highly optimized database techniques. (duckdb also interfaces with most other major languages -- and, what could be better, it too has a bird in the name!)
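For instance, you can inspect the generated SQL for the count above with show_query():

## Show the SQL that dplyr generates for the count above
ebd %>% count(`SCIENTIFIC NAME`) %>% show_query()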

I strongly suggest eBird consider distributing a parquet version in addition to the plain text version for research downloads. Unlike plain text, parquet adds features like compression and native typing of dates and numbers, and, as you surely know, it is by now a well-recognized file standard.

mstrimas commented 3 years ago

Hey Carl, thanks for bringing this up, I couldn't agree more! Particularly as the size of this dataset has grown so much over the past few years, the plain text format we're using has become intractable.

One feature of this dataset that I've been thinking about as I consider alternative formats is the checklist structure. eBird data are submitted as checklists, with each checklist containing multiple observations of the bird species detected. The majority of the attributes are collected at the checklist level (location, date, time, effort, etc.) and only a small number are unique to the observations (species, count of individuals, etc.). There are ~60 million checklists and >1 billion observations, and in the current format all the checklist-level information gets duplicated for every observation, massively increasing the size of the dataset. Given all this, it seems like a relational format may make sense: one table for checklist data and one for observation data that can be joined as needed. I'm just not sure whether the decrease in file size warrants the increased computational cost of performing the join on two large tables.

Wondering if you have any thoughts here. Would something like SQLite make sense for these data? I wasn't aware parquet files could be queried with SQL; does that extend to being able to perform joins on two parquet files/tables? I'd never heard of duckdb before; maybe storing two tables in a duckdb file makes more sense than SQLite (certainly the bird-themed name is appropriate)?

cboettig commented 3 years ago

Hey @mstrimas, thanks for the reply, and thanks for your great work on auk. I hope this didn't come off as too critical; honestly auk is immensely clever stuff, and as someone deep in the data you know the challenges well.

Yeah, so duckdb is kind of a modern successor to SQLite: like SQLite, it's serverless, so it's easy to set up. But it has more complete support for SQL (e.g. windowed operations) and is considerably faster -- faster than server-based systems like postgres, and even faster than some in-memory operations in pure dplyr; see https://doi.org/10.1145/3299869.3320212.
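For instance, reusing the duckdb connection and parquet-backed view from my example above, a windowed query is just plain SQL (a quick sketch; the column name is the one from the sample file):

## A windowed aggregation run directly against the parquet-backed view
DBI::dbGetQuery(conn, '
  SELECT "SCIENTIFIC NAME",
         COUNT(*) OVER (PARTITION BY "SCIENTIFIC NAME") AS n_records
  FROM ebd_sample
  LIMIT 10
')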

Support for the parquet format is a recent addition to duckdb. Standard parquet libraries like arrow can already do a subset of basic SQL operations (essentially filter and select), but most databases (including duckdb and SQLite) have always had their own binary format for on-disk storage. This recent addition essentially lets duckdb treat parquet as its on-disk storage, without requiring import into its own format; see https://duckdb.org/docs/data/parquet.
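And to your question about joins: since parquet_scan just exposes each file as a table, duckdb can join across two parquet files in a single query. A minimal sketch, with hypothetical checklists.parquet / observations.parquet files sharing a hypothetical checklist_id key:

## Join two (hypothetical) parquet files directly, no import step required
DBI::dbGetQuery(conn, "
  SELECT obs.*, chk.*
  FROM parquet_scan('observations.parquet') AS obs
  JOIN parquet_scan('checklists.parquet') AS chk
    ON obs.checklist_id = chk.checklist_id
  LIMIT 10
")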

But better yet, I remembered that arrow already supports streaming from text to parquet. So in fact we should already be able to throw together a small package that uses arrow to do the one-time translation from txt to parquet, and then uses duckdb to enable efficient queries of the data.
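Something like the following should do the conversion step (a rough sketch: the file name is made up, the real EBD would likely need extra parsing options, e.g. for quoting, and the exact arguments vary a bit by arrow version):

library(arrow)
## Open the full tab-separated EBD lazily -- nothing is read into memory yet
ebd_txt <- open_dataset("ebd_full.txt", format = "tsv")
## Stream the records out to parquet in batches
write_dataset(ebd_txt, "ebd-parquet", format = "parquet")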

Like you say, this opens up a lot of other options as well, including a relational data structure with multiple tables. We could do some benchmarks to see how much space breaking out the checklists would actually save. Parquet files can also be 'sharded', so a collection of files can be treated as a single table. This is particularly useful for database updates, since new data could potentially be added without having to overwrite the old data (unless existing records receive corrections?).
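For example, the conversion sketched above could write a partitioned, multi-file dataset instead of a single parquet file, and readers would still see one table (the 'year' column here is hypothetical -- it would have to be derived from the observation date):

## Write one parquet file per partition, e.g. per year
write_dataset(ebd_txt, "ebd-parquet", format = "parquet",
              partitioning = "year")
## The whole directory can then be opened as a single arrow dataset...
ds <- open_dataset("ebd-parquet")
## ...or queried from duckdb with a glob, e.g. parquet_scan('ebd-parquet/*/*.parquet')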