cboettig / birddb

Import and query all of eBird locally
MIT License

partition the parquet files? #18

Open cboettig opened 2 years ago

cboettig commented 2 years ago

I'm running into memory (RAM) issues with both arrow & duckdb when working against a single large parquet file (either remotely, #17, or locally). Performance is considerably better if I partition the parquet files, e.g.

## Restart, reload libs, try again with the multi-parquet (partitioned) version of the data
library(arrow)
library(duckdb)
library(dplyr)

# public MinIO bucket hosting the partitioned copy of the data
server <- arrow::s3_bucket("ebird", endpoint_override = "minio.cirrus.carlboettiger.info")

path <- server$path("partitioned")
obs <- arrow::open_dataset(path)
obs$files                      # observe: multiple parquet shards
obs %>% count() %>% collect()  # builds the query lazily, then pulls the single-row result

This works just fine for me. (You should also be able to run the above directly, without any special credentials and without downloading any data.)
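For the duckdb side, the same partitioned dataset can, I believe, be handed to duckdb without materializing anything, via arrow's zero-copy bridge. A minimal sketch, re-using obs from the snippet above (the connection handling is just illustrative):

## Sketch: run the same query through duckdb's engine instead of arrow's
con <- DBI::dbConnect(duckdb::duckdb())       # in-memory duckdb connection
obs_duck <- arrow::to_duckdb(obs, con = con)  # register the Arrow Dataset as a virtual duckdb table
obs_duck %>% count()                          # dplyr verbs are translated to SQL and run in duckdb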

Note that this partitions the data by year, lumping all pre-1990 data together. I do wish there were a more obvious way to construct partitions; I suspect that with more, and more evenly sized, partitions things would get better still. But I believe arrow limits the total number of partition files to a few thousand, so we can't partition on, say, species, and in any case the data is structured in such a way that it is very hard to get evenly sized partitions. To use year, we first have to derive a year column from the dates, which is somewhat cumbersome (i.e. it seems to crash out of memory, even though it shouldn't).
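For concreteness, this is roughly the write I have in mind, as a sketch only: the observation_date column name and the single-file input path are assumptions, and in practice this step is exactly what tends to run out of memory:

## Sketch: derive a year column and write year-based partitions
library(arrow)
library(dplyr)

open_dataset("ebird_full.parquet") %>%                   # single large parquet file (path illustrative)
  mutate(year = year(observation_date)) %>%              # assumes observation_date is a date/timestamp column
  mutate(year = if_else(year < 1990L, 1990L, year)) %>%  # lump all pre-1990 records into one partition
  write_dataset("partitioned", format = "parquet", partitioning = "year")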

Anyway, I will try to streamline that process and maybe come up with better partitions, e.g. setting a fixed number of rows per partition instead of partitioning on a data column...
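If it helps, more recent arrow releases expose a max_rows_per_file argument to write_dataset(), which might give roughly even-sized shards without needing a partition column at all. An untested sketch; the paths and the 5-million-row target are placeholders:

## Sketch: shard purely by row count rather than by a data column
library(arrow)

obs <- open_dataset("ebird_full.parquet")    # path illustrative
write_dataset(obs, "partitioned_fixed",
              format = "parquet",
              max_rows_per_file = 5000000L)  # ~5M rows per parquet shard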