cboettig / birddb

Import and query all of eBird locally

"cloud-native" access option #12


cboettig commented 2 years ago

Our current model is built around first downloading the full dataset and then importing it to parquet, from which we query it with duckdb. As nice as this is, modern "cloud-native" data storage APIs make it possible to skip the download step entirely. Check this out:

## My public MinIO bucket
library(arrow)
library(dplyr)
cirrus <- s3_bucket("shared-data",
  scheme = "https",
  endpoint_override = "minio.cirrus.carlboettiger.info"
)
path <- cirrus$path("birddb/parquet")

## Create a remote connection to the parquet data
ebird <- open_dataset(path)

## Operate directly on the remote data
orders <- ebird %>%
  to_duckdb() %>%
  count(`TAXONOMIC ORDER`) %>%
  collect()

This requires no disk space on the user's side -- operations that would otherwise hit a local disk are performed over the network. Obviously that adds significant latency compared to querying data on your local hard disk, but in contexts where disk space is limited, or for one-off use, this could be significantly faster and easier than having to first download the full dataset. It feels just like querying a remote SQL server, but it isn't -- the computation is done on the user's machine; the data host is just a static file server with an S3-compatible API (MinIO, an open-source object store).
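
To make the "computation happens client-side" point concrete, here is a minimal sketch of a filtered query against the same `ebird` dataset object from above; the column selection and filter are pushed down to the parquet scan, so in principle only the matching row groups travel over the network (`COMMON NAME` is an assumed eBird Basic Dataset column name, which may differ in practice):

## Only the bytes needed to answer the query are fetched remotely;
## `COMMON NAME` is an assumed column name
ebird %>%
  to_duckdb() %>%
  filter(`COMMON NAME` == "Canada Goose") %>%
  count(`TAXONOMIC ORDER`) %>%
  collect()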

Further, in some contexts this can be almost as fast as reading parquet files from the local disk -- i.e., when the compute and the data host are on the same network. For instance, GBIF now publishes parquet snapshots to the AWS Open Data Registry, https://registry.opendata.aws/gbif/, so a user can simply match the region of their EC2 instance to the S3 bucket and run such queries directly. The EC2 instance wouldn't need much local hard-disk storage, and the operations would be almost as fast.
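
As a minimal sketch of that workflow (run from an EC2 instance in us-east-1; the snapshot path is the one that appears later in this thread, and `anonymous = TRUE` for unauthenticated access to the public bucket is an assumption):

library(arrow)
library(dplyr)

## Public GBIF bucket in the same region as the EC2 instance
gbif_bucket <- s3_bucket("gbif-open-data-us-east-1", anonymous = TRUE)
gbif <- open_dataset(gbif_bucket$path("occurrence/2021-04-13/occurrence.parquet"))

## The query runs on the EC2 instance; only the needed bytes cross the
## (intra-region) network. `kingdom` is an assumed snapshot column name.
gbif %>%
  to_duckdb() %>%
  count(kingdom) %>%
  collect()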

I think we want to keep the option to download the data locally, since for intensive analyses that can offer a considerable speed-up, but it would be extra compelling to get the eBird team to consider publishing releases to providers like the AWS Open Data Registry. birddb could contain a bit of logic to let a user opt in to 'cloud-native' data access instead of the local database.
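
For illustration only, that opt-in could look something like this (`ebird_remote()` and its arguments are invented names, not an existing birddb function):

## Hypothetical convenience wrapper for birddb; all names are illustrative
ebird_remote <- function(bucket = "shared-data",
                         path = "birddb/parquet",
                         endpoint = "minio.cirrus.carlboettiger.info") {
  server <- arrow::s3_bucket(bucket, scheme = "https",
                             endpoint_override = endpoint)
  arrow::open_dataset(server$path(path))
}

## Drop-in replacement for the locally imported database
ebird <- ebird_remote()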

mstrimas commented 2 years ago

This is super interesting! I'm not familiar with MinIO -- is that essentially a way to mimic S3 on your own server? Like, if the parquet file were on S3, would you still need MinIO?

I wasn't able to run the snippet above; I'm getting:

ebird <- open_dataset(path)
Error: IOError: When getting information for key 'birddb/parquet' in bucket 'shared-data': AWS Error [code 15]: No response body. with address : 128.32.85.8

I'd be curious to see how this performs against the full EBD.

mstrimas commented 2 years ago

Same error when trying to connect to that GBIF bucket:

library(arrow)
bucket <- s3_bucket("gbif-open-data-us-east-1")
path <- bucket$path("occurrence/2021-04-13/occurrence.parquet/")
gbif <- open_dataset(path)

Yet I am able to see the contents via the AWS CLI:

aws s3 ls s3://gbif-open-data-us-east-1/occurrence/2021-04-13/occurrence.parquet/

cboettig commented 2 years ago

Yeah, I've seen that too -- the interface seems a little buggy. If I restart R and try again, I can usually get it to connect. You can also check whether the 'official' examples at https://arrow.apache.org/docs/r/articles/fs.html work for you, to be sure.

Correct, MinIO is an open-source, self-hosted S3 service. It's pretty easy to deploy and apparently widely adopted: https://docs.min.io/docs/minio-quickstart-guide.html. If the data is already on S3, then there's no need for it; I just use my own MinIO since we have a little server on campus (with a free gigabit network) and I'm too cheap to pay for S3 :-) Support for other platforms like Azure is also in the works.
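
For anyone pointing arrow at their own MinIO deployment, the connection looks just like the snippet at the top of this thread, plus credentials for non-public buckets (the endpoint and bucket names here are placeholders):

library(arrow)

## Placeholder endpoint and bucket for a self-hosted MinIO server;
## credentials are read from the usual AWS environment variables
minio <- s3_bucket("my-bucket",
  scheme = "https",
  endpoint_override = "minio.example.com",
  access_key = Sys.getenv("AWS_ACCESS_KEY_ID"),
  secret_key = Sys.getenv("AWS_SECRET_ACCESS_KEY")
)
ds <- open_dataset(minio$path("path/to/parquet"))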