ccao-data / data-architecture

Codebase for CCAO data infrastructure construction and management
https://ccao-data.github.io/data-architecture/

Investigate replacing AWS Athena with DuckDB #42

Closed by dfsnow 7 months ago

dfsnow commented 1 year ago

DuckDB is the new thing in the data engineering space right now. It's basically SQLite, but for analytical workloads instead of application workloads. It might fit our use case well and simplify some existing and future complexity. We should investigate its viability for our data sometime in the near future.

DuckDB's biggest value add for us is its ability to read Parquet files directly from S3. Something like the code below could be used to create a local database file full of views that are effectively pointers to Parquet files in S3. We could then build views on top of those pointers, just like in Athena. This would essentially make a DuckDB file our main query layer, replacing Athena.

library(DBI)
library(duckdb)

# Connect to (and create, if needed) a local DuckDB database file
conn <- DBI::dbConnect(duckdb("test.db"))

# Install and load the httpfs extension, then set S3 credentials so
# DuckDB can read Parquet files directly from S3
dbExecute(
  conn,
  "
  INSTALL httpfs;
  LOAD httpfs;
  SET s3_region='---';
  SET s3_access_key_id='---';
  SET s3_secret_access_key='---';
  "
)

# Create a schema and a stub view that acts as a pointer to the
# Parquet files in S3, analogous to a table in Athena
dbExecute(
  conn,
  "
  CREATE SCHEMA IF NOT EXISTS model;
  CREATE OR REPLACE VIEW model.shap AS
    SELECT *
    FROM parquet_scan('s3://bucket_name/shap/year=2023/run_id=2023-03-15-clever-kyra/*/*.parquet', hive_partitioning=true);
  "
)

# Query the stub view; filters on the Hive partition columns prune
# which Parquet files DuckDB actually reads from S3
shap <- dbGetQuery(
  conn,
  "
  SELECT *
  FROM model.shap
  WHERE run_id = '2023-03-15-clever-kyra'
    AND year = '2023'
    AND township_code = '77'
  "
)

Why (would we do this)

How (would we do this)

I think DuckDB could essentially be a drop-in replacement for Athena. We would create stub views (as shown above) that act as tables currently do in Athena. We would then just build our existing views on top of those pointers.
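
For example (a sketch; the view name and aggregation are hypothetical, not from our actual schema), a derived view could be layered directly on the stub view created above:

# Hypothetical derived view layered on top of the model.shap stub
# view; the column names and aggregation are illustrative only
dbExecute(
  conn,
  "
  CREATE OR REPLACE VIEW model.shap_by_township AS
    SELECT township_code, COUNT(*) AS num_obs
    FROM model.shap
    GROUP BY township_code;
  "
)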

The result would be a local file that contains pointers and view definitions for our entire data lake. This file would be built and tested via GitHub Actions and then uploaded to S3. From there, we could download a single local copy of the file to the main Data Dept. server, and point all connections to that file. Any changes to the views or underlying data would kick off a new workflow and replace the local file.
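
On the consumption side, something like this might work (a sketch assuming the aws.s3 package; the bucket, key, and local path are hypothetical):

library(aws.s3)

# Pull the latest built database file from S3
# (hypothetical bucket, key, and local path)
save_object(
  object = "warehouse/catalog.duckdb",
  bucket = "bucket_name",
  file = "/data/catalog.duckdb"
)

# Open the shared file read-only so many sessions can query it at once
conn <- DBI::dbConnect(duckdb("/data/catalog.duckdb", read_only = TRUE))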

If we wanted, we could also just build the entire current version of the data lake into the DB file. This would likely speed things up even further for local queries on the server.
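
Materializing could be as simple as one CTAS per view (a sketch; the table name is hypothetical):

# Materialize the stub view into a local table so that queries
# against it no longer touch S3 at all
dbExecute(
  conn,
  "
  CREATE OR REPLACE TABLE model.shap_local AS
    SELECT * FROM model.shap;
  "
)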

Complications

dfsnow commented 1 year ago

@jeancochrane We should discuss this sometime next week. Curious to hear your thoughts! If I'm obviously bandwagoning and this looks like bunk to you, I want to hear that too.

Note: Another use case could be transpiling Athena SQL to DuckDB using https://github.com/tobymao/sqlglot, then testing view and table definitions against a local DuckDB instance instead of an Athena test env.
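
To keep everything in R, sqlglot could be called via reticulate, something like the sketch below (Athena's engine is Presto-based, so this reads with the presto dialect; the example query is hypothetical):

library(reticulate)

# sqlglot is a Python library; import it into R via reticulate
sqlglot <- import("sqlglot")

# Transpile a Presto/Athena expression into its DuckDB equivalent;
# returns one transpiled string per input statement
sqlglot$transpile(
  "SELECT date_format(from_unixtime(ts), '%Y') AS yr FROM model.shap",
  read = "presto",
  write = "duckdb"
)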

dfsnow commented 7 months ago

This was a dumb idea. Athena is doing all of the schema versioning/tracking for us, something DuckDB doesn't yet handle well. That said, I think there might be a use for DuckDB elsewhere in our stack (e.g. as a backend for web apps).