ccao-data / data-architecture

Codebase for CCAO data infrastructure construction and management
https://ccao-data.github.io/data-architecture/

Investigate replacing AWS Athena with DuckDB #42

Closed by dfsnow 7 months ago

dfsnow commented 1 year ago

DuckDB is the new thing in the data engineering space right now. It's basically SQLite, but for analytical workloads instead of application workloads. It might fit our use case well and simplify some existing and future complexity. We should investigate its viability for our data sometime in the near future.

DuckDB's biggest value add for us is its ability to read Parquet files directly from S3. Something like the code below could be used to create a local database file full of views that are effectively pointers to Parquet files in S3. We could then build views on top of those pointers, just like in Athena. This would essentially make a DuckDB file our main query layer, replacing Athena.

library(DBI)
library(duckdb)

# Connect to (and create, if needed) a local DuckDB database file
conn <- DBI::dbConnect(duckdb("test.db"))

# Install and load the httpfs extension, then set S3 credentials so
# DuckDB can read Parquet files directly from S3
dbExecute(
  conn,
  "
  INSTALL httpfs;
  LOAD httpfs;
  SET s3_region='---';
  SET s3_access_key_id='---';
  SET s3_secret_access_key='---';
  "
)

# Create a schema and a stub view that acts as a pointer to the
# Parquet files in S3, analogous to a table in Athena
dbExecute(
  conn,
  "
  CREATE SCHEMA IF NOT EXISTS model;
  CREATE OR REPLACE VIEW model.shap AS
    SELECT *
    FROM parquet_scan('s3://bucket_name/shap/year=2023/run_id=2023-03-15-clever-kyra/*/*.parquet', hive_partitioning=true);
  "
)

# Query the stub view; filters on the Hive partition columns prune
# which Parquet files DuckDB actually reads from S3
shap <- dbGetQuery(
  conn,
  "
  SELECT *
  FROM model.shap
  WHERE run_id = '2023-03-15-clever-kyra'
    AND year = '2023'
    AND township_code = '77'
  "
)

Why (would we do this)

How (would we do this)

I think DuckDB could essentially be a drop-in replacement for Athena. We would create stub views (as shown above) that act as tables currently do in Athena. We would then just build our existing views on top of those pointers.
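
For example (a sketch; the view name and aggregation are hypothetical, not from our actual schema), a derived view could be layered directly on the stub view created above:

# Hypothetical derived view layered on top of the model.shap stub
# view; the column names and aggregation are illustrative only
dbExecute(
  conn,
  "
  CREATE OR REPLACE VIEW model.shap_by_township AS
    SELECT township_code, COUNT(*) AS num_obs
    FROM model.shap
    GROUP BY township_code;
  "
)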

The result would be a local file that contains pointers and view definitions for our entire data lake. This file would be built and tested via GitHub Actions and then uploaded to S3. From there, we could download a single local copy of the file to the main Data Dept. server, and point all connections to that file. Any changes to the views or underlying data would kick off a new workflow and replace the local file.
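
On the consumption side, something like this might work (a sketch assuming the aws.s3 package; the bucket, key, and local path are hypothetical):

library(aws.s3)

# Pull the latest built database file from S3
# (hypothetical bucket, key, and local path)
save_object(
  object = "warehouse/catalog.duckdb",
  bucket = "bucket_name",
  file = "/data/catalog.duckdb"
)

# Open the shared file read-only so many sessions can query it at once
conn <- DBI::dbConnect(duckdb("/data/catalog.duckdb", read_only = TRUE))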

If we wanted, we could also just build the entire current version of the data lake into the DB file. This would likely speed things up even further for local queries on the server.
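
Materializing could be as simple as one CTAS per view (a sketch; the table name is hypothetical):

# Materialize the stub view into a local table so that queries
# against it no longer touch S3 at all
dbExecute(
  conn,
  "
  CREATE OR REPLACE TABLE model.shap_local AS
    SELECT * FROM model.shap;
  "
)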

Complications

dfsnow commented 1 year ago

@jeancochrane We should discuss this sometime next week. Curious to hear your thoughts! If I'm obviously bandwagoning and this looks like bunk to you, I want to hear that too.

Note: Another use case could be transpiling Athena SQL to DuckDB using https://github.com/tobymao/sqlglot, then testing view and table definitions against a local DuckDB instance instead of an Athena test env.
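
To keep everything in R, sqlglot could be called via reticulate, something like the sketch below (Athena's engine is Presto-based, so this reads with the presto dialect; the example query is hypothetical):

library(reticulate)

# sqlglot is a Python library; import it into R via reticulate
sqlglot <- import("sqlglot")

# Transpile a Presto/Athena expression into its DuckDB equivalent;
# returns one transpiled string per input statement
sqlglot$transpile(
  "SELECT date_format(from_unixtime(ts), '%Y') AS yr FROM model.shap",
  read = "presto",
  write = "duckdb"
)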

dfsnow commented 7 months ago

This was a dumb idea. Athena is doing all of the schema versioning/tracking for us, something DuckDB doesn't yet handle well. That said, I think there might be a use for DuckDB elsewhere in our stack (e.g. as a backend for web apps).