ccao-data / data-architecture

Codebase for CCAO data infrastructure construction and management
https://ccao-data.github.io/data-architecture/

Explore using `dbt-athena`'s high availability option #460

Open dfsnow opened 1 month ago

dfsnow commented 1 month ago

Recently, we've merged some PRs that required rebuilding our full dbt project in Athena. A full rebuild deletes the existing tables from Athena before the replacement queries finish, meaning the tables are unavailable for the duration of the rebuild.

This is typically fine, since we can simply merge such rebuild PRs after work hours. However, as more people, applications, and processes come to depend on the Athena warehouse, that downtime may become untenable.

The dbt Athena adapter has a built-in solution for this in the form of high availability tables. We should experiment with the capability to see if it works as promised. If it does, it should reduce our downtime and/or the consequences of a failed rebuild PR.
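For context, my loose understanding (from the dbt-athena docs, so take it with a grain of salt) is that HA tables avoid downtime by building each run into a fresh, uniquely named S3 prefix and then swapping the Glue catalog pointer to the new location. A minimal boto3 sketch of that swap pattern, with placeholder bucket/database/table names that are not our real config:

```python
# Hedged sketch of the catalog-swap pattern that HA tables appear to rely on:
# write the new build to a fresh, uniquely named S3 prefix, then repoint the
# Glue table at it in a single API call. All names below are hypothetical.
import uuid
import boto3

glue = boto3.client("glue", region_name="us-east-1")

def swap_table_location(database: str, table: str, new_location: str) -> None:
    """Repoint an existing Glue table at a new S3 prefix."""
    current = glue.get_table(DatabaseName=database, Name=table)["Table"]
    # get_table returns read-only fields that update_table rejects, so build
    # a fresh TableInput containing only the mutable fields we care about.
    table_input = {
        "Name": current["Name"],
        "StorageDescriptor": {
            **current["StorageDescriptor"],
            "Location": new_location,
        },
        "PartitionKeys": current.get("PartitionKeys", []),
        "TableType": current.get("TableType", "EXTERNAL_TABLE"),
        "Parameters": current.get("Parameters", {}),
    }
    glue.update_table(DatabaseName=database, TableInput=table_input)

# Hypothetical usage: once the new build lands at a unique prefix, readers
# only ever see the old data or the new data, never a half-deleted table.
new_prefix = f"s3://example-bucket/my_table/{uuid.uuid4()}/"
swap_table_location("my_database", "my_table", new_prefix)
```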

jeancochrane commented 1 month ago

Looks like this wasn't quite ready for production yet. Some failures we noticed after merging:

Looks like dbt is too dumb to clean up the old, non-HA tables, but only when the tables have partitions.

This means we have to delete the old prod data ourselves after switching over to the new scheme. Not very ergonomic, but not necessarily a dealbreaker.
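For the manual cleanup, something like this boto3 sketch would do it; the bucket and prefix are placeholders for wherever the old non-HA data actually lives:

```python
# Hedged sketch of the manual cleanup step: once prod points at the new HA
# prefix, remove the orphaned non-HA data under the old prefix.
import boto3

s3 = boto3.resource("s3")

def delete_prefix(bucket_name: str, prefix: str) -> int:
    """Delete every object under an S3 prefix; return the number removed."""
    bucket = s3.Bucket(bucket_name)
    deleted = 0
    # The collection's delete() batches requests (up to 1,000 keys per call)
    for response in bucket.objects.filter(Prefix=prefix).delete():
        deleted += len(response.get("Deleted", []))
    return deleted

# Hypothetical usage against the old, pre-HA table location
delete_prefix("example-prod-bucket", "warehouse/old_table/")
```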

dbt doesn't check whether the stored data underlying a seed has been deleted; it only checks the state manifest to see if the seed data has changed. That means that if you delete the seed data in S3, all the queries downstream of the seed will still run (to the extent they even can against a 0-row table).

We need to test whether this behavior is specific to seeds or applies to all models, then see if we can do anything about it.
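In the meantime, a guard like the following sketch could run before a build to catch missing seed data; the bucket and prefix names are hypothetical:

```python
# Minimal sketch of a pre-run guard: verify that the data backing a seed
# actually exists in S3, since dbt's state comparison only looks at the
# manifest, not the storage layer. Bucket and prefix are placeholders.
import boto3

s3 = boto3.client("s3")

def seed_has_data(bucket: str, prefix: str) -> bool:
    """Return True if at least one object exists under the seed's S3 prefix."""
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0

if not seed_has_data("example-bucket", "seeds/my_seed/"):
    raise RuntimeError(
        "Seed data is missing from S3; downstream models would build "
        "against a 0-row table"
    )
```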

I'm not sure exactly how HA works under the hood, but it doesn't seem to be materializing the Parquet files in the prod bucket. Weirdly, there are delete markers at the table paths. Tables built solely with Athena SQL seem to work fine, but Spark builds, which depend on the files actually existing in S3, fail.

Here's the error:

File "<stdin>", line 387, in <module>
  File "<stdin>", line 239, in model
  File "<stdin>", line 384, in <lambda>
  File "<stdin>", line 292, in ref
  File "<stdin>", line 378, in get_spark_df
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 700, in table
    return DataFrame(self._jsparkSession.table(tableName), self._sc, self._jconf)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Path does not exist: s3://ccao-athena-ctas-us-east-1/reporting/ratio_stats_input/abb06ce1-9e51-40e6-bdc5-aaceb2c5e2cc
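For debugging, here's a hedged sketch of how we could scan a table's prefix for delete markers to confirm the observation above; the bucket and prefix are placeholders:

```python
# Sketch: list object versions under a table's prefix and flag any keys whose
# latest version is a delete marker, which would explain why Spark sees no
# files at the path. Bucket and prefix below are hypothetical.
import boto3

s3 = boto3.client("s3")

def find_delete_markers(bucket: str, prefix: str) -> list[str]:
    """Return keys under a prefix whose latest version is a delete marker."""
    paginator = s3.get_paginator("list_object_versions")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for marker in page.get("DeleteMarkers", []):
            if marker.get("IsLatest"):
                keys.append(marker["Key"])
    return keys

for key in find_delete_markers("example-prod-bucket", "warehouse/my_table/"):
    print(f"latest version is a delete marker: {key}")
```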
jeancochrane commented 1 month ago

I'm having trouble reproducing the critical failure mentioned above. First I built the DAG from scratch with HA tables, which worked except for `reporting.ratio_stats`, which failed with a `glue:DeleteTable` permission error. Then I added `glue:DeleteTable` to the permissions for the Spark role and rebuilt just `reporting.ratio_stats` in a separate workflow run, which worked.

I wonder if there's some specific edge case around running a full DAG build when data already exists? Or if something is misconfigured in the prod S3 bucket? In any case, let's chat tomorrow and brainstorm ideas for debugging this further.