ccao-data / data-architecture

Codebase for CCAO data infrastructure construction and management
https://ccao-data.github.io/data-architecture/

Use new view to pull dependency bundle in `reporting.ratio_stats` #453

Closed jeancochrane closed 1 month ago

jeancochrane commented 1 month ago

This PR uses the Python model dependency deployment system established in https://github.com/ccao-data/data-architecture/pull/435 to power the Python dependencies in reporting.ratio_stats, our first Python model. We create a new table python_model_dependency that gets referenced in reporting.ratio_stats via a dbt.ref() call as an indirect way of calling get_s3_dependency_dir() in the context of the Python model code.
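
As a rough sketch of the indirection described above, the model could read the dependency location from the referenced table and add the bundle to the Spark context. The column names `model_name` and `s3_dependency_dir` and the `bundle_path()` helper are assumptions for illustration, not necessarily what this PR implements:

```python
MODEL_NAME = "reporting.ratio_stats"


def bundle_path(s3_dependency_dir: str, model_name: str) -> str:
    """Build the S3 path of a model's zipped bundle (hypothetical helper)."""
    return f"{s3_dependency_dir.rstrip('/')}/{model_name}.zip"


def model(dbt, spark_session):
    # dbt.ref() returns a DataFrame for the referenced table and registers
    # that table as an upstream dependency of this model
    deps = dbt.ref("python_model_dependency")

    # Assumed schema: one row per Python model, with hypothetical columns
    # `model_name` and `s3_dependency_dir`
    dirs = {row["model_name"]: row["s3_dependency_dir"] for row in deps.collect()}

    # Ship the zipped bundle to Spark so its packages can be imported
    spark_session.sparkContext.addPyFile(bundle_path(dirs[MODEL_NAME], MODEL_NAME))

    # Placeholder return value: a real model would import from the bundle
    # here and return its computed DataFrame
    return deps
```

The path-building logic lives in a plain function so it can be checked without a Spark session.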

The design is a little counterintuitive because Python models 1) currently have no equivalent to macros that would allow us to reuse code and 2) only support accessing project context that is passed in via config variables. If it weren't for these two limitations, I would have preferred one of two alternative solutions:

  1. Defining a Python version of the get_s3_dependency_dir() macro that we could call directly from the context of the reporting.ratio_stats Python model. This is impossible due to limitation 1) above, since there is no equivalent of macros for Python models yet. We could think about deploying a separate bundle to S3 just for this one macro, but we wouldn't be able to namespace it properly by user or branch in dev/CI environments, since scripts would need to pull the bundle containing get_s3_dependency_dir() before they know the location of their S3 dependency dir in the first place. If Python macros were supported, this alternative solution would have entailed Python model code looking something like this:
     from macros import get_s3_dependency_dir
     sc.addPyFile(f"{get_s3_dependency_dir()}/reporting.ratio_stats.zip")

     def model(dbt, spark_session):
         ...
  2. Passing in the value returned by the SQL get_s3_dependency_dir() macro via configs. This is impossible due to limitation 2) above, since only a subset of builtin macros are available at the time when schema files are compiled (see discussion here). If all macros were available at compile time, this alternative solution would have entailed a schema file looking something like this:
     models:
       - name: reporting.ratio_stats
         config:
           s3_dependency_dir: '{{ get_s3_dependency_dir() }}'
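
Had the macro been available at compile time, the model side of this second alternative might have looked like the sketch below. `dbt.config.get()` is how dbt Python models read config values; the `s3_dependency_dir` key comes from the hypothetical schema file above, and the helper name is an assumption:

```python
def dependency_bundle_from_config(dbt, model_name: str) -> str:
    """Resolve a model's bundle path from its config (hypothetical helper)."""
    s3_dependency_dir = dbt.config.get("s3_dependency_dir")
    # Treat a missing or empty value as a configuration error
    if not s3_dependency_dir:
        raise ValueError("s3_dependency_dir is not set in the model config")
    return f"{s3_dependency_dir}/{model_name}.zip"
```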

Closes #439. Note the one extra task from #439 that isn't completed as part of this PR (adding docs for Python models) -- I think I'd prefer to spin that off into a follow-up issue if you're comfortable with it, since some aspects of our use of Python models are still under active consideration (e.g. which types of transformations should use Python models vs. SQL models vs. Python/R scripts).

jeancochrane commented 1 month ago

Agreed with your proposed design @dfsnow! I'm going to merge this in as-is so that we can preserve a commit with the current design, but I'm going to open up a fast follow that refactors to simplify things.