This PR uses the Python model dependency deployment system established in https://github.com/ccao-data/data-architecture/pull/435 to power the Python dependencies in `reporting.ratio_stats`, our first Python model. We create a new table, `python_model_dependency`, that gets referenced in `reporting.ratio_stats` via a `dbt.ref()` call as an indirect way of calling `get_s3_dependency_dir()` in the context of the Python model code.

The design is a little bit counterintuitive due to the fact that Python models 1) currently have no equivalent to macros that would allow us to reuse code and 2) only support accessing project context that is passed in via config variables. If it weren't for these two limitations, I would have preferred one of two alternative solutions:
1. **Defining a Python version of the `get_s3_dependency_dir()` macro** that we could call directly from the context of the `reporting.ratio_stats` Python model. This is impossible due to limitation 1) above, since there is no equivalent of macros for Python models yet. We could consider deploying a separate bundle to S3 just for this one macro, but we wouldn't be able to namespace it properly by user or branch in dev/CI environments, since scripts would need to pull the bundle containing `get_s3_dependency_dir()` before they know the location of their S3 dependency dir in the first place. If Python macros were supported, this alternative solution would have entailed Python model code looking something like this:
   ```python
   from macros import get_s3_dependency_dir

   # Pull this model's dependency bundle from S3 before defining the model
   sc.addPyFile(f"{get_s3_dependency_dir()}/reporting.ratio_stats.zip")

   def model(dbt, spark_session):
       ...
   ```
2. **Passing in the value returned by the SQL `get_s3_dependency_dir()` macro via configs.** This is impossible due to limitation 2) above, since only a subset of builtin macros are available at the time when schema files are compiled (see discussion here). If all macros were available at compile time, this alternative solution would have entailed a schema file looking something like this:
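   A minimal sketch of what that schema file might have looked like. The model entry and the `s3_dependency_dir` config key are assumptions for illustration, not an actual schema from this repo:

   ```yaml
   # Hypothetical schema.yml entry. Assumes the macro's return value could be
   # injected as a model config at compile time and then read back inside the
   # Python model (e.g. via dbt.config.get("s3_dependency_dir")).
   models:
     - name: ratio_stats
       config:
         s3_dependency_dir: "{{ get_s3_dependency_dir() }}"
   ```

   This fails in practice because `get_s3_dependency_dir()` is not among the macros available when schema files are compiled.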
Closes #439. Note the one extra task from #439 that isn't completed as part of this PR (adding docs for Python models) -- I think I'd prefer to spin that off into a follow-up issue if you're comfortable with it, since some aspects of our use of Python models are still under active consideration (e.g. which types of transformations should use Python models vs. SQL models vs. Python/R scripts).
Agreed with your proposed design @dfsnow! I'm going to merge this in as-is so that we can preserve a commit with the current design, but I'm going to open up a fast follow that refactors to simplify things.