dbt-msft / tsql-utils

dbt-utils for the dbt-msft family of packages
MIT License

How to follow the original dbt package and why a submodule #36

Closed b-per closed 2 years ago

b-per commented 3 years ago

Hi,

I am in the process of creating TSQL shims for the great packages dbt-date and dbt-expectations (dbt-date-tsql and dbt-expectations-tsql) and I have followed a similar approach to tsql-utils, creating a submodule linking to the original package.

I am actually wondering: why are dbt shims following this approach, and what are the advantages? Is this to track a specific version of the original package? But in that case, shouldn't .gitmodules be defining a specific branch or commit from this package?

And second question, once the packages are ready, would it make sense to move them under the dbt-msft umbrella?

Cheers,

dataders commented 3 years ago

First, I'm stoked @b-per that you're working on those TSQL shims -- these were next on our team's list to shim/port. Welcome to dbt-msft! As for your questions... ramblings below.

are submodules tied to specific commit/release?

Is this to track a specific version of the original package? But in that case, shouldn't .gitmodules be defining a specific branch or commit from this package?

Yes, if you look at the root dir of this package on GitHub, you'll see dbt-utils @ bbba960, which means a specific commit of the dbt-utils repo. I'm no expert on git submodules, but somehow the commit of the submodule is tracked outside of the .gitmodules file.
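
To make that concrete, here is a rough sketch of the two pieces involved (the URL and hash are illustrative, not copied from this repo). .gitmodules only maps a path to a remote URL; the pinned commit is stored as a special "gitlink" entry in the parent repo's own tree:

# .gitmodules -- records *where* the submodule comes from, not *which* commit
[submodule "dbt-utils"]
    path = dbt-utils
    url = https://github.com/dbt-labs/dbt-utils.git

# the pinned commit is a tree entry of the parent repo; mode 160000 marks
# a "gitlink" (a commit reference), not a regular directory:
$ git ls-tree HEAD dbt-utils
160000 commit bbba960...    dbt-utils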

why shim?

why are dbt shims following this "creating a submodule linking to the original package" approach and what are the advantages?

maintenance strategy/philosophy

When it comes to adding/modifying logic to an existing project, the choices are:

| option | description | maintainer impact | contributor impact |
| --- | --- | --- | --- |
| merge | merging the new logic/code into the original project | larger codebase, more work to maintain | least amount of work |
| fork | forking the entire project and changing the logic of the fork | less control of UX for fork users | duplicated code, effort to keep current with upstream |
| shim | shimming the new logic into a new plugin that the original package can run | more work upfront to build in plugin support, low effort long-term | more work upfront, low effort long-term |

Given that there are now 6+ custom adapters, I would not want to be in charge of maintaining dbt-utils for all those adapters, nor of having the requisite knowledge of the 10+ databases. Shimming does seem to make the most sense.

That said, this was the choice for the maintainers of dbt-utils, but maintainers of other packages (e.g. dbt-ml-preprocessing and dbt-external-tables) have welcomed intra-package support for dbt-msft's adapters.

package ecosystem

dbt-core has evolved significantly in the past year to better support community plugins and packages.

One of the most mind-blowing aspects of dbt is that every dbt project is a package, and packages can have sub-packages. For example, tsql-utils defines three dbt projects on top of dbt-utils's two. So, IMHO, adapter-specific shim packages fit right into this philosophy of interoperable, modularized, and purpose-built projects.

Also, in theory, the officially supported adapters could be in their own repos, but that would probably add unnecessary complexity for the maintainers.

yes, and...

Speaking of unnecessary complexity... perhaps we could bundle the shims for dbt-utils, dbt-date, and dbt-expectations all into tsql-utils (maybe salted with some non-shimmed useful TSQL macros like grant_access)? I think we could do this, but just because we could doesn't mean we should... @jtcohen6 @alittlesliceoftom, would love your $0.02 when y'all have a moment.

b-per commented 3 years ago

Thanks for taking the time to reply @swanderz !

I think it answers some of my questions but also raises new ones :-)

It seems that there are 2 ways of managing dependencies on other dbt packages.

  1. using submodules, like in this repo
  2. or loading the 3rd party packages in the packages.yml file of the newly developed package, like in dbt-expectations here. In that case, when using the new package, the dependent ones don't need to be loaded explicitly; they are installed automatically by dbt deps (see the sketch below)
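
For option 2, a minimal packages.yml along those lines would look like this (hub package names as I believe they appeared at the time; version ranges are illustrative):

packages:
  # upstream packages the new package's macros depend on
  - package: fishtown-analytics/dbt_utils
    version: [">=0.6.0", "<0.7.0"]
  - package: calogica/dbt_date
    version: [">=0.2.0", "<0.3.0"]

dbt deps resolves these transitively, so anyone installing the new package gets its dependencies as well.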

And my question then becomes: why are shims like tsql-utils and spark-utils currently using option 1 while other packages go with option 2?

dataders commented 3 years ago

why are shims like tsql-utils and spark-utils currently using option 1 while other packages go with option 2?

TL;DR: I'm 95% sure that this is a false dichotomy and that the two options you present solve different problems.

option 1: extend the main package's macros to work for other adapters

tsql-utils and spark-utils both add functionality to dbt-utils by taking advantage of the fact that every single macro in dbt-utils uses dbt's adapter.dispatch() method (helpful docs). With option 1, end users never invoke tsql_utils.concat() or spark_utils.concat(); it is just always dbt_utils.concat(), regardless of the adapter you are using.

For an example of this, see the sketch below.
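
A minimal sketch of the pattern (macro bodies are simplified and illustrative, not the exact dbt-utils source; the dispatch call signature has also changed across dbt versions):

{# in dbt-utils (simplified): the public macro does nothing but dispatch #}
{% macro concat(fields) %}
  {{ return(adapter.dispatch('concat', 'dbt_utils')(fields)) }}
{% endmacro %}

{# default implementation, used when no adapter-specific one is found #}
{% macro default__concat(fields) %}
    concat({{ fields | join(', ') }})
{% endmacro %}

{# in tsql-utils: supply only the adapter-prefixed implementation #}
{% macro sqlserver__concat(fields) %}
    {{ fields | join(' + ') }}
{% endmacro %}

With the dispatch search order configured to include tsql_utils, a call to dbt_utils.concat() on a SQL Server target resolves to sqlserver__concat automatically.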

Another great perk is that tsql-utils can leverage dbt-utils's integration testing capabilities. It's a quick and easy way to validate that the shimmed macros work on the custom adapters as intended.
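
Roughly, that workflow looks like this (commands and paths are a sketch; the repo's CI may differ, and a profile pointing at a SQL Server/Synapse target is assumed):

git clone --recurse-submodules https://github.com/dbt-msft/tsql-utils.git
cd tsql-utils/dbt-utils/integration_tests
dbt deps                          # install the test project's dependencies
dbt seed && dbt run && dbt test   # run dbt-utils's own suite against the adapter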

option 2: cross-package macro dependency

making a new package of macros that themselves rely on macros defined in other packages

dbt-expectations's expect_row_values_to_have_recent_data.sql has the perfect example of how its dependencies are used. You can see that the test_expect_row_values_to_have_recent_data macro makes use of dbt_utils.dateadd() and dbt_date.now():

{% macro test_expect_row_values_to_have_recent_data(model, column_name, datepart, interval) %}
select
    case when count(*) > 0 then 0
    else 1
    end as error_result
from {{model}}
where
    {{column_name}} >=
        {{ dbt_utils.dateadd(datepart, interval * -1, dbt_date.now()) }}
    and
    {{ column_name }} <= {{ dbt_date.today() }}
{% endmacro %}
jtcohen6 commented 3 years ago

Speaking of unnecessary complexity... perhaps we could bundle the shims for dbt-utils, dbt-date, and dbt-expectations all into tsql-utils (maybe salted with some non-shimmed useful TSQL macros like grant_access)? I think we could do this, but just because we could doesn't mean we should... @jtcohen6 @alittlesliceoftom, would love your $0.02 when y'all have a moment.

I think it's totally reasonable to bundle like that! That's what I've been doing over in spark-utils, which shims macros for dbt-utils + snowplow, and could have some other spark-specific utilities / good stuff. It's up to you as maintainers, at the end of the day.

I think it's possible to do this specifically because of the distinction you laid out nicely above. A shim package can be used alongside any number of packages for which it provides compatibility. By setting up integration testing via git submodule, rather than explicit package dependencies, installing a shim package for compatibility with one package does not actually require the installation of all the other packages for which it provides coverage.
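
Concretely, an end user who only wants dbt-utils compatibility on SQL Server would install just these two (names and versions illustrative); installing the shim does not drag in dbt-date or dbt-expectations themselves:

packages:
  - package: fishtown-analytics/dbt_utils
    version: 0.6.4
  # the shim, pulled straight from git; no upstream packages come with it
  - git: https://github.com/dbt-msft/tsql-utils.git
    revision: master  # or better, a pinned tag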

dataders commented 2 years ago

closing as I think this is resolved. let me know if otherwise