dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[CT-3528] [Feature] Define a "yeslist" of downstream projects that can reference specific `protected` model(s) #9340

Open jtcohen6 opened 9 months ago

jtcohen6 commented 9 months ago

Is this your first time submitting a feature request?

Describe the feature

prompted by Slack conversation with @eivind-stb

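There are currently three access modifiers that determine where a model can be ref'd: private (same group only), protected (same project or package, the default), and public (any group, package, or project).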

I've heard from a number of folks who want something in between protected + public: "public, but limited to this project + [project x, project y]"

Rather than a "downgrade" of public, I think this is actually an "upgrade" of protected: "only this project, plus specific other projects [if/as declared]"

I'm imagining a new model config, which I'll call protected_yeslist for now (naming suggestions welcome!):

models:
  - name: my_protected_model
    config:
      access: protected  # optional, since it's already the default
      protected_yeslist:
        - name_of_other_project
        - name_of_another_project

This should work for both types of cross-project references:

As a starting point, we'd want to update the logic here that determines whether a ref to a protected model is valid.
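
A minimal sketch of what that check might look like with the proposed config, assuming a simplified `Model` record and a hypothetical `ref_is_allowed` helper (neither is dbt-core's actual class or function; `protected_yeslist` is the placeholder name from this proposal):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Model:
    # Hypothetical stand-in for a parsed model node; not dbt-core's real class.
    name: str
    project: str
    access: str = "protected"  # "private" | "protected" | "public"
    group: Optional[str] = None
    protected_yeslist: List[str] = field(default_factory=list)


def ref_is_allowed(
    target: Model, from_project: str, from_group: Optional[str] = None
) -> bool:
    """Return True if a node in `from_project` may ref `target`."""
    if target.access == "public":
        return True
    if target.access == "private":
        # Private models are only ref-able from the same group in the same project.
        return (
            from_project == target.project
            and target.group is not None
            and from_group == target.group
        )
    # Protected: the same project, plus any project named in the proposed yeslist.
    return from_project == target.project or from_project in target.protected_yeslist


model = Model(
    name="my_protected_model",
    project="producer_project",  # assumed project name, for illustration only
    protected_yeslist=["name_of_other_project", "name_of_another_project"],
)
```

With an empty yeslist the protected branch reduces to today's same-project check, so existing projects would see no change in behavior.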

Thinking about complementary experiences (dbt Explorer): If a model is protected, I would not expect it to be discoverable by anyone (as truly public models are). It's available on a "need-to-know" / "need-to-ref" basis.

Describe alternatives you've considered

Modifying the behavior of public models rather than protected models. This seems to be most people's first intuition when asking for this capability. But I find the extension of protected much more satisfying! As @eivind-stb put it:

Ironically I think the fourth access modifier should have been named "protected" and the current "protected" for within a project should be named "project" or "local".

This gets us to that desired nomenclature, without any change to existing behavior.

Not doing this. Model access in dbt is really about access to metadata (discoverability); it shouldn't be mistaken for access on the underlying data, which is managed via grants (or more granular access policies) within the underlying data platform. Though this can be facilitated by dbt, via the grants config, it's not dbt's ultimate responsibility.

Who will this benefit?

Folks adopting dbt Mesh who want to limit metadata access to specific models/projects

Are you interested in contributing this feature?

always :) just a question of timing!

Anything else?

No response

eivind-stb commented 9 months ago

I agree with the proposed solution. For our company, which has a dbt Mesh implementation, protected with a defined yeslist (I suggest protected_whitelist) will be more useful than straight-up public models.

I also agree that I would not expect to find any models with a protected_yeslist in the list of public models in dbt Explorer.

katieclaiborne commented 6 months ago

Is there value in considering this question from the project level, rather than the model level?

My use case might be slightly different, in that I'm thinking about model access as a way to shape our account-level project lineage graph. I'm happy with full transparency around metadata and discoverability, but I'd like to have a mechanism to guide the relationships created through cross-project ref.

With cycle detection at the project level, I've been trying to optimize our cross-project graph the same way I'd optimize a single project's. Do the same modeling concepts apply? The ones I hold close are those defined in dbt project evaluator, which overlap with those from the inherited project refactoring session at Coalesce 2022.

Let me know if I'm too far afield here, but I think I'd prefer to have a way to define which projects can access all of a given project's public models, rather than determining cross-project access at the model level.

jtcohen6 commented 6 months ago

@katieclaiborne Thanks for thinking through this!

In the case of:

project_a --> project_b --> project_c

It sounds like you want a way to say, "The models in project_c should never be able to reach out and access the public models in project_a."

I have a few more questions, if you're willing to humor me:

katieclaiborne commented 6 months ago

Of course!

It's an expectation I'd imagine putting in place within project_a, as a "yeslist". I wondered about having a references.yml file, as a companion to dependencies.yml. The first would define which projects are allowed to reference the root project, just as the second defines which projects the root project references.
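
To make the project-level idea concrete, here is a rough sketch of the check it implies. The `references.yml` file does not exist in dbt today, so the mapping below is an assumed, already-parsed representation of it, and `project_may_reference` is a hypothetical helper:

```python
from typing import Dict, List

# Hypothetical parsed contents of each project's `references.yml`:
# upstream project -> downstream projects allowed to reference it.
ALLOWED_REFERENCERS: Dict[str, List[str]] = {
    "project_a": ["project_b"],
}


def project_may_reference(upstream: str, downstream: str) -> bool:
    """Project-level check: may `downstream` ref public models in `upstream`?"""
    if upstream == downstream:
        return True
    allowed = ALLOWED_REFERENCERS.get(upstream)
    if allowed is None:
        # No yeslist declared: assume today's behavior (public means public).
        return True
    return downstream in allowed
```

In the `project_a --> project_b --> project_c` example, this would let project_b reference project_a while rejecting a direct reference from project_c.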

Yes, node-level access restrictions would feel more appropriate to me if dbt were to support cycle detection at the node level. I've also wondered whether model groups could serve as a middle ground!

I'm wrestling with how to observe and evaluate our account-level project relationships. The project DAG in Explorer is great, but some of the emerging relationships have my pattern recognition brain going haywire, when really, there may not be cause for concern.

jtcohen6 commented 6 months ago

@katieclaiborne Following up from our conversation last week! It feels like we were getting at a distinction between:

  1. One team/project leveraging a "final" model from another team/project — a "data product" that should be searchable/discoverable for everyone, and reference-able across projects
  2. Cross-project references for "internal" purposes, i.e. staging models for one source in a common staging project → the 1-2 domain project(s) which will leverage those staging models

It feels to me like the proposal in this issue is in keeping with that distinction:

This also feels in keeping with the (loose) inspiration we're taking from object-oriented languages, where "protected" means accessible within the same package and/or by "friend" classes.

I think this is what that might look like in practice:

# common_staging_project/dbt_project.yml

models:
  common_staging_project:
    staging:
      finance_stuff:
        +access: protected
        # should we call this 'derived_projects', or 'friends' ? :)
        +protected_yeslist: ['finance']
      marketing_stuff:
        +access: protected
        +protected_yeslist: ['finance']

I've also wondered whether model groups could serve as a middle ground!

Is your idea here that, rather than defining this as a new config (protected_yeslist) — so long as the group config matches across both projects — then the reference to a protected model in the other project is allowed? (Thanks to @jenna-jordan's comment here which helped this click for me.)

That's an interesting idea!
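
As a rough sketch of that group-matching rule (an assumption about how it could work, not existing dbt behavior; `ref_allowed_by_group` is a hypothetical helper):

```python
from typing import Optional


def ref_allowed_by_group(
    target_access: str,
    target_project: str,
    target_group: Optional[str],
    from_project: str,
    from_group: Optional[str],
) -> bool:
    """Sketch: a protected model is ref-able across projects when group names match."""
    if target_access == "public":
        return True
    if from_project == target_project:
        # Within one project: protected is open to every group,
        # private only to the model's own group.
        return target_access == "protected" or (
            target_group is not None and from_group == target_group
        )
    # Cross-project: only protected models, and only when both sides share a group name.
    return (
        target_access == "protected"
        and target_group is not None
        and from_group == target_group
    )
```

Under this rule no new config is needed, but the producer has no say over which downstream projects declare a group with the same name.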

The upshot of that change would be:

katieclaiborne commented 5 months ago

Yes, I like it! To be honest, I hadn't thought through the groups implementation that far. Thanks to you and Jenna for articulating an elegant design.

My mind immediately goes to how we might visualize groups as they exist across projects (as in a slightly more granular version of the project graph in dbt Explorer), but that's well beyond the scope of this issue.

jtcohen6 commented 5 months ago

The more I think about this:

To reconcile these two requirements, I think the producer-side group needs an additional attribute. Following the existing example, this could look like:

# common_staging_project/models/groups.yml
groups:
  - name: finance
    owner:
      email: zach.jaff@jaffleshop.com
      name: Zach Jaff
    projects: # default: this project only
      - common_staging_project
      - jaffle_shop_mesh_finance
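
A minimal sketch of how a ref check could consult that attribute, assuming a simplified `Group` record (the `projects` field is the attribute proposed above; the class and the `group_allows_project` helper are hypothetical):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Group:
    # Hypothetical shape of a group carrying the proposed `projects` attribute.
    name: str
    defined_in: str  # the producer project that declares the group
    projects: List[str] = field(default_factory=list)

    def member_projects(self) -> List[str]:
        # Default when `projects` is omitted: this project only.
        return self.projects or [self.defined_in]


def group_allows_project(group: Group, from_project: str) -> bool:
    """May a node in `from_project` claim membership in this producer-side group?"""
    return from_project in group.member_projects()


finance = Group(
    name="finance",
    defined_in="common_staging_project",
    projects=["common_staging_project", "jaffle_shop_mesh_finance"],
)
```

Here the producer keeps control: a downstream project that merely declares a group named `finance` is not admitted unless it is listed under `projects`.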

What does this mean?

I see the primary risk of this approach as overloading (and confusing) the groups feature. Right now, groups are always a subset of project namespaces. (Even for this, there's already an exception: installed packages with restrict-access: False.) The idea that groups can extend across projects makes the diagrams more complicated, and the concept of ownership potentially more confusing.