[CT-3528] [Feature] Define a "yeslist" of downstream projects that can reference specific `protected` model(s)

jtcohen6 commented 9 months ago

Is this your first time submitting a feature request?

[X] I have read the expectations for open source contributors
[X] I have searched the existing issues, and I could not find an existing issue for this feature
[X] I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Prompted by a Slack conversation with @eivind-stb

There are currently three access modifiers that determine where a model can be ref'd:

private: can only be referenced by resources in the same group
protected: can only be referenced by resources in the same project/package
public: can be referenced anywhere (including other projects/packages)
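For anyone skimming, the three rules above could be sketched roughly like this in Python. This is only an illustration: `Model` and `can_ref` are hypothetical names, not dbt-core's actual classes or functions, and the sketch assumes a model with no group can never match a private model's group.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    # Hypothetical, simplified stand-in for a dbt model node.
    name: str
    project: str
    access: str = "protected"  # "private", "protected", or "public"
    group: Optional[str] = None


def can_ref(referrer: Model, target: Model) -> bool:
    """Return True if `referrer` may ref `target` under the three rules above."""
    if target.access == "public":
        # Anyone, in any project/package.
        return True
    if target.access == "protected":
        # Same project/package only.
        return referrer.project == target.project
    if target.access == "private":
        # Same group only (assumption: ungrouped referrers never match).
        return target.group is not None and referrer.group == target.group
    raise ValueError(f"unknown access level: {target.access}")
```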

I've heard from a number of folks who want something in between protected + public: "public, but limited to this project + [project x, project y]"

Rather than a "downgrade" of public, I think this is actually an "upgrade" of protected: "only this project, plus specific other projects [if/as declared]"

I'm imagining a new model config, which I'll call protected_yeslist for now (naming suggestions welcome!):

models:
  - name: my_protected_model
    config:
      access: protected  # optional, since it's already the default
      protected_yeslist:
        - name_of_other_project
        - name_of_another_project

This should work for both types of cross-project references:

project dependencies (docs - a feature of dbt Cloud Enterprise)
package dependencies with restrict-access: True (docs)

As a starting point, we'd want to update the logic here that determines whether a ref to a protected model is valid.
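A minimal sketch of what the updated check might look like, assuming the config ends up as a plain list of project names. The function name `can_ref_protected` and its parameters are hypothetical, and this is not the existing dbt-core logic:

```python
from typing import Iterable


def can_ref_protected(
    referrer_project: str,
    target_project: str,
    protected_yeslist: Iterable[str] = (),
) -> bool:
    """Hypothetical check for a ref to a protected model.

    Valid if the referrer lives in the same project/package (today's rule),
    or if its project is named in the target's protected_yeslist (proposed).
    """
    if referrer_project == target_project:
        return True
    return referrer_project in set(protected_yeslist)
```

With an empty yeslist this reduces to the current behavior, so existing projects would see no change.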

Thinking about complementary experiences (dbt Explorer): If a model is protected, I would not expect it to be discoverable by anyone (as truly public models are). It's available on a "need-to-know" / "need-to-ref" basis.

Describe alternatives you've considered

Modifying the behavior of public models rather than protected models. This seems to be most people's first intuition when asking for this capability. But I find the extension of protected much more satisfying! As @eivind put it:

"Ironically I think the fourth access modifier should have been named "protected" and the current "protected" for within a project should be named "project" or "local"."

This gets us to that desired nomenclature, without any change to existing behavior.

Not doing this. Model access in dbt is really about access to metadata (discoverability); it shouldn't be mistaken for access to the underlying data, which is managed via grants (or more granular access policies) within the underlying data platform. Though this can be facilitated by dbt, via the grants config, it's not dbt's ultimate responsibility.

Who will this benefit?

Folks adopting dbt Mesh who want to limit metadata access to specific models/projects

Are you interested in contributing this feature?

always :) just a question of timing!

Anything else?

No response

eivind-stb commented 9 months ago

I agree with the proposed solution, and for our company, with a dbt Mesh implementation, protected with a defined yeslist (I suggest protected_whitelist) will be of more use than straight-up public models.

I also agree that I would not expect to find any models with a protected_yeslist in the list of public models in dbt Explorer.

katieclaiborne commented 6 months ago

Is there value in considering this question from the project level, rather than the model level?

My use case might be slightly different, in that I'm thinking about model access as a way to shape our account-level project lineage graph. I'm happy with full transparency around metadata and discoverability, but I'd like to have a mechanism to guide the relationships created through cross-project ref.

With cycle detection at the project level, I've been trying to optimize our cross-project graph the same way I'd optimize a single project's. Do the same modeling concepts apply? The ones I hold close are those defined in dbt project evaluator, which overlap with those from the inherited project refactoring session at Coalesce 2022.

Let me know if I'm too far afield here, but I think I'd prefer to have a way to define which projects can access all of a given project's public models, rather than determining cross-project access at the model level.

jtcohen6 commented 6 months ago

@katieclaiborne Thanks for thinking through this!

In the case of:

project_a --> project_b --> project_c

It sounds like you want a way to say, "The models in project_c should never be able to reach out and access the public models in project_a."

I have a few more questions, if you're willing to humor me:

Is that an expectation you'd expect to put in place within project_a?
As a "yeslist" (only project_b can access) or as a "nolist" (every project except project_c can access)?
How would your preference change if dbt were to support cycle detection at the node level, rather than the project level? Would node-level access restrictions feel more appropriate?

katieclaiborne commented 6 months ago

Of course!

It's an expectation I'd imagine putting in place within project_a, as a "yeslist". I wondered about having a references.yml file, as a companion to dependencies.yml. The first would define which projects are allowed to reference the root project, just as the second defines which projects the root project references.
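To make the idea concrete, here is a rough Python sketch of a project-level check. Everything in it is an assumption for illustration: no references.yml exists today, the dict below just stands in for whatever such a file might parse to, and `project_may_reference` is a made-up name.

```python
from typing import Dict, Set

# Hypothetical parsed contents of a references.yml in each producer project:
# producer project name -> projects allowed to reference its public models.
ALLOWED_REFERRERS: Dict[str, Set[str]] = {
    "project_a": {"project_b"},
}


def project_may_reference(
    consumer: str,
    producer: str,
    allowed: Dict[str, Set[str]] = ALLOWED_REFERRERS,
) -> bool:
    """Return True if `consumer` may ref public models in `producer`."""
    if consumer == producer:
        return True
    # Assumption: a producer that declares no yeslist stays open to everyone.
    if producer not in allowed:
        return True
    return consumer in allowed[producer]
```

In the project_a --> project_b --> project_c example, this lets project_b reference project_a while rejecting a direct reference from project_c.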

Yes, node-level access restrictions would feel more appropriate to me if dbt were to support cycle detection at the node level. I've also wondered whether model groups could serve as a middle ground!

I'm wrestling with how to observe and evaluate our account-level project relationships. The project DAG in Explorer is great, but some of the emerging relationships have my pattern recognition brain going haywire, when really, there may not be cause for concern.

jtcohen6 commented 6 months ago

@katieclaiborne Following up from our conversation last week! It feels like we were getting at a distinction between:

One team/project leveraging a "final" model from another team/project — a "data product" that should be searchable/discoverable for everyone, and reference-able across projects
Cross-project references for "internal" purposes, i.e. staging models for one source in a common staging project → the 1-2 domain project(s) which will leverage those staging models

It feels to me like the proposal in this issue is in keeping with that distinction:

(1) is the currently supported pattern — discovery + reference-ability of public models by anyone
(2) is the pattern you are describing: protected models for specific downstream use

This also feels in keeping with the (loose) inspiration we're taking from other object-oriented languages, where "protected" means within same package and/or "friend" classes.

I think this is what that might look like in practice:

# common_staging_project/dbt_project.yml

models:
  common_staging_project:
    staging:
      finance_stuff:
        +access: protected
        # should we call this 'derived_projects', or 'friends' ? :)
        +protected_yeslist: ['finance']
      marketing_stuff:
        +access: protected
        +protected_yeslist: ['finance']

"I've also wondered whether model groups could serve as a middle ground!"

Is your idea here that, rather than defining this as a new config (protected_yeslist) — so long as the group config matches across both projects — then the reference to a protected model in the other project is allowed? (Thanks to @jenna-jordan's comment here which helped this click for me.)

That's an interesting idea!

I do like the economy of configs (reusing existing, don't need a new one)
I would still be inclined to keep this extensibility restricted to protected models (rather than private)
Project names are slightly more strongly typed (required to be globally unique)
Either of these can be "spoofed" by the downstream user (project name or group name) — meanwhile access to actual underlying data is still managed via DWH RBAC, and access to the relevant metadata in dbt Explorer is managed by dbt Cloud RBAC

The upshot of that change would be:

private models can be referenced by other models in the same namespace (project/package) AND same group
protected models can be referenced by other models in the same namespace (project/package) OR same group
public models can be referenced anywhere (any namespace, any group, etc)
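Spelled out as a sketch, those three rules would be something like the following. The function and parameter names are hypothetical, and it assumes a target model with no group never counts as "same group":

```python
from typing import Optional


def can_ref(
    access: str,
    referrer_project: str,
    referrer_group: Optional[str],
    target_project: str,
    target_group: Optional[str],
) -> bool:
    """Group-based variant: private = namespace AND group, protected = namespace OR group."""
    same_namespace = referrer_project == target_project
    # Assumption: an ungrouped target never matches on group.
    same_group = target_group is not None and referrer_group == target_group
    if access == "public":
        return True
    if access == "protected":
        return same_namespace or same_group
    if access == "private":
        return same_namespace and same_group
    raise ValueError(f"unknown access level: {access}")
```

Note that the protected branch is exactly where a downstream project could "spoof" its way in by declaring a group with the same name.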

katieclaiborne commented 5 months ago

Yes, I like it! To be honest, I hadn't thought through the groups implementation that far. Thanks to you and Jenna for articulating an elegant design.

My mind immediately goes to how we might visualize groups as they exist across projects (as in a slightly more granular version of the project graph in dbt Explorer), but that's well beyond the scope of this issue.

jtcohen6 commented 5 months ago

The more I think about this:

I like the idea of reusing the existing group config.
I think the opt-in needs to be "producer-side." That is, we need to give the upstream project some way of specifying exactly which downstream projects can use its protected models. (Otherwise, any "consumer" project could just configure a group with the same name, and start using them.)

To reconcile these two requirements, I think the producer-side group needs an additional attribute. Following the existing example, this could look like:

# common_staging_project/models/groups.yml
groups:
  - name: finance
    owner:
      email: zach.jaff@jaffleshop.com
      name: Zach Jaff
    projects: # default: this project only
      - common_staging_project
      - jaffle_shop_mesh_finance

What does this mean?

The finance group "extends" across both the common_staging_project and the jaffle_shop_mesh_finance
If a consumer project (jaffle_shop_mesh_finance) is named in the producer project's group, then models in the consumer project which also belong to the same group can reference its protected models
The downstream project also does not need to redefine the group + owner, because these are already defined in the upstream project. If these projects tend to be maintained by different teams, the upstream project is saying, "These models are mostly relevant to (if not also directly owned by) the downstream team." This describes some of the hub-and-spoke patterns we're seeing in practice.
Downstream projects can never directly reference private models. For references to private models, both the namespace (project/package) and the group must match.
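Here is a rough sketch of how those rules might combine, assuming the `projects` attribute lands as shown above. The `Group` dataclass and `can_ref` function are hypothetical names for illustration, not existing dbt-core code:

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Group:
    # Hypothetical model of a producer-side group definition.
    name: str
    defined_in: str  # the producer project that declares the group
    projects: FrozenSet[str] = frozenset()  # proposed attribute; empty = default

    def allowed_projects(self) -> FrozenSet[str]:
        # Default: this project only.
        return self.projects or frozenset({self.defined_in})


def can_ref(
    access: str,
    referrer_project: str,
    referrer_group: Optional[str],
    target_project: str,
    target_group: Optional[Group],
) -> bool:
    same_namespace = referrer_project == target_project
    in_group = target_group is not None and referrer_group == target_group.name
    if access == "public":
        return True
    if access == "private":
        # Both the namespace and the group must match.
        return same_namespace and in_group
    if access == "protected":
        if same_namespace:
            return True
        # Producer-side opt-in: the referrer's project must be named in the group.
        return in_group and referrer_project in target_group.allowed_projects()
    raise ValueError(f"unknown access level: {access}")
```

Because the list of projects lives on the producer's group, a consumer that merely declares a group named finance gains nothing unless its project is also listed upstream.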

I see the primary risk of this approach as overloading (and confusing) the groups feature. Right now, groups are always a subset of project namespaces. (Even for this, there's already an exception: installed packages with restrict-access: False.) The idea that groups can extend across projects makes the diagrams more complicated, and the concept of ownership potentially more confusing.

dbt-labs / dbt-core