dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

hide model from docs #1671

Closed mike-weinberg closed 4 years ago

mike-weinberg commented 5 years ago

Describe the feature

For a project with intermediate steps like B in A -> B -> C, it is sometimes the case that the intermediates are not intended for general consumption, but they will show up in the DBT-generated docs anyway.

For large products with many intermediates, it may be desirable to document the source tables and any final tables used for reporting in production, but it may be preferable for the intermediates (which could be tables, ephemeral CTEs, or temp tables) to be treated as a black box and kept hidden from the generated user-facing documents.

From an implementation perspective, it seems like a good place to put this would be in the <model>.yaml files, to both maximize extensibility and encourage good DBT hygiene (make a YAML for every model, ~even if~ especially if it isn't in the final docs!)

Describe alternatives you've considered

Martin Guindon read our minds when he suggested the problem and a viable workaround

(from DBT slack https://getdbt.slack.com/archives/C0VLZM3U2/p1565030505180400)

Martin Guindon

@Mike Weinberg (WeWork) Curious about the use case too. Is it separating internal/staging models from the business-ready models to get clean business documentation for fact tables without intermediate models? If so then perhaps splitting those in two projects and using your “staging project” as a package/dependency for the “business project” would do the trick. I haven’t tried this so perhaps it wouldn’t do what I’m thinking… Maybe Claire or Drew can jump in to tell me if my idea makes sense?

Mike Weinberg (WeWork)

Martin nailed it. We have various intermediate tables built after some final tables are built, but before the last final tables are built, and so splitting projects as a hack for documentation is not highly desirable.

We definitely want this in core and may be able to find resources to contribute to support this use case, but we would want to partner to build it the "right" way.

Mike Weinberg (WeWork)

We would greatly prefer for encapsulation concepts like projects to be meaningful, rather than splitting projects whenever there are intermediates that we would rather hide.

As suggested here, DBT projects form a powerful abstraction, and splitting a project in two as a means to achieve the above is a viable solution. However, it increases the number of projects to maintain and creates the appearance that one pipeline is really two. Ideally, if a sub-dag is not reusable, we would opt to keep it inside a single project.

Additional context

Additional considerations include:

Who will this benefit?

This benefits organizations with large projects in which there are many intermediate steps. Intermediates often exist for the purpose of accelerating builds by reducing redundant work, but the intermediates may themselves not be valuable for reporting, or may hold PII, etcetera. As is, DBT shows these models in the generated documentation for a project, and when there are a large number of intermediates, it can be difficult for end users to navigate the documentation for the models that are actually accessible and relevant to them.

This change primarily benefits the decision makers and analysts who leverage documentation as a data dictionary to understand the meaning of relations and columns, and how they relate to each other. It does this by limiting docs to only those models which are relevant to end users.

Slack Conversation

drew.banin [8 hours ago] got it - thanks for the context! I can imagine this working in a couple of different ways:

  1. add a config on models (+ other resources?) which controls whether or not the docs site renders them. This config will be compiled into the manifest in dbt docs generate
  2. create a separate config file which controls resource visibility / configuration in the docs. I can imagine showing / hiding different parts of the docs (I know some users want to hide model SQL, for instance).

One question: if your graph looks like: A ---> B ----> C

and you’ve marked B as “hidden”, would you expect to see A ---> C

in the docs DAG view?
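To make the question concrete, here is a minimal Python sketch of one possible answer: drop hidden nodes from the lineage and re-link their parents to their nearest visible descendants. The `collapse_hidden` helper and the edge-list representation are assumptions for illustration only; they are not part of dbt or dbt-docs.

```python
from collections import defaultdict


def collapse_hidden(edges, hidden):
    """Return edges between visible nodes, bridging over hidden ones.

    Hypothetical helper for illustration; not a dbt or dbt-docs API.
    `edges` is a list of (parent, child) pairs, `hidden` a set of node names.
    """
    children = defaultdict(set)
    for parent, child in edges:
        children[parent].add(child)

    def visible_descendants(node, seen):
        # Walk through hidden children until a visible node is reached.
        found = set()
        for child in children[node]:
            if child in seen:
                continue
            seen.add(child)
            if child in hidden:
                found |= visible_descendants(child, seen)
            else:
                found.add(child)
        return found

    nodes = {name for edge in edges for name in edge}
    result = set()
    for node in nodes - set(hidden):
        for target in visible_descendants(node, set()):
            result.add((node, target))
    return sorted(result)


edges = [("A", "B"), ("B", "C")]
print(collapse_hidden(edges, hidden={"B"}))  # [('A', 'C')]
```

With B hidden, the only remaining edge is A -> C, which is the rendering the question asks about.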

Mike Weinberg (WeWork) [3 hours ago] I think we are indifferent about the display of intermediates in the data lineage. End users definitely don't need (or want) to see tables they will never use, and developers don't need to see the graph because they wrote the damn dag, to borrow a phrase from a senator from vermont.

Mike Weinberg (WeWork) [3 hours ago] as a result, the choice of if and how to show the sql is dependent on if we show the intermediate or not.

If we hide B but want to show the sql for C, I would treat the code that generates it as a CTE and show the compiled sql for C using the code for B as an inlined CTE.

If we hide B and don't show the sql for C, it kinda doesn't matter. (edited)

Mike Weinberg (WeWork) [2 hours ago] as for implementation preference, I polled 4 of our heaviest DBT users. All supported (as a preference) adding a hide-from-docs param to the <model(s)>.yaml file but felt that it would be perfectly fine to put it in the sql as well. Their reason in favor of putting it in the model yaml was that it encourages analysts to document-the-undocumented models.

From a developer perspective, I support putting it in the model yaml because it is more flexible - nested configuration options for the permutations you mentioned make more sense in something like yaml than as params in a macro. I think it's also more extensible.

As for what sql to show when B is hidden, the consensus was that the sql for C should be hidden too and we should point users to github, because if they are sophisticated enough to read complex sql, they are probably not going to be overly frustrated by being told to look at source code.

mike-weinberg commented 5 years ago

@drewbanin I realize I never directly responded to the following:

create a separate config file which controls resource visibility / configuration in the docs. I can imagine showing / hiding different parts of the docs (I know some users want to hide model SQL, for instance).

I think if we were to treat this sort of like github's CODEOWNERS in which developers could provide declarative (pattern based) show/hide rules on various aspects of docs then this makes sense. That seems like the most extensible design.
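As a rough illustration of what pattern-based rules could look like, here is a Python sketch in which the last matching rule wins, as it does in CODEOWNERS. The rule format, the `docs_action` helper, and the example paths are all hypothetical; nothing like this exists in dbt today.

```python
from fnmatch import fnmatchcase


def docs_action(path, rules, default="show"):
    """Return "show" or "hide" for a model path.

    Hypothetical helper: `rules` is an ordered list of (glob pattern, action)
    pairs, and the last matching pattern takes precedence.
    """
    action = default
    for pattern, rule_action in rules:
        if fnmatchcase(path, pattern):
            action = rule_action
    return action


# Assumed example rules: show everything, then hide the intermediates.
RULES = [
    ("models/*", "show"),
    ("models/intermediate/*", "hide"),
]

print(docs_action("models/intermediate/b.sql", RULES))  # hide
print(docs_action("models/marts/c.sql", RULES))  # show
```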

In the long run I like this. In the short run if it is easier to implement this in the model yaml as a per-model option, we should start there and at some point in the future we could figure out how to do an access-control config file and make it so that you can mutually exclusively use either the access control config or the per-model options, so that there are no precedence rules to worry about.

drewbanin commented 5 years ago

Thanks for the great writeup @mike-weinberg! In general, I love the idea of being able to configure attributes of the docs site from inside of a dbt project. This is only tangentially related, but we have an issue for color coding data sources, which could probably be configurable via the same mechanism as the show/hide config described here.

I'm picturing a docs attribute in the schema.yml config, maybe something like:

version: 2

models:
  - name: my_model
    docs:
      show: false

  - name: other_model
    docs:
      color: "#def456"

sources:
  - name: public
    docs:
      color: "#abc123"

    tables:
      - name: numbers
        docs:
          show: false

dbt would just be responsible for picking up these configs and dropping them into a docs key in the node's entry in the manifest.json file. From there, we can pretty readily add logic to show/hide nodes, change colors, or otherwise customize the docs in the browser.
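A minimal Python sketch of the consuming side, assuming the proposed shape where each entry under the manifest's `nodes` key may carry `"docs": {"show": <bool>}` and nodes are shown by default. The `visible_nodes` helper and the sample manifest are illustrative assumptions, not the dbt-docs implementation.

```python
import json


def visible_nodes(manifest):
    """Return the unique_ids of nodes whose docs config does not hide them.

    Assumes the proposed shape node["docs"] = {"show": <bool>}; a node with
    no docs config, or no "show" key, is treated as visible.
    """
    shown = []
    for unique_id, node in manifest.get("nodes", {}).items():
        docs = node.get("docs") or {}
        if docs.get("show", True):
            shown.append(unique_id)
    return sorted(shown)


# Hypothetical, heavily trimmed manifest for illustration.
manifest = json.loads("""
{
  "nodes": {
    "model.my_project.my_model": {"name": "my_model", "docs": {"show": false}},
    "model.my_project.other_model": {"name": "other_model", "docs": {"color": "#def456"}},
    "model.my_project.plain": {"name": "plain"}
  }
}
""")

print(visible_nodes(manifest))
```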

I think that this is a passable implementation of the feature you're describing, but I think it also leaves some things on the table. I think there also might be merit to adding a similar config to dbt_project.yml. That way, you can configure whole groups of models by their package name or directory. That might look like:

# dbt_project.yml

models:
  my_project:
    intermediate:
      schema: "intermediate"
      docs:
        show: false

  some_package:
    docs: false

I think this will benefit from https://github.com/fishtown-analytics/dbt/issues/1503, in which you'll be able to reference the target variable from the `dbt_project.yml` config, so you could do something like:

# dbt_project.yml

models:
  my_project:
    intermediate:
      schema: "intermediate"
      docs:
        show: "{{ target.name == 'dev' }}"

This would render the complete docs in development mode, but hide some models in production deployments of the documentation.

I think a good place to start here would be to:

Support the following fields in a docs: dict:

A compiled node in the dbt manifest currently looks like this (lots of fields removed for brevity):

        "model.my_new_package.incr": {
            "name": "incr",
            "resource_type": "model",
            "package_name": "my_new_package",
            "raw_sql": "...",
            "unique_id": "model.my_new_package.incr",
            "config": { ... },
            "schema": "demo_schema_ok",
            "database": "analytics",
            "alias": "incr",
            "columns": {},
            "description": ""
        },

I think if we can include the "docs" config here, then we'll have everything we need to make this happen in the docs site :)

Some constraints that may prove helpful:

Let me know what you think about all of this!

mike-weinberg commented 5 years ago

@drewbanin I'm completely on board with this proposed interface and roadmap for this feature.

regarding

I think it should be easy to tie the two of these together, but we'd have to consider config precedence

This might seem like a hot take, but in the event that the doc-show settings for a given model are specified in both the dbt_project.yml and schema.yml, I think an error should be returned rather than applying a precedence rule, because I think precedence rules are one of those things users are least likely to memorize. (strong opinion, weakly held)
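A small Python sketch of that "error instead of precedence" behavior. The `resolve_docs_show` function and its arguments are hypothetical; `None` stands for "not set in that file", and the default of showing the model is an assumption.

```python
def resolve_docs_show(project_value=None, schema_value=None, default=True):
    """Resolve a model's docs.show setting without any precedence rule.

    Hypothetical helper: raises if the setting appears in both files,
    even when the two values agree.
    """
    if project_value is not None and schema_value is not None:
        raise ValueError(
            "docs.show is set in both dbt_project.yml and schema.yml; "
            "set it in only one place"
        )
    if project_value is not None:
        return project_value
    if schema_value is not None:
        return schema_value
    return default


print(resolve_docs_show(schema_value=False))  # False
```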

tayloramurphy commented 5 years ago

One weird use case we have sometimes is that we need to temporarily disable a model from building (b/c of a failing test or bad data or something) but we still want the docs to show for that model. Do you think this could be abstracted enough so that you could separate build from docs at the dbt_project.yml level?

Something like:

      netsuite_stitch:
        build: false
        docs: true
        base:
          materialized: table
        xf:
          materialized: table

but enabled: false would set both build and docs to false?

mike-weinberg commented 5 years ago

Hey @tayloramurphy! That makes sense. That being said, in the interest of managing scope for this issue, lmk if the following alternative might better solve that need:

Would it be better if instead you had the ability to specify a build option in dbt run such that certain build steps are skipped, like --skiplist "model1, model2, model3" (or something like that)? Modifying a static configuration file for a one-off non-standard run carries the risk of that change being forgotten or accidentally committed to source control, whereas if you could specify it in dbt run the one-off parameter would live on your command line rather than in a configuration file.

What do you think?

tayloramurphy commented 5 years ago

change being forgotten or accidentally committed to source control

Our preference would be to have the change in source control and not as a run-time configuration. Every change we make goes through a PR review process, so there would be follow-up issues to rectify and we'd be able to point other people in the company to where and how the changes were made.

Totally get wanting to manage scope on the issue. My only aim was to see if there was a higher level of abstraction possible around this since someone will be touching that code anyways.

mike-weinberg commented 5 years ago

oh I see, so if you do a backfill, you'd want the fact that a backfill was run to show up in source control, is that right? Out of curiosity, how do you guarantee that the backfill only runs the one time, or do you manually manage that change?

We've been talking about something like this internally too. ORMs support migrations for this stuff, and we are contemplating a similar process to track one-off backfills in version control. I think we lean more toward having a one-time task in a project-specific backfill (airflow) dag which would execute differently than the standard build. We are playing around with options, but in general we're looking more upstream, at our scheduling infra, because backfills / task success/failure management is one of the things workflow schedulers are (currently) better at than dbt-core. (EDIT: specifically, their advantage is that some of them are stateful - airflow has a transactional database for managing this stuff, which provides all sorts of benefits along with the added complexity)

A DBT native migration concept for managing backfills in a source-controllable way definitely would be awesome, in my opinion! (but also maybe out of scope for a show/noshow docs config MVP?)

drewbanin commented 5 years ago

@tayloramurphy when you temporarily disable a model like this, what do you do about the models that depend on that model? I believe dbt should show you a compilation error in this scenario. Do you disable all of the downstream models as well? Very happy to discuss this further, but think I'd prefer to do so in a separate issue!

drewbanin commented 5 years ago

@tayloramurphy just to follow up, my broad thinking is that this should be implemented with some sort of no-op materialization? I think that strikes the right balance between:

  1. preserving the model in the DAG
  2. rendering the model in the documentation
  3. not actually running anything against the database

This will of course fail if the destination table doesn't exist and downstream consumers depend on the model, but it sounds to me like that's a fair and reasonable tradeoff here. Definitely feel free to create a new issue if you'd like to discuss further!

mike-weinberg commented 4 years ago

I'm realizing belatedly that a nice to have would be the option to distinguish between hiding models from the index/data-dictionary vs hiding them from the lineage viz. The data dictionary is kinda for everyone, while the lineage is for the people who may be trying to debug something and need a high level picture to get started. Since the default behavior is to show things, and we want to make the config as readable as possible, it might be better if we have a hide list, rather than a show switch.

This might look like:

...
    docs:
        hide:
            - data-dictionary # completely hides from the index page, and there would be no data dictionary pages generated for the hidden models. 

This is what I would expect to see most of the time - sources and intermediates might be hidden from docs in most project pages.

If, say, lineage were to be added to that list, then the table would not show up in the lineage graph either. I think this would happen less often, as the lineage should probably be complete, since as I said earlier, those who care about the lineage are more likely to need to know about the build process and are much more likely to use it as a guide for navigating dbt project source code.

This also means someone could hide a model from the lineage graph but still document the model, which seems like a rare but plausible requirement.
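A minimal Python sketch of how a hide list could be evaluated per view. The view names `data-dictionary` and `lineage` come from the proposal above; the `is_visible` helper and the node dictionaries are hypothetical.

```python
KNOWN_VIEWS = {"data-dictionary", "lineage"}


def is_visible(node, view):
    """Return True unless `view` appears in the node's docs.hide list.

    Hypothetical helper: a node with no docs config is visible everywhere.
    """
    if view not in KNOWN_VIEWS:
        raise ValueError(f"unknown docs view: {view}")
    hide = (node.get("docs") or {}).get("hide") or []
    return view not in hide


# An intermediate model hidden from the index but kept in the lineage graph.
intermediate = {"name": "b", "docs": {"hide": ["data-dictionary"]}}

print(is_visible(intermediate, "data-dictionary"))  # False
print(is_visible(intermediate, "lineage"))  # True
```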

drewbanin commented 4 years ago

@mike-weinberg I gave this broader concept some thought recently, and here's where I ended up: a single list of show/hide configs is probably too simplistic for the use case that you're describing. What we really need is a notion of "layers" (better word to come, hopefully) which control the view into the specified dataset. This would be less about showing/hiding specific models, and more about describing "Here's the appropriate view of the docs for user persona X/Y/Z". That might include showing/hiding different nodes, rendering the file tree or database view (or both), and, in the future, showing things like ERDs for only the subset of "output" models meant to be consumed directly.

I think that's broader than the scope of the implementation in #2107, but it's definitely something we'd be interested in supporting more completely in the future!

drewbanin commented 4 years ago

the second half of this will be implemented in https://github.com/fishtown-analytics/dbt-docs/issues/68

mike-weinberg commented 4 years ago

@drewbanin I think that with https://github.com/fishtown-analytics/dbt/issues/1671#issuecomment-584412915 you are correctly interpreting my intent. Complex scenarios may arise, and non-trivial configuration options may be required. I'm definitely supportive of the direction you're leaning.

I do wonder if the idea of personas starts to couple DBT core with the paid product, since the configuration you suggest might only be relevant if the docs server supports authentication and contains a global list of users and their roles. If that's the case, it could be worth considering implementing this twice, once in a simple way, and a second time for a future independent dbt-docs-server service, so that dbt-core doesn't expand in too many directions. The "old style" docs could be supported by the docs server, but superseded by any configurations on the server which would conflict.

drewbanin commented 4 years ago

fixed in #2179

itajaja commented 2 years ago

does this work with sources? can't seem to make it work

The-Pavel commented 1 year ago

does this work with sources? can't seem to make it work

Indeed, looks like this is not implemented for sources - any update on this in the works? 🙏

remilepriol commented 1 year ago

To add another entry on this issue that has been closed for 3 years, I would also benefit from hiding sources from the docs. It would be helpful to our average users as they simply confuse sources and staging models at this point.

paulplayer commented 1 year ago

does this work with sources? can't seem to make it work

Indeed, looks like this is not implemented for sources - any update on this in the works? 🙏

We have the same question in our project. Is there any way to hide sources in dbt docs?