Future art for exposures

jtcohen6 commented 3 years ago

Describe the feature

What other properties should exposures get?

[x] tags
[x] meta
- we do want some traditional meta fields (e.g. owner) to be required or top-level
- still could be nice as a catch-all for structured key-value properties users would want to define, beyond what'd be available via tags or description
[ ] new type options:
- "reverse pipeline", e.g. a census sync
- users supplying their own string types
[ ] new maturity options
- higher than high? "mission-critical"
- ...

Should exposures be ref-able?

exposures that depend on other exposures: one exposure for each Mode query / Looker view, one exposure for the dashboard that depends on those queries / views
models that depend on exposures: modeled input to data science --> data science as exposure --> modeled output
- what exactly would ref('my_exposure') return?

Describe alternatives you've considered

We're likely to keep these bare-bones for a little while. I'm still curious to hear what community members want!

Who will this benefit?

Users of the exposures resource type, which is new in v0.18.1

dwallace0723 commented 3 years ago

Is there any reason that type shouldn't just be an arbitrary string? Are y'all planning on having explicit functionality for the existing type values in the future?

dwallace0723 commented 3 years ago

Also - would be nice to have a field for URLs linking to the exposure itself.

jtcohen6 commented 3 years ago

Is there any reason that type shouldn't just be an arbitrary string? Are y'all planning on having explicit functionality for the existing type values in the future?

Yes, we're hard at work on some product features that tie into exposures, and thinking that we may want them to look or behave differently for different exposure types. I'd rather keep it structured for now and open it up later on if there's a lot of demand and variance, and few reasons to keep it limited.

Also - would be nice to have a field for URLs linking to the exposure itself.

There totally is, and I missed it when documenting! This is on me, I'll update that now (https://github.com/fishtown-analytics/docs.getdbt.com/pull/419).

amin-nejad commented 3 years ago

It would be nice if there was a commands field (similar to tox) where, if applicable, you could enter a command that would check the 'status' (whatever that may be) of the exposure after running the dbt models that it depends on. E.g. pytest, curl example.com/api/this-depends-on-dbt, etc.

I suppose this can already be done by something like tox anyway but I think it could be worth adding to dbt if other users are keen

jpau commented 3 years ago

models that depend on exposures: modeled input to data science --> data science as exposure --> modeled output

what exactly would ref('my_exposure') return?

@jtcohen6 An idea:

expose a source that backreferences it through a source's (new) depends_on attribute
to do this, use source(my_source) rather than ref(my_exposure)

Elaborated on below.

Thoughts?

Problem (re)statement

A DAG needs to tell us:

Execution order
Lineage

But take this current dbt DAG, and let exposure1 generate source2:

source1 → model1 → exposure1
source2 → model2 → final_exposure

In this example, the DAG is unhelpful because it doesn't tell us what we need to know (lineage, and execution order).

The idea

Instead ref(my_exposure) could expose a source that backreferences it through a depends_on attribute.

Ideally the DAG would look something like: source1 → model1 → exposure1 → source2 → model2 → final_exposure

And in sources.yml:

sources:
  - name: my_source_name
    depends_on:
      - ref('my_exposure')
    tables:
      - ...

And more probably source(my_source) rather than ref(my_exposure).
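As a rough illustration of the idea above (this is not dbt functionality), the following Python sketch treats the proposed depends_on edge from source2 back to exposure1 as an ordinary graph edge and derives an execution order with the standard library's graphlib. The node names come from the example DAG; the dictionary layout is an assumption made purely for illustration.

```python
from graphlib import TopologicalSorter

# Each key maps a node to the set of nodes it depends on (its predecessors).
# The source2 -> exposure1 edge is the hypothetical `depends_on` proposed above.
dag = {
    "model1": {"source1"},
    "exposure1": {"model1"},
    "source2": {"exposure1"},
    "model2": {"source2"},
    "final_exposure": {"model2"},
}

# With the extra edge the graph is a single chain, so both the lineage
# and the execution order can be read off one topological sort.
order = list(TopologicalSorter(dag).static_order())
print(" -> ".join(order))
```

Without the source2 entry, the two halves of the graph are disconnected and nothing constrains source2 to come after exposure1.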

Hmm

Snowflake's external functions are perhaps examples of exposures (if they ingest from dbt models) that do not add to a source directly, and so would not be correctly captured in the above. I don't think this is a big issue with the above?

arniwesth commented 3 years ago

We build views in Looker on top of dbt models, and we would like to use extensions to indicate this. We could use type:application, but ideally type should be extended (eg bi_layer, look_ml etc) or as otherwise suggested type should be user-defined.

aaronsteers commented 3 years ago

Couple questions:

Could exposures themselves have registered exposures? I'd like to model in the DAG an external cube (basically a caching layer which also applies drilling and drag-and-drop logic for analytical scenarios) when the external cube consumes a set of models - and the cube also supports a set of reports which directly consume the cube. When exploring the dependencies of a report-type exposure, we'd want to see if a cube supports it and also that the cube in question is supported by n specific tables.
Related to the above, can we have an exposure "type" for cache, cube, or composite dataset? Of the current options, "application" is probably the closest, but none exactly seem to fit the description of a cube-like caching layer - which itself is a source for dashboards and analyses.

aaronsteers commented 3 years ago

@arniwesth - re:

We autogenerate views in Looker based on dbt models, and we would like to use extensions to indicate this. We could use type:application, but ideally type should be extended (eg bi_layer, look_ml etc) or as otherwise suggested type should be user-defined.

I feel the same for the "cube" or "xmla dataset" layer of the Power BI stack. I like the idea of keeping type as enumerated (perhaps with an other option, but that's another topic) and I'm curious if there's a single term which could apply to both Looker BI and the Power BI stacks.

I like your idea of bi_layer and wonder if something like semantic model, bi model, cube, dataset, or something along these lines would fit to the Looker model and Power BI and other tools. (I think in Tableau the analogous concept of "cached-middle-layer-with-applied-biz-logic" is "Tableau Extract" and "Tableau Shared Data Source".)

aaronsteers commented 3 years ago

@jpau - Re your inquiry:

But take this current dbt DAG, and let exposure1 generate source2:

source1 → model1 → exposure1
source2 → model2 → final_exposure

In this example, the DAG is unhelpful because it doesn't tell us what we need to know (lineage, and execution order).

I'm not sure that this pattern can deterministically assure us that it is actually a DAG, or more specifically that the graph is acyclical. It looks like this approach would either have to allow circular loops (no longer a DAG) or else the initial run is not deterministic without directly synchronizing with the exposure.

Is it necessary for the consumption of the output from the exposure to be in the same logical DBT project? Or could you have a DBT project1 which generates input needed by the exposure and a DBT project2 which can safely assume your exposure? Breaking it into two projects seems like the best way to ensure the project can (a) benefit from static analysis and (b) make sure circular loops are not introduced (inadvertently or intentionally).

@jtcohen6 - Re your initial topic inquiry:

Should exposures be ref-able?

exposures that depend on other exposures: one exposure for each Mode query / Looker view, one exposure for the dashboard that depends on those queries / views

models that depend on exposures: modeled input to data science --> data science as exposure --> modeled output
- what exactly would ref('my_exposure') return?

Per my comments above, I'd personally prefer DBT prioritize the static analysis and definitely-a-DAG properties which are critical to system stability. That said, I would love for a project to be able to reference another project's exposures, and perhaps that could even be performed as adding a reference pointer on a source definition like created_by: ref('other-project/exposure1'). I also would love for exposures to be able to reference other exposures - as in the example of a cube or bi_model exposure also having one or more of its own dashboard exposures which shows how end users actually consume the data.

Thoughts?

jpau commented 3 years ago

Thanks @aaronsteers

I'm not sure that this pattern can deterministically assure us that it is actually a DAG, or more specifically that the graph is acyclical.

I'm curious about this. This isn't obvious to me. Do you mind providing an example of where this style of dependency differs to others, in such a way that it creates a cycle?

Is it necessary for the consumption of the output from the exposure to be in the same logical DBT project? Or could you have a DBT project1 which generates input needed by the exposure and a DBT project2 which can safely assume your exposure? Breaking it into two projects seems like the best way to ensure the project can (a) benefit from static analysis and (b) make sure circular loops are not introduced (inadvertently or intentionally).

Interesting idea! In the given example, absolutely. But I don't think this is practical if you have several exposure-style dependencies spread throughout a modest DAG, such as

ML transforms. For marketing alone one might score each customer on churn propensity, attribution, segmentation, price sensitivity (and so forth), with a different exposure for each.
External data calls. You may clean, deduplicate, and transform raw data before passing that through to external APIs to get further data.

Splitting the project once is okay. I'd be concerned about splitting it twice. I think anything beyond that would be opaque.

aaronsteers commented 3 years ago

I'm not sure that this pattern can deterministically assure us that it is actually a DAG, or more specifically that the graph is acyclical.

I'm curious about this. This isn't obvious to me. Do you mind providing an example of where this style of dependency differs to others, in such a way that it creates a cycle?

Sorry - I realize I wasn't very clear. So there are basically two scenarios I can think of where an exposure feeds back into project models of the same project. Both certainly make sense, and I think I've seen both in real-world (non-dbt) projects.

Taking the example of an ml-type exposure:

Scenario one: "Necessary pause" - the output of the ml model is the input to a source or a model which cannot be built without the completion of the exposure.
- This is not a circular dependency, but it does make static analysis difficult or impossible, since the external system is a source as well as a byproduct. In order to fully document and test a project, dbt must "pause and poll" until the external system completes the exposure processing.
- If parameterization across exposures is not identical to that of the main dbt project, there's also a likelihood of contaminating other environments. (For example: I build a unique dbt with distinct schema names for each CI/CD build number but my exposure does not have the same dynamic parameterization capabilities. This creates leakage or side effects across environments if the exposure environment is not 1:1 with my dynamically created dbt environments.)
Scenario two: "Updates own source" - the output of the ml model is a required input for one or more models which in turn (directly or indirectly) feed into models that the exposure depends on.
- This example is clearly a circular dependency, since subsequent executions of the exposure will (at least have the potential to) alter the input which feeds back into the model.

Notes:

In the scenario 1 option (which is to say the project has to pause and wait for exposure to complete) dbt is not able to invoke or document the end-to-end flow until and unless the exposure has been executed. Anything downstream from the exposure would have to pause and wait for the exposure to complete externally - which represents an orchestration and scheduling challenge.
- By splitting upstream and downstream into separate projects, we remove external state tracking as a challenge that DBT tracks. Essentially project1 is fully self-sufficient, and project2 takes as a given that project1 and the ml exposure have already both completed prior to its execution.
As of today, I don't think Scenario 2 would be supported within DBT, since circular dependencies are by definition not allowed in a DAG.
- You could technically still implement this in DBT today (meaning, the exposures and source could be the same thing without DBT knowing about it) but you would have a chicken-and-egg problem on your first time executing in a new environment.
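To make the difference between the two scenarios concrete, here is a minimal Python sketch that checks each one for a cycle using the standard library's graphlib. The node names (features_model, ml_exposure, scores_source, scored_customers) are hypothetical, and this is only an illustration of the graph property, not how dbt builds or validates its own DAG.

```python
from graphlib import CycleError, TopologicalSorter


def find_cycle(graph):
    """Return the list of nodes forming a cycle, or None if the graph is acyclic."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError as err:
        # The second element of args holds the detected cycle.
        return err.args[1]
    return None


# Scenario one ("necessary pause"): the exposure's output feeds new,
# strictly downstream nodes. Keys depend on the nodes in their value sets.
scenario_one = {
    "ml_exposure": {"features_model"},
    "scores_source": {"ml_exposure"},
    "scored_customers": {"scores_source"},
}

# Scenario two ("updates own source"): the exposure's output feeds back
# into a model that the exposure itself depends on.
scenario_two = {
    "ml_exposure": {"features_model"},
    "scores_source": {"ml_exposure"},
    "features_model": {"scores_source"},
}

print(find_cycle(scenario_one))
print(find_cycle(scenario_two))
```

Scenario one passes the check even though it still needs external orchestration; scenario two fails it, which is the chicken-and-egg problem described above.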

Possibility to define exposures as operations:

Another option which resolves the issues with the "pausing and polling" needed by scenario 1 would be to pair the exposure feature with a predefined dbt operation feature. In that case, dbt would be able to directly call and execute the build of the exposure (in our example the ml model) and could therefore still execute an end-to-end pipeline without having to stop and block on an externally defined and externally orchestrated process.

@jpau - What do you think about this?

jtcohen6 commented 3 years ago

@aaronsteers I agree with your separation into the two scenarios, and the fact that we'd be more interested in building out supportive tooling for the first scenario. I just want to bring into conversation another recent issue (https://github.com/fishtown-analytics/dbt/issues/2894) on the subject of "3rd-party operations" about which dbt could or should know. A potential workaround for the time being is to have a placeholder model that stands in for the external process, allowing the dbt DAG to reflect your actual one. Separating out the workflows and orchestrating through a third-party tool would be the most robust, but if you have access to things like Snowflake external functions and system$wait, it wouldn't be impossible for the model itself to trigger and wait for the ml model's completion.

fuchsst commented 3 years ago

At the moment exposures are purely a documentation feature. In most cases they refer to another system (like Tableau) that might need to get a notification to finalize/proceed with the pipeline (in case of Tableau, trigger the data extraction for a data source or trigger a task). This might be in a very generic way (as the use case to check the status, described above). Or more specific, like do a REST call (with separate attribute fields for url, parameter, body, header, authentication) - so basically a webhook...

tnightengale commented 3 years ago

We autogenerate views in Looker based on dbt models, and we would like to use extensions to indicate this. We could use type:application, but ideally type should be extended (eg bi_layer, look_ml etc) or as otherwise suggested type should be user-defined.

Hey @arniwesth ! I'm thinking about doing the same (hence why I was crawling exposure-related threads). Would you be open to sharing your experience with this approach and perhaps collaborating on a dbt package together?

arniwesth commented 3 years ago

Hey @tnightengale. Unfortunately, I did not formulate this very well. What I actually meant was that we just use Looker's "Create View From Table" to create models from dbt-generated views/tables. It would be awesome if it could actually be automated, but it's not something we have looked into. Sorry for the confusion - I updated my comment to better reflect what we actually do.

tomsej commented 3 years ago

I really need tags and meta in exposures. @jtcohen6 What's the status, can I start working on it?

joellabes commented 3 years ago

I just wrote this in Slack:

Can you check with whichever stakeholder depends on that report?

And then I realised that there was no way to work out which stakeholder depends on that report. Might be a useful counterpart to the existing owner property

jtcohen6 commented 3 years ago

I really need tags and meta in exposures. @jtcohen6 What's the status, can I start working on it?

@tomsej If this is something you're interested in contributing, I'm all for it! The addition of meta and tags feels pretty uncontroversial.

juangesino commented 3 years ago

Would love to see datasource or dataset options for type. We use Tableau and this would allow us to link our dbt models with Data Sources.

aaronsteers commented 3 years ago

@juangesino - Thanks for your thoughts here - great to have a Tableau perspective. I like both of those (datasource and dataset) as well! Seems they are versatile and better fit the Tableau and Power BI use cases when there's an intermediate caching/cube layer between the database and the report.

jtalmi commented 3 years ago

from a slack conversation: @jtcohen6 came up with an idea to enable tracking of removed columns or changing data types in models that feed into exposures:

Could we leverage the catalog.json artifact for this? It contains the set of columns in every dbt (non-ephemeral) model. If dbt could compare the catalog from a prod run with the catalog from a CI run—assuming you had a docs generate step in both—it could check all models/sources that are 1+exposure:*, and raise a warning if any columns have been removed, or even if their data type has changed...
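A minimal Python sketch of that comparison, under stated assumptions: each catalog is taken to be a dict with a top-level "nodes" key, mapping unique IDs to a "columns" dict whose entries carry a "type" field. The set of exposed node IDs is assumed to have been computed separately (for example from a 1+exposure:* selection), and only "nodes" is checked here, so sources would need the same treatment. The function name and sample data are hypothetical.

```python
def diff_catalogs(prod_catalog, ci_catalog, exposed_ids):
    """Warn about removed or retyped columns on nodes that feed exposures."""
    warnings = []
    for unique_id in sorted(exposed_ids):
        old_cols = prod_catalog.get("nodes", {}).get(unique_id, {}).get("columns", {})
        new_cols = ci_catalog.get("nodes", {}).get(unique_id, {}).get("columns", {})
        for name, old in old_cols.items():
            new = new_cols.get(name)
            if new is None:
                warnings.append(f"{unique_id}: column {name} was removed")
            elif new.get("type") != old.get("type"):
                warnings.append(
                    f"{unique_id}: column {name} changed type "
                    f"{old.get('type')} -> {new.get('type')}"
                )
    return warnings


# Hypothetical catalogs; real ones would be loaded from disk with json.load().
prod = {"nodes": {"model.shop.orders": {"columns": {
    "order_id": {"type": "integer"},
    "amount": {"type": "numeric"},
    "coupon": {"type": "text"},
}}}}
ci = {"nodes": {"model.shop.orders": {"columns": {
    "order_id": {"type": "integer"},
    "amount": {"type": "text"},
}}}}

catalog_warnings = diff_catalogs(prod, ci, {"model.shop.orders"})
for line in catalog_warnings:
    print(line)
```

Columns that only exist in the CI catalog are ignored on purpose, since adding a column should not break an existing exposure.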

llarbodiere commented 3 years ago

Hello! It could be nice to have an 'exposure report' that would display, for every exposure, the status (whether the source DBs are passing), the freshness, and some tests. Thanks :)

buremba commented 3 years ago

@juangesino I believe that #3404 can also help you define the Tableau-relevant attributes under the meta property.

kgeis commented 3 years ago

I don't like that the owner email is required. In my case, most of our consumers are internal to our team, and I only want to document the owner when it is external.

I'd like some granularity in the type/name. For an application, I might want (structured) the name of the application and the name of the use within the application. I have an exposure where I flattened this metadata into the unstructured name ingester__sponsors, meaning that the application name is "ingester" and, within the Ingester application, the area is "sponsors". I would like to be able to select all of the Ingester exposures, but maybe that's best done with tags or meta.

sarahrehman commented 3 years ago

Is there any update on when more options for 'type' will be available? I'm particularly looking for type: pipeline to become available.

owlas commented 3 years ago

@jtcohen6

users supplying their own string types

What was the idea behind having a fixed enumeration of possible strings here? AFAIK there is nothing in dbt that relies on these string values. But other tools using dbt exposure data would benefit from being able to specify their own types of exposures.

Would it be acceptable to open a PR and make this a plain string type without an enum

EDIT: just saw your previous comment:

Yes, we're hard at work on some product features that tie into exposures, and thinking that we may want them to look or behave differently for different exposure types. I'd rather keep it structured for now and open it up later on if there's a lot of demand and variance, and few reasons to keep it limited.

Is this still relevant today?

jtcohen6 commented 3 years ago

Thanks for the great thoughts in this thread, everyone! It's been ten months since launch—the first experimental stake in the ground, the first metadata-forward dbt feature—and together we're making exposures happen.

In one of my comments above, I was making oblique reference to the very early stages of exposure status tiles, a dbt Cloud feature that my colleagues on the Metadata team just pushed into GA.

@llarbodiere Per your comment above, you may want to check that feature out. It sounds exactly like the status check you were asking for in May.
@fuchsst You might be interested, too, in the underlying API, since you were asking about an API/webhook that could check exposure status and thereby trigger downstream processes (e.g. in Tableau).

Some threads from above that I want to briefly pull on:

Check for changes

Using catalog.json to detect columns that have changed, since some previous catalog.json / docs generate, and present those findings in the context of "exposed" resources (1+exposure:*): I still really like this idea! This could be another --state feature in dbt, though at some point, we're not actually executing resources—we're really just diffing two JSON files / metadata inputs. This feels more like a "post-dbt" process. The dbt Cloud metadata API could be a really useful tool here, when it includes column-level info...

Selection

Selecting exposures on the basis of type/tags/meta (@kgeis): I've been thinking about this too! As of v0.20, and the addition of tags support in exposures, this is possible with the tag selection method. As a follow-on to some of the changes coming in #3616, which reconcile "configs" and "properties" in dbt, I think we could reclassify some exposure properties as configs, and thereby unlock things like:

dbt run -m +config.type:application  # i.e. exposures with type:application
dbt test -m +config.maturity:high

Relax required/limited properties

type (lots of folks above, most recently @owlas): I do think relaxing this is the right move. We weren't sure exactly how we wanted to be using this: I had imagined the possibility that our team would want to build different integration points for notebooks vs. apps vs. ad hoc analyses. By defining a strict set of strongly-suggested options, it also gave me the opportunity to make the case that these are for more than just BI dashboards ;)

Today, we mainly use this value for grouping/categorizing exposures, in places like the dbt-docs site. As I mentioned above, I could also see it powering a selection method. So there is some desire to keep these standardized and validated—tags and meta are now your go-to place for totally free-form input—but let's make it so dbt-core developers aren't blockers to the really cool things you all want to do.

A couple of ideas:

Users get to define their own set of valid exposure types somewhere (in dbt_project.yml?). Then, each defined exposure has to conform to one of those types. The idea here is to prevent someone from accidentally merging code with an exposure type defined as dshboard, or from putting classifier when we'd previously agreed on the more-generic ml.
It's a free-form input, we just do smarter things in the places where we bucket/categorize exposures. If an exposure is in a type category by itself, we should bucket it under Custom or Other instead.
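Both ideas can be sketched in a few lines of Python. This is only an illustration: the exposure dicts, the function names, and the notion of a project-level allowed set are assumptions, not existing dbt behavior. The allowed set below reuses the current built-in type values.

```python
from collections import Counter

# Hypothetical project-level list of valid exposure types (idea 1).
ALLOWED_TYPES = frozenset({"dashboard", "notebook", "analysis", "ml", "application"})


def invalid_types(exposures, allowed=ALLOWED_TYPES):
    """Idea 1: map each offending exposure name to its unrecognized type."""
    return {e["name"]: e["type"] for e in exposures if e["type"] not in allowed}


def bucket_types(exposures):
    """Idea 2: free-form types, but a type used by only one exposure shows as Other."""
    counts = Counter(e["type"] for e in exposures)
    return {
        e["name"]: e["type"] if counts[e["type"]] > 1 else "Other"
        for e in exposures
    }


exposures = [
    {"name": "weekly_kpis", "type": "dshboard"},  # typo that idea 1 should catch
    {"name": "revenue", "type": "dashboard"},
    {"name": "retention", "type": "dashboard"},
    {"name": "churn_scores", "type": "ml"},
]

print(invalid_types(exposures))
print(bucket_types(exposures))
```

The first approach fails loudly at parse time; the second never rejects anything and only changes how exposures are grouped for display.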

Curious to hear your thoughts here!

owner.email (@kgeis): This is one I feel more strongly about keeping! It feels like a natural next step, after having the ability to reliably report on exposure status, is a system to notify the right person when upstream sources/models/tests of that exposure have failed—perhaps even modulating the severity/frequency of that notification based on its maturity. If the owner email is hello@example.com, that's a pretty good indicator that the exposure is not actively maintained, which is also useful information for a viewer :)

maturity: Folks don't seem to mind only having high/medium/low. As numbers go, three is a pretty good one.

david-kubecka commented 2 years ago

@jtcohen6 Re-opening up this (well, still open) issue :-)

What is your point/suggestion on the external actions/hooks being defined for exposures and triggered by dbt? I think this was already mentioned a couple of times in this thread, e.g. in the context of testing reports/dashboards. Specifically, in GoodData we would like to surface customers' reports as dbt exposures and perform semantic tests when underlying models change, ideally only for the reports which are affected by the changes. For this aspect we would like to utilize functionality similar to the current "state" method. Is there any progress in thinking about this functionality (dbt running external commands, i.e. python scripts)?

Another way exposures could be made more useful in the BI tools lineage use case is tracking of broken metrics/reports upon destructive operations in the database (table/column removals). Currently one can specify models in the depends_on clause and dbt refuses to run if some of the models no longer exist (essentially because of a broken DAG). It would be great if one could also specify the dependent columns. Also, sometimes it might actually be ok to drop some DB tables/columns, and in this case one would like to assess the "blast radius", i.e. the affected exposures, before deploying such a change to prod. I'm thinking here about something like dbt test --select exposures.
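The "blast radius" check could be approximated outside dbt today. The Python sketch below assumes a mapping from each node's unique ID to its direct parents (similar in spirit to the parent_map found in dbt's manifest.json artifact) and assumes exposure IDs start with "exposure."; the function name and sample IDs are hypothetical.

```python
def affected_exposures(parent_map, changed):
    """Return exposures that have any changed node somewhere upstream."""
    affected = []
    for node_id in sorted(parent_map):
        if not node_id.startswith("exposure."):
            continue
        # Depth-first walk over everything upstream of this exposure.
        stack, seen = list(parent_map[node_id]), set()
        while stack:
            parent = stack.pop()
            if parent in seen:
                continue
            seen.add(parent)
            stack.extend(parent_map.get(parent, []))
        if seen & changed:
            affected.append(node_id)
    return affected


# Hypothetical lineage: two exposures, each fed by its own staging model.
parent_map = {
    "exposure.shop.orders_dashboard": ["model.shop.orders"],
    "exposure.shop.churn_report": ["model.shop.customers"],
    "model.shop.orders": ["model.shop.stg_orders"],
    "model.shop.customers": ["model.shop.stg_customers"],
    "model.shop.stg_orders": [],
    "model.shop.stg_customers": [],
}

blast_radius = affected_exposures(parent_map, {"model.shop.stg_orders"})
print(blast_radius)
```

This works at model granularity only; column-level dependencies, as requested above, would need information that the exposure definition does not currently carry.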

Related to that (and also mentioned above) are the dependencies between exposures. The typical BI tool use cases would be quite simple, e.g. metrics -> reports -> dashboards. Is this something you still consider desirable?

Apologies if my questions/suggestions are too naive, as I'm still familiarizing myself with dbt concepts :-)

hermandr commented 2 years ago

@jtcohen6 Looking forward to seeing the feature of defining exposure types. This will be a cool feature. I support this way of defining types:

Users get to define their own set of valid exposure types somewhere (in dbt_project.yml?). Then, each defined exposure has to conform to one of those types. The idea here is to prevent someone from accidentally merging code with an exposure type defined as dshboard, or from putting classifier when we'd previously agreed on the more-generic ml.

naitchman commented 2 years ago

Has there been any update in making exposures ref-able or is there a plan in a future update? This is a feature I think would be very helpful in my team so we can visualize how all of our reports are made.

akorniichuk commented 2 years ago

@jpau - Re your inquiry:

Should exposures be ref-able?

exposures that depend on other exposures: one exposure for each Mode query / Looker view, one exposure for the dashboard that depends on those queries / views

models that depend on exposures: modeled input to data science --> data science as exposure --> modeled output
- what exactly would ref('my_exposure') return?

Per my comments above, I'd personally prefer DBT to prioritize the static analysis and definitely-a-DAG properties which are critical to system stability. That said, I would love for the project to be able to reference another project's exposures, and perhaps that could even be performed as adding a reference pointer on a source definition like created_by: ref('other-project/exposure1'). I also would love for exposures to be able to reference other exposures - as in the example of a cube or bi_model exposure also having one or more of its own dashboard exposures which shows how end-users actually consume the data.

Thoughts?

+1 !!!! That's exactly what I am looking for right now. Even though I am coming from one, service-specific, tool such as Looker, I think there is also a need for this elsewhere and, if designed properly, can be universal.

Looker specific usage: there is a functionality where you can extend explores with other explores.

Ex:

explore: orders_and_customers {
    description: "This explore provides basic insights into customer base etc"
    view_name: orders
    join: customers {
        relationship: one_to_one  # one order per customer
        type: left_outer
        sql_on: ${orders.customer_id} = ${customers.customer_id} ;;
    }
}
explore: tracking_tools {
    extends: [orders_and_customers]
    description: "This explore allows to deep dive into how customers are using the tracking services for the orders"
    join: tracking_info {
        relationship: one_to_many
        type: left_outer
        sql_on: ${orders.order_id} = ${tracking_info.order_id} ;;
    }
}

So, the example is basic but hopefully explains the usage. Say I don't want the tracking links info to be visible to all users, only to, say, PMs, so I have 2 explores where one extends/enhances the already existing one(s), and the extension can go on indefinitely, same as dbt lineage. In simple terms, orders_and_customers shows only 2 views, and tracking_tools shows all 3 views.

Since we have the ability to bring down the Documentation to dbt layer, it would be extremely helpful if we could reference exposures. Because, in this example, one exposure corresponds to one explore:

    - name: Orders and Customers
      type: dashboard
      description: >
            This explore provides basic insights into customer base etc
      depends_on:
        - ref('orders')
        - ref('customers')

    - name: tracking_tools
      type: dashboard
      description: >
            This explore allows to deep dive into how customers are using the tracking services for the orders
      depends_on:
        - ref('Orders and Customers')
        - ref('tracking_info')

I hope it's clear enough and, PLEEEEEASE, give us this functionality🥸

Let us know if you move this functionality up in priority and the approximate deadline 🙏

llarbodiere commented 2 years ago

Additional use case at Ludia/JamCity: We have 2 data stacks with interdependencies. Some of our models are used downstream by tables and dashboards/reports.

We would like to define those downstream tables as exposure or external tables to reflect it in our DAG. But also have the possibility to put exposures (tableau dashboards) depending on other exposures (the downstream tables) so we can rebuild the DAG for the most important reports.

Thanks a lot, Lucas

dlawrences commented 2 years ago

I was looking into presenting some analysis object (https://docs.getdbt.com/docs/building-a-dbt-project/analyses) as an exposure. It is slightly misleading that you cannot (or maybe I couldn't figure it out since it's 10:30 PM) considering the fact that exposures do have a type that is called analysis (based on the above, it doesn't sound like these named members have any underlying logic).

The use case is one in which we are trying to present & document queries that drill across transactional fact tables to pull in together different data and aggregate/whatever. This qualifies the notion of an analysis, but we want it to be better managed, described and potentially rendered into dbt docs (we like the freedom of analysis objects, but crave the metadata of exposure objects).

jtcohen6 commented 2 years ago

I'd still be excited to make some of the changes mentioned here! In particular:

Balancing customization + governance for exposure types
Making exposures ref-able, and more neatly paired with other resource types (an "exposed" analysis, an "exposed" model)
Some interesting potential overlap with #5073...?

I'm going to convert this to a GitHub discussion, since that's what it's rightly been all along :)

dbt-labs / dbt-core