Closed jtcohen6 closed 2 years ago
Is there any reason that type shouldn't just be an arbitrary string? Are y'all planning on having explicit functionality for the existing type values in the future?
Also - would be nice to have a field for URLs linking to the exposure itself.
Is there any reason that type shouldn't just be an arbitrary string? Are y'all planning on having explicit functionality for the existing type values in the future?
Yes, we're hard at work on some product features that tie into exposures, and thinking that we may want them to look or behave differently for different exposure types. I'd rather keep it structured for now and open it up later on if there's a lot of demand and variance, and few reasons to keep it limited.
Also - would be nice to have a field for URLs linking to the exposure itself.
There totally is, and I missed it when documenting! This is on me, I'll update that now (https://github.com/fishtown-analytics/docs.getdbt.com/pull/419).
It would be nice if there was a commands field (similar to tox) where, if applicable, you could enter a command that would check the 'status' (whatever that may be) of the exposure after running the dbt models that it depends on. E.g. pytest, curl example.com/api/this-depends-on-dbt, etc.
I suppose this can already be done by something like tox anyway, but I think it could be worth adding to dbt if other users are keen.
- models that depend on exposures: modeled input to data science --> data science as exposure --> modeled output
- what exactly would ref('my_exposure') return?
@jtcohen6 An idea:
- ref(my_exposure) could expose a source that backreferences it through a source's (new) depends_on attribute
- more probably, use source(my_source) rather than ref(my_exposure)
Elaborated on below.
Thoughts?
A DAG needs to tell us:
But take this current dbt DAG, and let exposure1 generate source2:
source1 → model1 → exposure1
source2 → model2 → final_exposure
In this example, the DAG is unhelpful because it doesn't tell us what we need to know (lineage, and execution order).
Instead, ref(my_exposure) could expose a source that backreferences it through a depends_on attribute.
Ideally the DAG would look something like:
source1 → model1 → exposure1 → source2 → model2 → final_exposure
And in sources.yml:
sources:
  - name: my_source_name
    depends_on:
      - ref('my_exposure')
    tables:
      - ...
And more probably source(my_source) rather than ref(my_exposure).
exposures (if they ingest from dbt models) that do not add to a source directly, and so would not be correctly captured in the above. I don't think this is a big issue with the above?
We build views in Looker on top of dbt models, and we would like to use extensions to indicate this. We could use type:application, but ideally type should be extended (e.g. bi_layer, look_ml, etc.) or, as otherwise suggested, type should be user-defined.
Couple questions:
cache, cube, or composite dataset? Of the current options, "application" is probably the closest, but none exactly seem to fit the description of a cube-like caching layer - which itself is a source for dashboards and analyses.
@arniwesth - re:
We autogenerate views in Looker based on dbt models, and we would like to use extensions to indicate this. We could use type:application, but ideally type should be extended (e.g. bi_layer, look_ml, etc.) or, as otherwise suggested, type should be user-defined.
I feel the same for the "cube" or "xmla dataset" layer of the Power BI stack. I like the idea of keeping type as enumerated (perhaps with an other option, but that's another topic) and I'm curious if there's a single term which could apply to both the Looker and Power BI stacks.
I like your idea of bi_layer and wonder if something like semantic model, bi model, cube, dataset, or something along these lines would fit the Looker model, Power BI, and other tools. (I think in Tableau the analogous concept of "cached-middle-layer-with-applied-biz-logic" is "Tableau Extract" and "Tableau Shared Data Source".)
@jpau - Re your inquiry:
But take this current dbt DAG, and let exposure1 generate source2:
source1 → model1 → exposure1
source2 → model2 → final_exposure
In this example, the DAG is unhelpful because it doesn't tell us what we need to know (lineage, and execution order).
I'm not sure that this pattern can deterministically assure us that it is actually a DAG, or more specifically that the graph is acyclical. It looks like this approach would either have to allow circular loops (no longer a DAG) or else the initial run is not deterministic without directly synchronizing with the exposure.
Is it necessary for the consumption of the output from the exposure to be in the same logical DBT project? Or could you have a DBT project1 which generates input needed by the exposure and a DBT project2 which can safely assume your exposure? Breaking it into two projects seems like the best way to ensure the project can (a) benefit from static analysis and (b) make sure circular loops are not introduced (inadvertently or intentionally).
@jtcohen6 - Re your initial topic inquiry:
Should exposures be ref-able?
- exposures that depend on other exposures: one exposure for each Mode query / Looker view, one exposure for the dashboard that depends on those queries / views
- models that depend on exposures: modeled input to data science --> data science as exposure --> modeled output
- what exactly would ref('my_exposure') return?
Per my comments above, I'd personally prefer DBT prioritize the static analysis and definitely-a-DAG properties which are critical to system stability. That said, I would love for a project to be able to reference another project's exposures, and perhaps that could even be done by adding a reference pointer on a source definition like created_by: ref('other-project/exposure1'). I also would love for exposures to be able to reference other exposures - as in the example of a cube or bi_model exposure also having one or more of its own dashboard exposures, which show how end users actually consume the data.
Thoughts?
Thanks @aaronsteers
I'm not sure that this pattern can deterministically assure us that it is actually a DAG, or more specifically that the graph is acyclical.
I'm curious about this. This isn't obvious to me. Do you mind providing an example of where this style of dependency differs to others, in such a way that it creates a cycle?
Is it necessary for the consumption of the output from the exposure to be in the same logical DBT project? Or could you have a DBT project1 which generates input needed by the exposure and a DBT project2 which can safely assume your exposure? Breaking it into two projects seems like the best way to ensure the project can (a) benefit from static analysis and (b) make sure circular loops are not introduced (inadvertently or intentionally).
Interesting idea! In the given example, absolutely. But I don't think this is practical if you have several exposure-style dependencies spread throughout a modest DAG, such as
exposure for each.
Splitting the project once is okay. I'd be concerned about splitting it twice. I think anything beyond that would be opaque.
I'm not sure that this pattern can deterministically assure us that it is actually a DAG, or more specifically that the graph is acyclical.
I'm curious about this. This isn't obvious to me. Do you mind providing an example of where this style of dependency differs to others, in such a way that it creates a cycle?
Sorry - I realize I wasn't very clear. So there are basically two scenarios I can think of where an exposure feeds back into project models of the same project. Both certainly make sense, and I think I've seen both in real-world (non-dbt) projects.
Taking the example of an ml-type exposure:
1. "Necessary pause" - the output of the ml model is the input to a source or a model which cannot be built without the completion of the exposure.
2. "Updates own source" - the output of the ml model is a required input for one or more models which in turn (directly or indirectly) feed into models that the exposure depends on.
Notes:
- In the scenario 1 option (which is to say the project has to pause and wait for the exposure to complete), dbt is not able to invoke or document the end-to-end flow until and unless the exposure has been executed. Anything downstream from the exposure would have to pause and wait for the exposure to complete externally - which represents an orchestration and scheduling challenge.
- project1 is fully self-sufficient, and project2 takes for a given that project1 and the ml exposure have already both completed prior to its execution.
- Scenario 2 would not be supported within DBT, since circular dependencies are by definition not allowed in a DAG.
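To make the difference between the two scenarios concrete, here is a rough sketch (not dbt code) that treats each scenario as a plain dependency graph and asks Python's standard-library graphlib for a run order. The node names come from the examples in this thread; the execution_order helper and the dict layout are assumptions made purely for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(parents):
    """Return a valid run order, or None if the graph contains a cycle.

    `parents` maps each node to the set of nodes it depends on.
    """
    try:
        return list(TopologicalSorter(parents).static_order())
    except CycleError:
        return None


# Scenario 1 ("necessary pause"): the exposure's output feeds a new source.
scenario_1 = {
    "model1": {"source1"},
    "exposure1": {"model1"},
    "source2": {"exposure1"},
    "model2": {"source2"},
    "final_exposure": {"model2"},
}

# Scenario 2 ("updates own source"): the exposure's output feeds back into
# a source that the exposure itself (indirectly) depends on.
scenario_2 = {
    "model1": {"source1"},
    "exposure1": {"model1"},
    "source1": {"exposure1"},
}

print(execution_order(scenario_1))
print(execution_order(scenario_2))
```

Scenario 1 still sorts into a single chain, so the graph stays a DAG even though something outside dbt has to run in the middle of it. Scenario 2 has no valid order at all, which is why it cannot be expressed as a DAG.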
Possibility to define exposures as operations:
Another option which resolves the issues with the "pausing and polling" needed by scenario 1 would be to pair the exposure feature with a predefined dbt operation feature. In that case, dbt would be able to directly call and execute the build of the exposure (in our example the ml model) and could therefore still execute an end-to-end pipeline without having to stop and block on an externally defined and externally orchestrated process.
@jpau - What do you think about this?
@aaronsteers I agree with your separation into the two scenarios, and the fact that we'd be more interested in building out supportive tooling for the first scenario. I just want to bring into conversation another recent issue (https://github.com/fishtown-analytics/dbt/issues/2894) on the subject of "3rd-party operations" about which dbt could or should know. A potential workaround for the time being is to have a placeholder model that stands in for the external process, allowing the dbt DAG to reflect your actual one. Separating out the workflows and orchestrating through a third-party tool would be the most robust, but if you have access to things like Snowflake external functions and system$wait, it wouldn't be impossible for the model itself to trigger and wait for the ml model's completion.
At the moment exposures are purely a documentation feature. In most cases they refer to another system (like Tableau) that might need to get a notification to finalize/proceed with the pipeline (in case of Tableau, trigger the data extraction for a data source or trigger a task). This might be in a very generic way (as the use case to check the status, described above). Or more specific, like do a REST call (with separate attribute fields for url, parameter, body, header, authentication) - so basically a webhook...
We autogenerate views in Looker based on dbt models, and we would like to use extensions to indicate this. We could use type:application, but ideally type should be extended (e.g. bi_layer, look_ml, etc.) or, as otherwise suggested, type should be user-defined.
Hey @arniwesth ! I'm thinking about doing the same (hence why I was crawling exposure-related threads). Would you be open to sharing your experience with this approach and perhaps collaborating on a dbt package together?
Hey @tnightengale. Unfortunately, I did not formulate this very well. What I actually meant was that we just use Looker's "Create View From Table" to create models from dbt-generated views/tables. It would be awesome if it could actually be automated, but it's not something we have looked into. Sorry for the confusion - I updated my comment to better reflect what we actually do.
I really need tags and meta in exposures. @jtcohen6 What's the status, can I start working on it?
I just wrote this in Slack:
Can you check with whichever stakeholder depends on that report?
And then I realised that there was no way to work out which stakeholder depends on that report. Might be a useful counterpart to the existing owner property.
I really need tags and meta in exposures. @jtcohen6 What's the status, can I start working on it?
@tomsej If this is something you're interested in contributing, I'm all for it! The addition of meta and tags feels pretty uncontroversial.
Would love to see a datasource or dataset option for type. We use Tableau and this would allow us to link our dbt models with Data Sources.
@juangesino - Thanks for your thoughts here - great to have a Tableau perspective. I like both of those (datasource and dataset) as well! Seems they are versatile and better fit the Tableau and Power BI use cases when there's an intermediate caching/cube layer between the database and the report.
from a slack conversation: @jtcohen6 came up with an idea to enable tracking of removed columns or changing data types in models that feed into exposures:
Could we leverage the catalog.json artifact for this? It contains the set of columns in every dbt (non-ephemeral) model. If dbt could compare the catalog from a prod run with the catalog from a CI run—assuming you had a docs generate step in both—it could check all models/sources that are 1+exposure:*, and raise a warning if any columns have been removed, or even if their data type has changed...
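As a rough illustration of the catalog comparison idea above, the sketch below diffs the columns of selected nodes between two already-parsed catalog.json dicts. It is a "post-dbt" helper, not a dbt feature: the function name diff_catalog_columns is made up, it assumes the catalog layout of top-level nodes and sources mappings with a columns dict per entry, and it assumes the list of node ids upstream of exposures has been worked out separately.

```python
def diff_catalog_columns(prod_catalog, ci_catalog, node_ids):
    """Report columns that were removed or changed type between two catalogs.

    Both catalogs are dicts parsed from catalog.json. `node_ids` is the list
    of unique ids to check (assumed to be computed elsewhere).
    """

    def columns_of(catalog, node_id):
        # Models live under "nodes", sources under "sources" (assumed layout).
        entries = {**catalog.get("nodes", {}), **catalog.get("sources", {})}
        return entries.get(node_id, {}).get("columns", {})

    findings = []
    for node_id in node_ids:
        prod_cols = columns_of(prod_catalog, node_id)
        ci_cols = columns_of(ci_catalog, node_id)
        for name, col in prod_cols.items():
            if name not in ci_cols:
                findings.append((node_id, name, "removed"))
            elif ci_cols[name].get("type") != col.get("type"):
                change = f"type changed: {col.get('type')} -> {ci_cols[name].get('type')}"
                findings.append((node_id, name, change))
    return findings


# Tiny hand-written catalogs standing in for a prod run and a CI run.
prod = {"nodes": {"model.shop.orders": {"columns": {
    "id": {"type": "INT"}, "total": {"type": "FLOAT"}}}}}
ci = {"nodes": {"model.shop.orders": {"columns": {
    "id": {"type": "TEXT"}}}}}

for finding in diff_catalog_columns(prod, ci, ["model.shop.orders"]):
    print(finding)
```

Column names are compared exactly here; a real check would probably need to normalize case, since warehouses differ in how they report identifiers.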
Hello! It would be nice to have an 'exposure report' that displays, for every exposure, its status (whether the upstream sources are passing), the freshness, and some tests. Thanks :)
@juangesino I believe that #3404 can also help you define the Tableau-relevant attributes under the meta property.
I don't like that the owner email is required. In my case, most of our consumers are internal to our team, and I only want to document the owner when it is external.
I'd like some granularity in the type/name. For an application, I might want (structured) the name of the application and the name of the use within the application. I have an exposure where I unstructured these metadata into a name, ingester__sponsors, meaning that the application name is "ingester" and within the Ingester application, the area is "sponsors". I would like to be able to select all of the Ingester exposures, but maybe that's best done with tags or meta.
Is there any update on when more options for 'type' will be available? I'm particularly looking for type: pipeline to become available.
@jtcohen6
users supplying their own string types
What was the idea behind having a fixed enumeration of possible strings here? AFAIK there is nothing in dbt that relies on these string values. But other tools using dbt exposure data would benefit from being able to specify their own types of exposures.
Would it be acceptable to open a PR and make this a plain string type without an enum?
EDIT: just saw your previous comment:
Yes, we're hard at work on some product features that tie into exposures, and thinking that we may want them to look or behave differently for different exposure types. I'd rather keep it structured for now and open it up later on if there's a lot of demand and variance, and few reasons to keep it limited.
Is this still relevant today?
Thanks for the great thoughts in this thread, everyone! It's been ten months since launch—the first experimental stake in the ground, the first metadata-forward dbt feature—and together we're making exposures happen.
In one of my comments above, I was making oblique reference to the very early stages of exposure status tiles, a dbt Cloud feature that my colleagues on the Metadata team just pushed into GA.
Some threads from above that I want to briefly pull on:
Using catalog.json to detect columns that have changed, since some previous catalog.json / docs generate, and present those findings in the context of "exposed" resources (1+exposure:*): I still really like this idea! This could be another --state feature in dbt, though at some point, we're not actually executing resources - we're really just diffing two JSON files / metadata inputs. This feels more like a "post-dbt" process. The dbt Cloud metadata API could be a really useful tool here, when it includes column-level info...
Selecting exposures on the basis of type/tags/meta (@kgeis): I've been thinking about this too! As of v0.20, and the addition of tags support in exposures, this is possible with the tag selection method. As a follow-on to some of the changes coming in #3616, which reconcile "configs" and "properties" in dbt, I think we could reclassify some exposure properties as configs, and thereby unlock things like:
dbt run -m +config.type:application # i.e. exposures with type:application
dbt test -m +config.maturity:high
type (lots of folks above, most recently @owlas): I do think relaxing this is the right move. We weren't sure exactly how we wanted to be using this: I had imagined the possibility that our team would want to build different integration points for notebooks vs. apps vs. ad hoc analyses. By defining a strict set of strongly-suggested options, it also gave me the opportunity to make the case that these are for more than just BI dashboards ;)
Today, we mainly use this value for grouping/categorizing exposures, in places like the dbt-docs site. As I mentioned above, I could also see it powering a selection method. So there is some desire to keep these standardized and validated - tags and meta are now your go-to place for totally free-form input - but let's make it so dbt-core developers aren't blockers to the really cool things you all want to do.
A couple of ideas:
- Users get to define their own set of valid exposure types somewhere (in dbt_project.yml?). Then, each defined exposure has to conform to one of those types. The idea here is to prevent someone from accidentally merging code with an exposure type defined as dshboard, or from putting classifier when we'd previously agreed on the more-generic ml.
- type category by itself, we should bucket it under Custom or Other instead.
Curious to hear your thoughts here!
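The first idea amounts to a simple membership check at parse time. A minimal sketch, assuming the project-level list of valid types has already been read into a Python list (where that list would live and what it would be called is exactly the open question above, so allowed_types and invalid_exposure_types are hypothetical names):

```python
def invalid_exposure_types(exposures, allowed_types):
    """Return (name, type) pairs for exposures whose type is not allowed.

    `exposures` is a list of dicts with at least a "name" key; both
    arguments are hypothetical inputs, not an existing dbt interface.
    """
    allowed = set(allowed_types)
    return [
        (exposure["name"], exposure.get("type"))
        for exposure in exposures
        if exposure.get("type") not in allowed
    ]


# A project that has agreed on three types, and two exposures that drift.
allowed_types = ["dashboard", "ml", "application"]
exposures = [
    {"name": "weekly_kpis", "type": "dashboard"},
    {"name": "churn_report", "type": "dshboard"},   # typo
    {"name": "churn_model", "type": "classifier"},  # should have been "ml"
]

print(invalid_exposure_types(exposures, allowed_types))
```

Whether a mismatch should be a hard error or just be bucketed under a catch-all category is the trade-off between the two ideas.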
owner.email (@kgeis): This is one I feel more strongly about keeping! It feels like a natural next step, after having the ability to reliably report on exposure status, is a system to notify the right person when upstream sources/models/tests of that exposure have failed - perhaps even modulating the severity/frequency of that notification based on its maturity. If the owner email is hello@example.com, that's a pretty good indicator that the exposure is not actively maintained, which is also useful information for a viewer :)
maturity: Folks don't seem to mind only having high/medium/low. As numbers go, three is a pretty good one.
@jtcohen6 Re-opening up this (well, still open) issue :-)
What is your view on external actions/hooks being defined for exposures and triggered by dbt? I think this was already mentioned a couple of times in this thread, e.g. in the context of testing reports/dashboards. Specifically, in GoodData we would like to surface customers' reports as dbt exposures and perform semantic tests when underlying models change, ideally only for the reports which are affected by the changes. For this aspect we would like to use functionality similar to the current "state" method. Is there any progress in thinking about this functionality (dbt running external commands, i.e. python scripts)?
Another way exposures could be made more useful in the BI tool lineage use case is tracking of broken metrics/reports upon destructive operations in the database (table/column removals). Currently one can specify models in the depends_on clause and dbt refuses to run if some of the models no longer exist (essentially because of a broken DAG). It would be great if one could also specify the dependent columns. Also, sometimes it might actually be OK to drop some DB tables/columns, and in this case one would like to assess the "blast radius", i.e. the affected exposures, before deploying such a change to prod. I'm thinking here about something like dbt test --select exposures.
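The "blast radius" check could be approximated outside dbt today. The sketch below is only an illustration: it assumes a mapping shaped like the exposures section of manifest.json (exposure id to a dict with depends_on.nodes), and the function name affected_exposures is hypothetical. It looks at direct dependencies only and knows nothing about columns.

```python
def affected_exposures(exposures, removed_nodes):
    """Map each exposure id to the removed nodes it directly depends on.

    `exposures` is assumed to look like the "exposures" mapping of a parsed
    manifest.json; `removed_nodes` is the list of node ids being dropped.
    """
    removed = set(removed_nodes)
    hits = {}
    for exposure_id, exposure in exposures.items():
        depends_on = exposure.get("depends_on", {}).get("nodes", [])
        broken = sorted(removed.intersection(depends_on))
        if broken:
            hits[exposure_id] = broken
    return hits


# Hand-written stand-in for the exposures section of a manifest.
exposures = {
    "exposure.shop.revenue_dashboard": {
        "depends_on": {"nodes": ["model.shop.orders", "model.shop.customers"]},
    },
    "exposure.shop.churn_report": {
        "depends_on": {"nodes": ["model.shop.customers"]},
    },
}

print(affected_exposures(exposures, ["model.shop.orders"]))
```

Extending this to indirect dependencies would mean walking the model graph as well, and column-level impact would need information that exposures do not carry at the moment.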
Related to that (and also mentioned above) is the dependencies between exposures. The typical BI tool use cases would be quite simple, e.g. metrics -> reports -> dashboards. Is this something you still consider as desirable?
Apologies if my questions/suggestions are too naive, as I'm still familiarizing myself with dbt concepts :-)
@jtcohen6 Looking forward to seeing the feature of defining exposure types. This will be a cool feature. I support this way of defining types:
Users get to define their own set of valid exposure types somewhere (in dbt_project.yml?).
Then, each defined exposure has to conform to one of those types. The idea here is to prevent someone from accidentally merging code with an exposure type defined as dshboard, or from putting classifier when we'd previously agreed on the more-generic ml.
Has there been any update in making exposures ref-able or is there a plan in a future update? This is a feature I think would be very helpful in my team so we can visualize how all of our reports are made.
@jpau - Re your inquiry:
Should exposures be ref-able?
- exposures that depend on other exposures: one exposure for each Mode query / Looker view, one exposure for the dashboard that depends on those queries / views
- models that depend on exposures: modeled input to data science --> data science as exposure --> modeled output
- what exactly would ref('my_exposure') return?
Per my comments above, I'd personally prefer DBT to prioritize the static analysis and definitely-a-DAG properties which are critical to system stability. That said, I would love for the project to be able to reference another project's exposures, and perhaps that could even be performed as adding a reference pointer on a source definition like created_by: ref('other-project/exposure1'). I also would love for exposures to be able to reference other exposures - as in the example of a cube or bi_model exposure also having one or more of its own dashboard exposures which shows how end-users actually consume the data.
Thoughts?
+1 !!!! That's exactly what I am looking for right now. Even though I am coming from one, service-specific, tool such as Looker, I think there is also a need for this elsewhere and, if designed properly, can be universal.
Looker specific usage: there is a functionality where you can extend explores with other explores.
Ex:
explore: orders_and_customers {
  description: "This explore provides basic insights into customer base etc"
  view_name: orders
  join: customers {
    relationship: one_to_one  # one order per customer
    type: left_outer
    sql_on: ${orders.customer_id} = ${customers.customer_id} ;;
  }
}
explore: tracking_tools {
  extends: [orders_and_customers]
  description: "This explore allows to deep dive into how customers are using the tracking services for the orders"
  join: tracking_info {
    relationship: one_to_many
    type: left_outer
    sql_on: ${orders.order_id} = ${tracking_info.order_id} ;;
  }
}
So, the example is basic but hopefully explains the usage. Say I don't want the tracking links info to be visible to all users, only to, say, PMs. So I have 2 explores where one extends/enhances already existing one(s), and the extension can go on indefinitely, same as dbt lineage. In simple terms, orders_and_customers shows only 2 views, and tracking_tools shows all 3 views.
Since we have the ability to bring down the Documentation to dbt layer, it would be extremely helpful if we could reference exposures. Because, in this example, one exposure corresponds to one explore:
- name: Orders and Customers
  type: dashboard
  description: >
    This explore provides basic insights into customer base etc
  depends_on:
    - ref('orders')
    - ref('customers')
- name: tracking_tools
  type: dashboard
  description: >
    This explore allows to deep dive into how customers are using the tracking services for the orders
  depends_on:
    - ref('Orders and Customers')
    - ref('tracking_info')
I hope it's clear enough and, PLEEEEEASE, give us this functionality🥸
Let us know if you move this functionality up in priority and the approximate deadline 🙏
Additional use case at Ludia/JamCity: We have 2 data stacks with interdependencies. Some of our models are used downstream by tables and dashboards/reports.
We would like to define those downstream tables as exposure or external tables to reflect it in our DAG. But also have the possibility to put exposures (tableau dashboards) depending on other exposures (the downstream tables) so we can rebuild the DAG for the most important reports.
Thanks a lot, Lucas
I was looking into presenting some analysis object (https://docs.getdbt.com/docs/building-a-dbt-project/analyses) as an exposure. It is slightly misleading that you cannot (or maybe I couldn't figure it out since it's 10:30 PM), considering the fact that exposures do have a type that is called analysis (based on the above, it doesn't sound like these named members have any underlying logic).
The use case is one in which we are trying to present & document queries that drill across transactional fact tables to pull in together different data and aggregate/whatever. This qualifies the notion of an analysis, but we want it to be better managed, described and potentially rendered into dbt docs (we like the freedom of analysis objects, but crave the metadata of exposure objects).
I'd still be excited to make some of the changes mentioned here! In particular:
ref-able, and more neatly paired with other resource types (an "exposed" analysis, an "exposed" model)
I'm going to convert this to a GitHub discussion, since that's what it's rightly been all along :)
Describe the feature
What other properties should exposures get?
- tags
- meta
- meta fields (e.g. owner) to be required or top-level
- tags or description
- type options:
- maturity options
Should exposures be ref-able?
- what exactly would ref('my_exposure') return?
Describe alternatives you've considered
We're likely to keep these bare-bones for a little while. I'm still curious to hear what community members want!
Who will this benefit?
Users of the exposures resource type, which is new in v0.18.1