Open MichelleArk opened 1 year ago
Hey @MichelleArk, excited about this enhancement, as it intersects with some planned work that we were going to execute using ref/source overrides!
We definitely would love to see this in a really configurable form, like the where config you've suggested, or even additionally an order by (i.e. order by random()). Essentially, the more customizable the better from our perspective!
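For instance - purely illustrative, not proposed syntax - a sampled ref with a where plus order by random() config might resolve to something like:

select * from (
    select *
    from <database>.<schema>.my_model
    where created_at >= current_date - 7  -- hypothetical 'where' config
    order by random()                     -- hypothetical 'order by' config
    limit 100                             -- hypothetical sample size
) subq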
Thanks for opening @MichelleArk! :)
I love how simple the --sample boolean flag is, with all the actual configuration defined in code.
Question: Is this sufficient in practice? Will users need the ability to define multiple different sampling strategies for the same models, to run in different environments (dev vs. CI vs. QA vs. ...)?
If so: Should that look like lots of vars / env vars within the sample config block? Or should it look like different "named" sample strategies that can be referenced on the CLI?
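To make that question concrete, a hedged sketch of what "named" strategies could look like - none of these keys or flags exist today, they're only meant to illustrate the idea:

# dbt_project.yml -- hypothetical
samples:
  ci:
    size: 50
  dev:
    size: 500
    where: "created_at > '2023-01-01'"

which might then be selected on the CLI with something like dbt build --sample ci.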
I'd offer a few more arguments in favor of supporting column-specific filters.
This could look like defining additional attributes on columns in a model's properties:
# models/properties.yml
models:
  - name: my_model
    columns:
      - name: customer_id
        sample: [10, 30] # array -> 'in' filter
      - name: created_at
        sample: "> {{ run_started_at.date() - modules.datetime.timedelta(days=3) }}" # string -> filter as is
Then, anywhere you write:
-- models/another_model.sql
with my_model as (
    select * from {{ ref('my_model') }}
), ...
dbt would actually resolve that as a subquery like:
-- models/another_model.sql
with my_model as (
    select * from (
        select * from <database>.<schema>.my_model
        where customer_id in (10, 30)
        and created_at > '2023-08-11' -- writing this on Aug 14
    ) subq
), ...
Or, it could also be a config that's set for many models at once (or even every model in the project):
# dbt_project.yml
models:
  my_project:
    +sample:
      customer_id: [10, 30] # array -> 'in' filter
      created_at: "> {{ run_started_at.date() - modules.datetime.timedelta(days=3) }}" # string -> filter as is
The latter option would require dbt to check the set of columns first, to see if fields named customer_id and created_at actually exist (case-insensitive of course). As dbt is templating out each source / ref call, it would figure out if the model/source being referenced has those columns. I think we could do that in one of two ways:
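Whatever the mechanism, here's a rough sketch of the kind of column check involved - a sketch only, not a proposal. adapter.get_columns_in_relation is an existing adapter method, while sample_filters_for and its arguments are made up for illustration; array-style filters are ignored for brevity, and this would only work at execution time when the adapter can query the warehouse:

{% macro sample_filters_for(relation, column_filters) %}
    {#- Keep only the configured filters whose columns actually exist on the relation (case-insensitive) -#}
    {%- set existing = adapter.get_columns_in_relation(relation) | map(attribute='name') | map('lower') | list -%}
    {%- set clauses = [] -%}
    {%- for column_name, condition in column_filters.items() -%}
        {%- if column_name | lower in existing -%}
            {%- do clauses.append(column_name ~ ' ' ~ condition) -%}
        {%- endif -%}
    {%- endfor -%}
    {{ return(clauses | join(' and ')) }}
{% endmacro %}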
Bonus considerations:
- Making sure this works for tests. For generic tests, it's like a default set of where configs—maybe even better!—but the two features should work happily together.
- For Python models, these would be .filter(...).filter(...) (snowpark, pyspark, pandas) instead of where ... and ....

@jaypeedevlin I'm happy to hear that this piques your interest! I'm curious - what would be your motivation for order by random()? I'm thinking it's the exact opposite of what I'm describing above: give me a truly random smattering of data, which is unlikely to match up with other models in downstream joins / dashboarding.
Love this topic!
I also agree that column-specific filter predicates might be more interesting, for consistency across models or even just on their own. If you have years of data in your model, it might be more relevant to get everything from the past month rather than random data points scattered throughout - although that probably depends on the use case and what you are trying to test!
I'm wondering if we could customize where this filter gets applied, though. As I understand it, if one sets this in a .yml file it would be applied at the end of the model, whereas it might be better performance-wise to apply it early - especially if you have big tables and a lot of transformations in a model.
We have been doing something similar: in the first group of CTEs of certain models, where we simply select from a source or ref, we added this block:
select * from {{ ref('stage') }}
{% if target.name != 'prod' and var("dev_load_full") != true %}
where to_date("timestamp") >= dateadd('day', -{{ var("dev_load_days") }}, current_date())
{% endif %}
And we define those variables in the dbt_project.yml as defaults:
vars:
  dev_load_full: false
  dev_load_days: 7
This way we can sample the data early in the query rather than later.
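Worth noting that those defaults can be overridden per invocation with dbt's --vars flag, e.g.:

dbt build --vars '{"dev_load_full": true}'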
@jtcohen6 that was a reasonably arbitrary example of truly random sampling. Most of our models are incremental so we'd be (in theory) randomly sampling just the most recent data anyway. Definitely not the most sophisticated method for my example though, I'll grant you that, but just arguing to make this as configurable as possible.
Really love the conversation going here - I have personally written a handful of hack-y work-arounds to accomplish this exact thing, so would be excited to see a native approach in dbt-core, especially for folks trying to optimize time and warehouse spend!
Curious about the name "sample" vs. something like "limit" - I know we already have a limit config for test outputs and a limit for the dbt show command. Thinking more about it, I like that the name "sample" is distinguished from "limiting the outputs" (which it's not!).
I can also imagine folks wanting to configure different samples per environment, which is something we could enable using environment variables.
# dbt_project.yml
models:
  my_project:
    +sample: "{{ env_var('sample') }}"
I'm definitely in favor of folks being able to define a filter, which we recommend in our best practices guide and has come up in a handful of solutions.
@QuentinCoviaux my understanding here is that running dbt in --sample mode actually affects what {{ ref() }} and {{ source() }} compile to, rather than adding a limit at the end of a model. So, for example, if we had a model with many CTEs:
with customers as (
    select * from {{ ref('stg_customers') }}
),
orders as (
    select * from {{ ref('stg_orders') }}
),
payments as (
    select * from {{ ref('stg_payments') }}
),
customer_orders as (
    select
        customer_id,
        min(order_date) as first_order,
        max(order_date) as most_recent_order,
        count(order_id) as number_of_orders
    from orders
    group by customer_id
),
customer_payments as (
    select
        orders.customer_id,
        sum(amount) as total_amount
    from payments
    left join orders on
        payments.order_id = orders.order_id
    group by orders.customer_id
),
final as (
    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order,
        customer_orders.most_recent_order,
        customer_orders.number_of_orders,
        customer_payments.total_amount as customer_lifetime_value
    from customers
    left join customer_orders
        on customers.customer_id = customer_orders.customer_id
    left join customer_payments
        on customers.customer_id = customer_payments.customer_id
)
select * from final
When executing in --sample mode, the {{ ref() }} statements here would compile to their raw location plus their configured limits. So, if we had configured samples as such:
models:
  - name: stg_customers
    config:
      sample:
        size: 10
  - name: stg_orders
    config:
      sample:
        size: 50
  - name: stg_payments
    config:
      sample:
        size: 75
the compiled code would look something like this:
with customers as (
    select * from my_warehouse.my_schema.stg_customers limit 10
),
orders as (
    select * from my_warehouse.my_schema.stg_orders limit 50
),
payments as (
    select * from my_warehouse.my_schema.stg_payments limit 75
),
customer_orders as (
    select
        customer_id,
        min(order_date) as first_order,
        max(order_date) as most_recent_order,
        count(order_id) as number_of_orders
    from orders
    group by customer_id
),
customer_payments as (
    select
        orders.customer_id,
        sum(amount) as total_amount
    from payments
    left join orders on
        payments.order_id = orders.order_id
    group by orders.customer_id
),
final as (
    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order,
        customer_orders.most_recent_order,
        customer_orders.number_of_orders,
        customer_payments.total_amount as customer_lifetime_value
    from customers
    left join customer_orders
        on customers.customer_id = customer_orders.customer_id
    left join customer_payments
        on customers.customer_id = customer_payments.customer_id
)
select * from final
In that case, we would actually be filtering as early as possible.
I have seen this sort of problem solved in 2 ways:
I'd be curious @jtcohen6 and @MichelleArk for y'all's thoughts on the pros / cons of each approach - and why "sampling" is better than "limiting".
Sampling is actually pretty hard. Data developers will sometimes use sample-ish hacks (like testing with the last 3 days' worth of data, or using a sample function built into the database) but these can often come back to bite you. Samples need to take into account referential integrity and unevenly distributed data if they are to be any use when testing.
SnowShu is a sampling engine designed to support testing in data development.
Here's one solution for sampling by Health Union:
Caveat: I haven't tried it out personally.
Here are some other interesting links from Health Union:
It would also be interesting to explore providing column-level filters as configs, so that samples would resolve as select * from <identifier> where <filters> limit <sample-size>. This way, it would be possible to tune samples to ensure referential integrity within a run in sample mode.
Just adding a +1 here to being able to sample via adding a filter!
Great discussion here, just wanted to add my 2c stemming from a conversation with @dbeatty10:
Could we sample with an optional join_key config?
Here's an example of how I could see this working (sketched below):
- n random rows from the customers table
- join_key on the transactions table allows us to find all transactions for those customers instead of randomly sampling from customers
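A rough sketch of what the second bullet could compile to for {{ ref('transactions') }} - all of this syntax is hypothetical, and in practice the sampled customer set would need to stay stable across the run:

select * from (
    select t.*
    from <database>.<schema>.transactions as t
    where t.customer_id in (
        -- the n customers selected by the customers sample
        select customer_id
        from <database>.<schema>.customers
        order by random()
        limit 10
    )
) subq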
This could be a natural extension to the work in https://github.com/dbt-labs/dbt-core/issues/8652 (unit-test rows from seeds) to enable unit-testing on sample rows from models.
To wrap it all up, we could create a new “hybrid” test that takes this sample as an input but checks for data assertions as output (e.g., check that customer status is churned if they haven't made a purchase in the last 12 months).
This could help alleviate mocking fatigue in unit testing for fct/dim tables that depend on dozens of int/stg tables.
Sampling sounds like a great idea! I would also second the ask for full customisation of this configuration: not just a limit, but ordering and even SQL filtering. In BigQuery, for example, I would sample a single partition. It could be expanded over time, but having it to begin with is a great start!
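For example - a sketch only, not proposed configuration - a BigQuery sample filter might target a single ingestion-time partition:

select *
from my_project.my_dataset.events  -- hypothetical table
where _PARTITIONDATE = DATE '2024-01-01'
limit 1000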
Just checking whether this will be included? (🥺)
Is this your first time submitting a feature request?
Describe the feature
Introduce a new boolean flag, --sample/--no-sample, that informs whether dbt should run in 'sample' mode.
In sample mode, dbt would resolve ref and source identifiers as select * from <identifier> limit <sample-size>.
A default sample size could be used (perhaps 100?), or set in config at the model level to inform the size of the sample to select when referencing the model.
This would also enable setting a sample size using the hierarchical config syntax in dbt_project.yml:
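Purely as an illustration - the exact keys are an assumption, mirroring the +sample / size examples elsewhere in this thread:

# dbt_project.yml
models:
  my_project:
    staging:
      +sample:
        size: 100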
It would also be interesting to explore providing column-level filters as configs, so that samples would resolve as select * from <identifier> where <filters> limit <sample-size>. This way, it would be possible to tune samples to ensure referential integrity within a run in sample mode.
We may also want to create an equivalent env_var, so folks could set a sample for a given environment.
Describe alternatives you've considered
It's possible to implement something like this with overrides of ref & source macros, but that makes dbt upgrades more difficult.
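For context, a minimal sketch of that kind of override - assuming a project-level ref macro wrapping builtins.ref, with a made-up dev_sample_size var; a real version would also need to handle two-argument refs and a matching source override:

{% macro ref(model_name) %}
    {#- Sketch of the override workaround: wrap the built-in ref in a limited subquery outside prod -#}
    {%- set rel = builtins.ref(model_name) -%}
    {%- if target.name != 'prod' and var('dev_sample_size', none) -%}
        (select * from {{ rel }} limit {{ var('dev_sample_size') }})
    {%- else -%}
        {{ rel }}
    {%- endif -%}
{% endmacro %}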
Also considered a simpler version of this where --sample is an integer argument representing the sample size in https://github.com/dbt-labs/dbt-core/pull/8337. This doesn't leave design space for configuring filters on model/source samples, however.
Who will this benefit?
any dbt user looking for speedier (at the cost of completeness/accuracy) development & CI runs!
Are you interested in contributing this feature?
yep!
Anything else?
Took a first pass at this in a spike here: https://github.com/dbt-labs/dbt-core/pull/8337