dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

[CT-2967] [Feature] run dbt in --sample mode #8378

Open MichelleArk opened 1 year ago

MichelleArk commented 1 year ago


Describe the feature

Introduce a new boolean flag, --sample/--no-sample, that informs whether dbt should run in 'sample' mode.

In sample mode, dbt would resolve ref and source identifiers as select * from <identifier> limit <sample-size>.

A default sample size could be used (perhaps 100?), or it could be set in config at the model level to inform the size of the sample to select when referencing the model.

models:
  - name: my_model
    config:
      sample:
        size: 10
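
For illustration, with that config a reference like {{ ref('my_model') }} might compile in sample mode to something along these lines (the exact subquery shape and alias aren't specified by this proposal):

select * from (
    select * from <database>.<schema>.my_model limit 10
) subq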

This would also enable setting a sample size using the hierarchical config syntax in dbt_project.yml:

# dbt_project.yml
models:
  <resource-path>:
    +sample: {<dictionary>}

It would also be interesting to explore providing column-level filters as configs, so that samples would resolve as select * from <identifier> where <filters> limit <sample-size>. This way, it would be possible to tune samples to ensure referential integrity within a run in sample mode.

We may also want to create an equivalent env_var, so folks could set a sample for a given environment.

Describe alternatives you've considered

It's possible to implement something like this with overrides of ref & source macros, but that makes dbt upgrades more difficult.
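
For context, that override workaround typically looks something like the sketch below (simplified: only the single-argument form of ref is handled, and the sample_size var is illustrative):

{# macros/ref.sql (illustrative override) #}
{% macro ref(model_name) %}
    {%- set rel = builtins.ref(model_name) -%}
    {%- if target.name != 'prod' and var('sample_size', none) is not none -%}
        (select * from {{ rel }} limit {{ var('sample_size') }}) {{ rel.identifier }}
    {%- else -%}
        {{ rel }}
    {%- endif -%}
{% endmacro %}

Every project maintaining an override like this has to re-check it whenever ref grows new behaviour (packages, versions), which is one source of the upgrade friction mentioned above.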

I also considered a simpler version of this, where --sample is an integer argument representing the sample size, in https://github.com/dbt-labs/dbt-core/pull/8337. However, that doesn't leave design space for configuring filters on model/source samples.

Who will this benefit?

Any dbt user looking for speedier development & CI runs (at the cost of completeness/accuracy)!

Are you interested in contributing this feature?

yep!

Anything else?

Took a first pass at this in a spike here: https://github.com/dbt-labs/dbt-core/pull/8337

jaypeedevlin commented 1 year ago

Hey @MichelleArk, excited about this enhancement, as it intersects with some planned work that we were going to execute using ref/source overrides!

We definitely would love to see this in a really configurable form, like the where config you've suggested, or even additionally an order by (i.e. order by random()). Essentially, the more customizable the better from our perspective!
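
For concreteness, a truly random sample along those lines might compile to something like this (shape is illustrative only; the function name varies by warehouse):

select * from (
    select *
    from <database>.<schema>.stg_orders
    order by random()
    limit 100
) subq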

jtcohen6 commented 1 year ago

Thanks for opening @MichelleArk! :)

I love how simple the --sample boolean flag is, with all the actual configuration defined in code.

Question: Is this sufficient in practice? Will users need the ability to define multiple different sampling strategies for the same models, to run in different environments (dev vs. CI vs. QA vs. ...)?

If so: Should that look like lots of vars / env vars within the sample config block? Or, should it look like different "named" sample strategies that can be referenced on the CLI?
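
Purely to illustrate the second option, "named" strategies could hypothetically look something like this (none of this syntax exists today, and it would turn --sample from a boolean into a named argument, e.g. dbt run --sample ci):

# models/properties.yml (hypothetical)
models:
  - name: my_model
    config:
      sample:
        dev:
          size: 100
        ci:
          size: 10
          filters:
            created_at: "> {{ run_started_at.date() - modules.datetime.timedelta(days=3) }}"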


I'd offer a few more arguments in favor of supporting column-specific filters:

This could look like defining additional attributes on columns in a model's properties:

# models/properties.yml
models:
  - name: my_model
    columns:
      - name: customer_id
        sample: [10, 30]  # array -> 'in' filter
      - name: created_at
        sample: "> {{ run_started_at.date() - modules.datetime.timedelta(days=3) }}"  # string -> filter as is

Then, anywhere you write:

-- models/another_model.sql
with my_model as (
    select * from {{ ref('my_model') }}
), ...

dbt would actually resolve that as a subquery like:

-- models/another_model.sql
with my_model as (
    select * from (
       select * from <database>.<schema>.my_model
       where customer_id in (10, 30)
          and created_at > '2023-08-11'  -- writing this on Aug 14
    ) subq
), ...

Or, it could also be a config that's set for many models at once (or even every model in the project):

# dbt_project.yml
models:
  my_project:
    +sample:
      customer_id: [10, 30]  # array -> 'in' filter
      created_at: "> {{ run_started_at.date() - modules.datetime.timedelta(days=3) }}"  # string -> filter as is

The latter option would require dbt to check the set of columns first, to see if fields named customer_id and created_at actually exist (case-insensitive of course). As dbt is templating out each source / ref call, it would figure out if the model/source being referenced has those columns. I think we could do that in one of two ways:
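
As a rough illustration of the kind of runtime check involved (a sketch only, not necessarily either of those two ways; sample_filters stands in for whatever dict of configured column filters dbt has resolved):

{# rel is the Relation being resolved for this ref()/source() call #}
{% set existing_cols = adapter.get_columns_in_relation(rel) | map(attribute='name') | map('lower') | list %}
{% set applicable = {} %}
{% for column_name, condition in sample_filters.items() %}
    {% if column_name | lower in existing_cols %}
        {% do applicable.update({column_name: condition}) %}
    {% endif %}
{% endfor %}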

Bonus considerations:


@jaypeedevlin I'm happy to hear that this piques your interest! I'm curious - what would be your motivation for order by random()? I'm thinking it's the exact opposite of what I'm describing above: give me a truly random smattering of data, which is unlikely to match up with other models in downstream joins / dashboarding.

QuentinCoviaux commented 1 year ago

Love this topic!

I also agree that column-specific filter predicates might be more interesting for consistency across models, or even just on their own. If you have years of data in your model, it feels more relevant to get everything from the past month rather than randomly scattered data points - although that probably depends on the use case and what you are trying to test!

I'm wondering if we could customize where this filter is applied, though. As I understand it, if one sets this in a .yml file, it would be executed at the end of the model, whereas it might be better performance-wise to apply it early - especially if you have big tables and a lot of transformations in a model.

We have been doing something similar: in the first group of CTEs of certain models, where we simply select from a source or ref, we added this block:

select * from {{ ref('stage') }}
{% if target.name != 'prod' and var("dev_load_full") != true %}
where to_date("timestamp") >= dateadd('day', -{{ var("dev_load_days") }}, current_date())
{% endif %}

And we define those variables in dbt_project.yml as defaults:

vars:
  dev_load_full: false
  dev_load_days: 7

This way we can sample the data early in the query rather than later.
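
For reference, we can also override those defaults for a single run with the standard --vars flag:

dbt run --vars '{"dev_load_full": true}'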

jaypeedevlin commented 1 year ago

@jtcohen6 that was a reasonably arbitrary example of truly random sampling. Most of our models are incremental so we'd be (in theory) randomly sampling just the most recent data anyway. Definitely not the most sophisticated method for my example though, I'll grant you that, but just arguing to make this as configurable as possible.

graciegoheen commented 1 year ago

Really love the conversation going here - I have personally written a handful of hack-y work-arounds to accomplish this exact thing, so would be excited to see a native approach in dbt-core, especially for folks trying to optimize time and warehouse spend! Curious about the name "sample" vs. something like "limit" - I know we already have a limit config for test outputs and a limit for the dbt show command. Thinking more about it, I like that the name "sample" is distinguished from "limiting the outputs" (which it's not!).

I can also imagine folks wanting to configure different samples per environment, which is something we could enable using environment variables.

# dbt_project.yml
models:
  my_project:
    +sample: "{{ env_var('sample') }}"
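
Illustratively, that value could then be set differently per environment at invocation time (the env var name and value shape here are assumptions rather than settled design):

# e.g. in a CI job
export sample='{size: 10}'
dbt build --sample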

I'm definitely in favor of folks being able to define a filter, which we recommend in our best practices guide and has come up in a handful of solutions.

@QuentinCoviaux my understanding here is that running dbt in --sample mode actually affects what {{ ref() }} and {{ source() }} compile to, rather than adding a limit at the end of a model. So, for example, if we had a model with many CTEs:

with customers as (

    select * from {{ ref('stg_customers') }}

),

orders as (

    select * from {{ ref('stg_orders') }}

),

payments as (

    select * from {{ ref('stg_payments') }}

),

customer_orders as (

    select
        customer_id,

        min(order_date) as first_order,
        max(order_date) as most_recent_order,
        count(order_id) as number_of_orders
    from orders

    group by customer_id

),

customer_payments as (

    select
        orders.customer_id,
        sum(amount) as total_amount

    from payments

    left join orders on
         payments.order_id = orders.order_id

    group by orders.customer_id

),

final as (

    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order,
        customer_orders.most_recent_order,
        customer_orders.number_of_orders,
        customer_payments.total_amount as customer_lifetime_value

    from customers

    left join customer_orders
        on customers.customer_id = customer_orders.customer_id

    left join customer_payments
        on  customers.customer_id = customer_payments.customer_id

)

select * from final

When executing in --sample mode, the {{ ref() }} statements here would compile to their raw locations plus their configured limits. So, if we had configured samples as such:

models:
  - name: stg_customers
    config:
      sample:
        size: 10
  - name: stg_orders
    config:
      sample:
        size: 50
  - name: stg_payments
    config:
      sample:
        size: 75

the compiled code would look something like this:

with customers as (

    select * from my_warehouse.my_schema.stg_customers limit 10

),

orders as (

    select * from my_warehouse.my_schema.stg_orders limit 50

),

payments as (

    select * from my_warehouse.my_schema.stg_payments limit 75

),

customer_orders as (

    select
        customer_id,

        min(order_date) as first_order,
        max(order_date) as most_recent_order,
        count(order_id) as number_of_orders
    from orders

    group by customer_id

),

customer_payments as (

    select
        orders.customer_id,
        sum(amount) as total_amount

    from payments

    left join orders on
         payments.order_id = orders.order_id

    group by orders.customer_id

),

final as (

    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order,
        customer_orders.most_recent_order,
        customer_orders.number_of_orders,
        customer_payments.total_amount as customer_lifetime_value

    from customers

    left join customer_orders
        on customers.customer_id = customer_orders.customer_id

    left join customer_payments
        on  customers.customer_id = customer_payments.customer_id

)

select * from final

In that case, we would actually be filtering as early as possible.

I have seen this sort of problem solved in 2 ways:

  1. overriding the ref & source macros (this solution)
  2. adding a limit macro to the end of the relevant model (there are lots of old discussions on this; a rough sketch follows)
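
Here's a rough sketch of approach 2 (the macro and its defaults are illustrative, not an established pattern):

{# macros/limit_in_dev.sql (illustrative) #}
{% macro limit_in_dev(row_count=500) %}
    {%- if target.name != 'prod' -%}
        limit {{ row_count }}
    {%- endif -%}
{% endmacro %}

-- and at the end of a model:
select * from final
{{ limit_in_dev(1000) }}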

I'd be curious @jtcohen6 and @MichelleArk for y'all's thoughts on the pros / cons of each approach - and why "sampling" is better than "limiting".

dbeatty10 commented 12 months ago

Insights about sampling

"Why not just take samples?"

Sampling is actually pretty hard. Data developers will sometimes use sample-ish hacks (like testing with the last 3 days’ worth of data, or using a sample function built into the database) but these can often come back to bite you. Samples need to take into account referential integrity and unevenly distributed data if they are to be any use when testing.

One solution

SnowShu is a sampling engine designed to support testing in data development.

Here's one solution for sampling by Health Union:

Caveat: I haven't tried it out personally.

Interesting links

Here are some other interesting links from Health Union:

graciegoheen commented 12 months ago

It would also be interesting to explore providing column-level filters as configs, so that samples would resolve as select * from <identifier> where <filters> limit <sample-size>. This way, it would be possible to tune samples to ensure referential integrity within a run in sample mode.

Just adding a +1 here to being able to sample via adding a filter!

aranke commented 11 months ago

Great discussion here, just wanted to add my 2c stemming from a conversation with @dbeatty10:

Could we sample with an optional join_key config?

Here's an example of how I could see this working:

  1. Sample selects n random rows from the customers table
  2. A join_key on the transactions table lets us find all transactions for those sampled customers, instead of sampling the transactions table independently at random (see the config sketch below)
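
To make that concrete, a hypothetical configuration (none of this syntax exists today) might look like:

models:
  - name: customers
    config:
      sample:
        size: 100  # n random customers
  - name: transactions
    config:
      sample:
        join_key:
          model: customers
          column: customer_id  # keep every transaction belonging to the sampled customers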

This could be a natural extension to the work in https://github.com/dbt-labs/dbt-core/issues/8652 (unit-test rows from seeds) to enable unit-testing on sample rows from models.

To wrap it all up, we could create a new “hybrid” test that takes this sample as an input but checks for data assertions as output (e.g., check that customer status is churned if they haven't made a purchase in the last 12 months).

This could help alleviate mocking fatigue in unit testing for fct/dim tables that depend on dozens of int/stg tables.

moseleyi commented 7 months ago

Sampling sounds like a great idea! I would also second the ask for full customisation of this configuration: not just a limit, but ordering and even SQL filtering. In BigQuery, for example, I would sample a single partition. It could be expanded over time, but having it to begin with is a great start!
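
As an illustration of the partition idea, a sample filter in BigQuery could scope the scan to a single ingestion-time partition (the table name here is made up, and the exact predicate depends on how the table is partitioned):

select *
from my_project.my_dataset.events
where _PARTITIONTIME >= timestamp_trunc(current_timestamp(), day)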

joellabes commented 1 week ago

Bonus considerations:

  • Making sure this works for tests. For generic tests, it's like a default set of where configs—maybe even better!—but the two features should work happily together.

Just checking whether this will be included? (🥺)