dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[CT-3032] `dbt show --limit` should include the limit in the DWH query, to avoid long-running queries #8496

Closed · aranke closed this 11 months ago

aranke commented 1 year ago

Housekeeping

Short description

Most adapters (see the impact matrix below) load the entire model into memory while running dbt show. The goal of this ticket is to avoid that behavior and load only the relevant rows from the model.

Acceptance criteria

  1. Inline the limit into the query for dbt show, instead of relying on the fetchmany method.
  2. Confirm that the query sent to Snowflake does have the limit appended to it.
  3. Add a test case where there is a limit already in the query.
  4. Make any needed changes to the adapter repos.

Impact to Other Teams

Adapter impact matrix

| Adapter    | Impacted? | Notes |
| ---------- | --------- | ----- |
| Postgres   |           |       |
| Snowflake  |           |       |
| BigQuery   | 🚫        |       |
| Databricks |           |       |
| Redshift   |           |       |

Will backports be required?

Backport to 1.6.latest, 1.5.latest

Context

Original bug report in Slack: https://dbt-labs.slack.com/archives/C05FWBP9X1U/p1692909690843649.

Link to similar issue that was fixed in Core: https://github.com/dbt-labs/dbt-core/issues/7390.

The root cause of the issue is that warehouse providers are implementing the fetchmany method in the DB-API 2.0 spec inefficiently by loading the entire table into memory and filtering client-side.

The long-term fix here is for them to become smarter in how they implement fetchmany, similar to what BigQuery has done: https://github.com/googleapis/google-cloud-python/issues/9185.
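For illustration (my example, not from the bug report), here is the DB-API 2.0 pattern in question, using Python's built-in sqlite3 driver as a stand-in: fetchmany(size=...) only controls how many rows the client pulls from the cursor, not how much work the engine does or how large a result set it materializes.

```python
import sqlite3  # any DB-API 2.0 driver exposes the same cursor interface

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table my_table as select 1 as id union all select 2 union all select 3")

# The engine executes the full query; fetchmany() only trims how many rows the
# client reads back from the cursor. Nothing here tells the database to stop early.
cur.execute("select * from my_table")
preview = cur.fetchmany(size=2)
print(preview)  # [(1,), (2,)]
```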

davidharting commented 1 year ago

What should the expected behavior be if I run a command like this?

dbt show --inline "select * from my_table limit 500" --limit 10

or like this?

dbt show --inline "select * from my_table limit 10" --limit 500
jtcohen6 commented 1 year ago

@davidharting Agree! I think we'd need to use something like a subquery, to avoid a syntax error that would result here:

-- no good
select * from my_table limit 500
limit 10;

-- works
select * from (
    select * from my_table limit 500
) subquery
limit 10;

The alternative is trying to parse the query to see if the word limit already appears at the end. That opens up the risk of edge cases IMO.

davidharting commented 1 year ago

Nice, that makes sense! So the expected behavior is simply that the query we are showing is always honored verbatim, and the limit is applied on an "outer layer" via the subquery mechanism above. Simple, predictable, explicit. Love it.

aranke commented 1 year ago

@davidharting Thanks for pointing this out; I've added writing up a test case for this to the acceptance criteria.

dataders commented 1 year ago

The root cause of the issue is that warehouse providers are implementing the fetchmany method in the DB-API 2.0 spec inefficiently by loading the entire table into memory and filtering client-side.

Perhaps nitpicking, but I think this is better characterized like this:

PEP 249 DB API 2.0 provides a specification for fetching result sets in batches with .fetchmany(). However, it says nothing about providing an API to modify the execution of a given SQL statement so that the result set returned by the database is smaller. It only defines size as the number of rows of the result set that the client should return.

This is the standard, and it is the default way that drivers have implemented this feature. It is also how pandas.DataFrame.head() works.
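For comparison (my example, not from the thread), head() likewise trims a result that has already been fully materialized in memory, rather than reducing the work done to produce it:

```python
import pandas as pd

# The entire DataFrame is built first; head() only limits the rows handed back to
# the caller, much like fetchmany() trimming a result set the warehouse already computed.
df = pd.DataFrame({"id": range(1_000_000)})
preview = df.head(5)
print(preview)
```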

Our docs are currently clear that --limit pertains to the number of rows to display

By default, dbt show will display the first 5 rows from the query result. This can be customized by passing the flag --limit n, where n is the number of rows to display.

Usage of fetchmany() with a default size=500 does address client-side OOM issues, but it does not enable any sort of "pushdown limit filtering" as #7545 was titled. There's no magic spell that makes a long-running query return its rows any faster; there is no pushdown.

The way to get pushdown is to literally push the limit down into the SQL. My gut says we modify Adapter.execute() with a conditional: when limit is not None, add a LIMIT clause to the given SQL. We're lucky here, IMHO, because adapters have largely already migrated to the new .execute() signature that includes the limit parameter.
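A minimal sketch of that conditional, assuming a hypothetical helper that execute() could call before sending the SQL to the warehouse (the function name and wrapping strategy are illustrative, not dbt's actual implementation):

```python
from typing import Optional

def _apply_show_limit(sql: str, limit: Optional[int]) -> str:
    """Hypothetical helper: wrap the compiled SQL so at most `limit` rows are requested."""
    if limit is None:
        return sql
    # Wrapping in a subquery honors any LIMIT already present in the user's SQL
    # and avoids having to parse the statement for an existing clause.
    return f"select * from (\n{sql}\n) as _dbt_limit_subq\nlimit {limit}"
```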

Two challenges to this approach:

  1. What about databases like Fabric and Synapse that literally don't have a LIMIT clause? Not a big deal, but this limit injection should be dispatchable to allow things like SELECT TOP N (see the sketch after this list).
  2. dbt show and Cloud's preview don't work for materializations that aren't views or tables (e.g. snapshot, incremental). This makes sense behind the scenes because UPDATE/MERGE/DELETE statements don't return rows to the client. Just calling it out that if we ever want to preview these kinds of models, this new behavior would also need to be taken into consideration.
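As an illustrative-only sketch of that dispatchability (dialect names and branching below are mine, not dbt's; the real mechanism would presumably go through each adapter's dispatch/macro layer), the row-cap rendering could vary per dialect:

```python
def _render_row_cap(sql: str, limit: int, dialect: str) -> str:
    """Hypothetical dispatchable rendering of the row cap per SQL dialect."""
    # T-SQL warehouses (e.g. Fabric, Synapse) have no LIMIT clause; they use TOP.
    if dialect in ("fabric", "synapse"):
        return f"select top {limit} * from (\n{sql}\n) as _dbt_limit_subq"
    return f"select * from (\n{sql}\n) as _dbt_limit_subq\nlimit {limit}"
```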

Last thing is that this smells very much like an interface issue rather than a technical one. I don't blame folks for thinking that limit should edit the query being run (see https://github.com/tlocke/pg8000/issues/133). When a Cloud IDE user clicks "Preview", what are they expecting to happen? Could the quickest win be to clarify that it's not our fault if the query being previewed takes a long time to run? I'd be surprised if DWH query web client IDEs do anything like modifying user-submitted SQL to make things faster for users...

davidharting commented 1 year ago

@dataders

I think your feedback is sort of two-fold here:

I can't weigh in on the suggested implementation.

But for dbt CLI users, I do still see this as a problem, if not a bug. dbt show does not persist the results anywhere. Because of that, I think it is a reasonable expectation that dbt show will be efficient and not fetch more data than it knows you want to see.

dave-connors-3 commented 1 year ago

@dataders @aranke @jtcohen6

dbt show and Cloud's preview don't work for materializations that aren't views or tables (eg snapshot, incremental). This makes sense behind the scenes because UPDATE/MERGE/DELETE statements don't return rows to the client. Just calling it out that if we ever want to preview this kinds of models, that this new behavior would also need to be taken into consideration

That is in fact news to me! This would represent a breaking change to the IDE on 1.2, as the strategy in IDE 1.1 (which used RPC) was more like dbt show --inline, which allows for previewing incremental models. We updated the IDE strategy in 1.2 to use dbt show -s model, mostly to properly render things like {{ this }}, which is ironically almost exclusively relevant to incremental models!

It sounds like there are some technical justifications for why show may not be able to return rows here, but strictly in the context of the IDE I think we'd want to make sure preview works for all models.

jtcohen6 commented 1 year ago

@dataders @dave-connors-3

dbt show and Cloud's preview don't work for materializations that aren't views or tables (eg snapshot, incremental).

I do not believe this is (or should be) the case! Have you found it to be otherwise?

Incremental models support previewing their compiled SQL (select statement, not wrapped with DDL/DML), using dbt show, just the same as view/table models. There are a few known edge cases, e.g. _dbt_max_partition on BigQuery insert_overwrite models (internal slack thread).

To @davidharting's point: I strongly believe that dbt-core/adapters should be responsible for owning the pushdown of the limit, to ensure that a user's limit 10 query can return a result set of 10 rows as quickly as possible.

Update: @dave-connors-3 @davidharting will be investigating this and getting to the bottom of why incremental models don't appear to be working in the IDE for preview with v1.6.

colin-rogers-dbt commented 1 year ago

I strongly believe that dbt-core/adapters should be responsible for owning the pushdown of the limit, to ensure that a user's limit 10 query can return a result set of 10 rows as quickly as possible.

While I agree this definitely falls onto adapters, I do want to temper expectations here: for many queries and data warehouses, adding the limit into the query may not significantly impact the query run time. This is because many query engines don't push limit clauses down into subqueries/CTEs; the limit is therefore only applied to the result set of the outer query that the limit clause was added to. Consider:

with t2 as (
<some expensive computation/join>
)
SELECT t1.*
FROM t1
INNER JOIN t2 ON t1.id = t2.id
LIMIT 10

If t2 is the expensive part of this query then the limit 10 will have a negligible impact on the runtime.

Not saying we shouldn't try to add the limit into the query, but doing so is going to be an incremental improvement, not a complete fix.