Closed aranke closed 11 months ago
What should the expected behavior be if I run a command like this?

```
dbt show --inline "select * from my_table limit 500" --limit 10
```

or like this?

```
dbt show --inline "select * from my_table limit 10" --limit 500
```
@davidharting Agree! I think we'd need to use something like a subquery, to avoid a syntax error that would result here:
```sql
-- no good: a second LIMIT clause is a syntax error
select * from my_table limit 500
limit 10;

-- works
select * from (
  select * from my_table limit 500
) subquery
limit 10;
```
The alternative is trying to parse the query to see if the word `limit` already appears at the end. That opens up the risk of edge cases IMO.
Nice that makes sense! So expected behavior is simply that the query we are showing is always honored verbatim, and the limit is applied on an "outer layer" via the subquery mechanism above. Simple, predictable, explicit. Love it.
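The "outer layer" approach can be sketched in a few lines of Python. This is a minimal illustration, not dbt's actual implementation; the helper name is hypothetical:

```python
def wrap_with_limit(sql: str, limit: int) -> str:
    """Apply a row limit without parsing or modifying the user's SQL.

    The original query runs verbatim inside a subquery, so an existing
    LIMIT clause in `sql` cannot collide with the outer one.
    """
    # Strip a trailing semicolon so the query can be nested as a subquery.
    inner = sql.strip().rstrip(";")
    return f"select * from (\n{inner}\n) subquery\nlimit {limit}"


print(wrap_with_limit("select * from my_table limit 500", 10))
```

Because the user's SQL is never inspected, a query that already ends in `limit` needs no special casing.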
@davidharting Thanks for pointing this out, I added writing up a test case to the acceptance criteria
The root cause of the issue is that warehouse providers are implementing the `fetchmany` method in the DB-API 2.0 spec inefficiently by loading the entire table into memory and filtering client-side.
Perhaps nitpicking, but I think this is better characterized like this:
PEP 249 (DB-API 2.0) provides a specification for fetching resultsets in batches with `.fetchmany()`. However, it says nothing about providing an API to modify a given SQL statement's execution to make the resultset returned by the database smaller. It only defines `size` as how many rows of the resultset the client should return.
This is the standard, and it is the default way that drivers have implemented this feature. This is also how `pandas.DataFrame.head()` works.
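The distinction is easy to see with the stdlib `sqlite3` driver (a sketch for illustration, not dbt code; any PEP 249 driver behaves the same way):

```python
import sqlite3

# fetchmany(size) only limits how many rows the *client* pulls per call;
# the database still executes the full statement and produces the full
# resultset.
conn = sqlite3.connect(":memory:")
conn.execute("create table t (n integer)")
conn.executemany("insert into t values (?)", [(i,) for i in range(1000)])

cur = conn.execute("select n from t")  # no LIMIT in the SQL itself
batch = cur.fetchmany(5)               # client fetches at most 5 rows
print(len(batch))                      # 5

# The other 995 rows are still part of the resultset, waiting to be fetched:
rest = cur.fetchall()
print(len(rest))                       # 995
```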
Our docs are currently clear that `--limit` pertains to the number of rows to display:

> By default, dbt show will display the first 5 rows from the query result. This can be customized by passing the flag --limit n, where n is the number of rows to display.
Usage of `fetchmany()` with a default `size=500` does address client-side OOM issues, but does not enable any sort of "pushdown limit filtering" as #7545 was titled. There's no magic spell to make long-running queries return many rows of data faster; there is no pushdown.
The way to get pushdown is to literally push the limit down into the SQL. My gut says we modify `Adapter.execute()` to have a conditional: when `limit is not None`, add a `LIMIT` clause to the given SQL. We're lucky here imho bc adapters have largely already migrated to the new `.execute()` signature that includes the `limit` parameter.
Two challenges to this approach:

1. Not every warehouse supports a `LIMIT` clause; T-SQL platforms use `SELECT TOP N` instead.
2. `dbt show` and Cloud's preview don't work for materializations that aren't views or tables (e.g. snapshot, incremental). This makes sense behind the scenes because UPDATE/MERGE/DELETE statements don't return rows to the client. Just calling it out that if we ever want to preview these kinds of models, this new behavior would also need to be taken into consideration.

Last thing is that this smells very much like an interface issue rather than a technical one. I don't blame folks for thinking that `limit` should edit the query being run (see https://github.com/tlocke/pg8000/issues/133). When a Cloud IDE user clicks "Preview", what are they expecting to happen? Could the quickest win be to clarify that it's not our fault if the query being previewed takes a long time to run? I'd be surprised if DWH query web client IDEs do anything like modifying user-submitted SQL to make things faster for users...
@dataders
I think your feedback is sort of two-fold here:
I can't weigh in on the suggested implementation.
But for dbt CLI users, I do still see this as a problem, if not a bug. `dbt show` does not persist the results anywhere. Because of that, I think it is a reasonable expectation that `dbt show` will be efficient and not fetch more data than it knows you want to see.
@dataders @aranke @jtcohen6
> dbt show and Cloud's preview don't work for materializations that aren't views or tables (eg snapshot, incremental). This makes sense behind the scenes because UPDATE/MERGE/DELETE statements don't return rows to the client. Just calling it out that if we ever want to preview these kinds of models, that this new behavior would also need to be taken into consideration
That is in fact news to me! This would represent a breaking change to the IDE using 1.2, as the strategy in IDE 1.1 when using RPC is more like `dbt show --inline`, which allows for the previewing of incremental models. We updated the IDE strategy in 1.2 to use `dbt show -s model`, which was mostly to properly render things like `{{ this }}`, which are ironically almost exclusively relevant to incremental models!
It sounds like there are some technical justifications for why show may not be able to return rows here, but strictly in the context of the IDE i think we'd want to make sure preview works for all models.
@dataders @dave-connors-3
> dbt show and Cloud's preview don't work for materializations that aren't views or tables (eg snapshot, incremental).
I do not believe this is (or should be) the case! Have you found it to be otherwise?
Incremental models support previewing their compiled SQL (`select` statement, not wrapped with DDL/DML), using `dbt show`, just the same as view/table models. There are a few known edge cases, e.g. `_dbt_max_partition` on BigQuery `insert_overwrite` models (internal slack thread).
To @davidharting's point: I strongly believe that dbt-core/adapters should be responsible for owning the pushdown of the `limit`, to ensure that a user's `limit 10` query can return a result set of 10 rows as quickly as possible.
Update: @dave-connors-3 @davidharting will be investigating this and getting to the bottom of why incremental models don't appear to be working in the IDE for preview with v1.6.
> I strongly believe that dbt-core/adapters should be responsible for owning the pushdown of the limit, to ensure that a user's limit 10 query can return a result set of 10 rows as quickly as possible.
While I agree this definitely falls onto adapters, I do want to temper expectations here: for many queries and data warehouses, adding the limit into the query may not significantly impact the query run time. This is because many query engines don't push down limit clauses into subqueries/CTEs. The limit is therefore only applied to the result set of the query with the limit clause added. Consider:
```sql
with t2 as (
  <some expensive computation/join>
)
SELECT t1.*
FROM t1
INNER JOIN t2 ON t1.id = t2.id
LIMIT 10
```
If t2 is the expensive part of this query, then the `limit 10` will have a negligible impact on the runtime.
Not saying we shouldn't try to add the limit into the query, but doing so is going to be an incremental improvement, not a complete fix.
Housekeeping
Short description
Most adapters (check impact matrix below) load the entire model into memory while running `dbt show`. The goal of this ticket is to avoid this behavior and only load the relevant rows from the model.
Acceptance criteria
- Push the `limit` down into the query instead of using the `fetchmany` method for `dbt show`.
- Add a test case for queries with a `limit` already in the query.
- Make any needed changes to the adapter repos.
Impact to Other Teams
Adapter impact matrix
Will backports be required?
Backport to `1.6.latest`, `1.5.latest`
Context
Original bug report in Slack: https://dbt-labs.slack.com/archives/C05FWBP9X1U/p1692909690843649.
Link to similar issue that was fixed in Core: https://github.com/dbt-labs/dbt-core/issues/7390.
The root cause of the issue is that warehouse providers are implementing the `fetchmany` method in the DB-API 2.0 spec inefficiently by loading the entire table into memory and filtering client-side.
The long-term fix here is for them to become smarter in how they implement `fetchmany`, similar to what BigQuery has done: https://github.com/googleapis/google-cloud-python/issues/9185.