Open lucapazz opened 7 months ago
Hi @lucapazz, thanks so much for taking the time to document what you're seeing. That table is a work of art!
Within dbt Core and adapters, we haven't ever invested in performance testing or benchmarking. If I could wave a magic wand and introduce performance gating as part of our CI, I would in a heartbeat. This is certainly something that we could consider as a future investment.
Below I'll try to share as much context as I can, please tell me if this is helpful or where you disagree.
Jaffle Shop is exactly the project I'd reach for to do this benchmark. However, my gut instinct is that in "real-world" dbt projects in production, the majority of the time comes from the queries in the data warehouse itself. This thought it shared in recent comments a post in r/dataengineering about a project that is dbt but 30X faster.
Before we get into some of the contributors to performance degradation, I'm curious to understand how you feel the "performance degradation" of upgrading from 1.3
to 1.7
would impact your team. Most importantly, how often do you or your team run the entire project? We've certainly seen teams that run into SLA issues when project build times take longer than an hour or so to run. And would the resulting "slowdown" outweigh the benefits of the new features and bug fixes in >=1.7
? Or perhaps you're observing a performance "smell" that makes you hesitant to upgrade?
dbt-redshift
depends on dbt-core
which itself depends on many other packages. I can't speak with authority to the performance of dbt-core
or other implicit dependencies, though maybe @peterallenwebb can chime in here.
I can however, speak to the changes that have been made in dbt-redshift
over the past year or so of minor versions.
redshift-connector
from psycopg2
This is the largest contributor to performance. Specifically, lack of support in the driver for simple query cycle (https://github.com/aws/amazon-redshift-python-driver/issues/162) requires all queries to be executed in their own transaction.
The Redshift team has, as of late, been shipping incredible new Datasharing cross-database and cross-cluster querying features. This is fundamentally not possible in either:
Datasharing requires database-spanning metadata that information schema tables used traditionally within Redshift
(pg_views
, pg_tables
, etc) do not provide. So now, dbt-redshift
must now query the Redshift system tables (svv_*
) to get the same information. These metadata calls will not be as performant as single-db calls were. But Redshift tells us that Datasharing and ra3 nodes are the future of the product. They're obviously experts on their product so I'm inclined to agree with them.
Should dbt-redshift chose to both support of the following for Redshift?
With infinite engineering capacity, I'd have zero problem doing so. But alas, there is not, so we must make tradeoffs. We hope to work closely with the Redshift team to maintain the user experience
There's an argument to be made for dbt-redshift retaining support for of the below scenarios:
In this world, users in the former scenario Redshift clusters might go back to the performance characteristics of dbt 1.3, but IMO the negatives hugely outweigh this.
In this world, future changes to the Redshift adapter would have to be implemented in two different ways. This is not only more work, but it incurs the risk that it might be impossible to implement new functionality at parity, leading to weird divergences between the two ostensibly-identical adapters.
Additionally, the number of single-db/single cluster Redshift users has surely peaked. People setting up new clusters today will be using the new experience, and over time more and more existing users will switch to RA3 nodes. (Perhaps even you!) So the number of users who would benefit from this additional work are continuously decreasing, making it a poor return on time invested.
Finally, this assumes that the svv_*
tables will never get any more performant. Even though there are inherent penalties to cross-cluster queries vs the classic pg_* tables, they're not set in stone as a 50% drop.
To summarize, a better place to direct attention would be improving performance of the queries that dbt runs as part of a run. If redshift-connector exposes higher-performance ways for us to get what we need, we'd be enthusiastic to adopt them.
Hi @dataders, thank you so much for your reply.
Before answering your questions, I let you know that I took some time to do further investigation.
I repeated my performance test on the _jaffleshop project using dbt-redshift==1.3.0
and then dbt-redshift==1.7.5
, in order to compare the queries that were run on Redshift by the two versions of dbt.
I attach a zip file containing the following files:
commands output - dbt 1.3.txt
(it reports the output for each command invoked with dbt 1.3)redshift queries - dbt 1.3.sql
(it contains the SQL commands I used to query the Redshift system views in order to get the queries that were run by dbt 1.3)redshift queries - dbt 1.3.tsv
(it contains the results of the SQL commands in the previous file)commands output - dbt 1.7.txt
redshift queries - dbt 1.7.sql
redshift queries - dbt 1.7.tsv
Some consideration observing those results:
dbt run
command, the difference between the end time of the last query and the start time of the first query with dbt 1.7 is much bigger than with 1.3.dbt run
command, times between consecutive queries to build a model (e.g. create tmp table -> rename final table to backup -> rename tmp table to final name -> drop backup table) are longer with dbt 1.7.dbt run
command with dbt 1.7, queries to rename all final tables are executed only after all queries that create tmp tables have completed, even though "create table" and "alter table" queries for the same model are in the same transaction; the same thing happens between the two table renaming queries.It seems as if the execution time increase is due to less efficient query "scheduling”.
Let's put the _jaffleshop project aside and move on to our use cases...
We populate our DWH using many dbt projects running in distinct Prefect flows. Each project contains a limited number of models and tests, so the execution times of an entire project are not so long. However, in some cases, we have time constraints: for example, we have a Prefect flow that orchestrates many dbt projects and it has to be completed in an hour. Unfortunately, at the moment, execution times are more important than the new features from dbt or Redshift. Speaking of Redshift, we are using RA3 nodes.
If you are interested in having more details on our use cases, we are also available for a meeting to show you our projects :)
I tested with dbt 1.8 too. I didn't expect to get back to the execution times of version 1.3, nor to get values worse than version 1.7:
dbt-redshift | v1.3.0 | v1.7.5 | 1.8.0 |
---|---|---|---|
dbt run (empty target schema) |
23.38s | +49% | +89% |
dbt test |
26.89s | +18% | +38% |
dbt run (tables and views already present in the target schema) |
26.07s | +47% | +73% |
What can be the cause?
@lucapazz Thanks for bringing this to our attention. We have done a lot of work to understand and improve the overall performance of dbt-core in recent releases, but have not spent the same amount of time looking at the performance of specific adapters. I'll try to spend some time looking at Redshift specifically to understand where this extra time is coming from and to see if there is a way we can improve it.
Jaffle shop is not always the best project for judging performance, since it is so small and simple, but in this case the increase in reported query times suggests an underlying problem.
Is this a regression in a recent version of dbt-redshift?
Current Behavior
Using the Jaffle Shop demo project as an example, it can be observed that execution times increase significantly starting from version 1.5:
dbt run
(empty target schema)dbt test
dbt run
(tables and views already present in the target schema)1.3.0
is the version we currently use and1.7
is the version we would like to update our dbt projects to.1.5
is the version in which theredshift-connector
replacedpsycopg2
.1.7.4
is the version that uses a separate transaction for each SQL command produced by dbt.1.7.latest
is the branch that fixes the issue #693 restoring the sequence of SQL commands and the number of transactions to those of versions < 1.7Expected/Previous Behavior
Same execution times as version
1.3
.Steps To Reproduce
dbt-redshift==1.3.0
profiles.yml
file in the demo project to connect to your Redshift database (we've used 4 threads)Relevant log output
No response
Environment
Additional Context
No response