[Regression] Increased execution times with version 1.7 compared to version 1.3

lucapazz commented 7 months ago

Is this a regression in a recent version of dbt-redshift?

[X] I believe this is a regression in dbt-redshift functionality
[X] I have searched the existing issues, and I could not find an existing issue for this regression

Current Behavior

Using the Jaffle Shop demo project as an example, it can be observed that execution times increase significantly starting from version 1.5:

dbt-redshift	v1.3.0	v1.4.1	v1.5.8	v1.6.7	v1.7.4	branch 1.7.latest
`dbt run` (empty target schema)	23.38s	+4%	+60%	+54%	+49%	+49%
`dbt test`	26.89s	+0%	+18%	+18%	+20%	+18%
`dbt run` (tables and views already present in the target schema)	26.07s	+1%	+49%	+54%	+85%	+47%

1.3.0 is the version we currently use and 1.7 is the version we would like to update our dbt projects to. 1.5 is the version in which the redshift-connector replaced psycopg2. 1.7.4 is the version that uses a separate transaction for each SQL command produced by dbt. 1.7.latest is the branch that fixes the issue #693 restoring the sequence of SQL commands and the number of transactions to those of versions < 1.7

Expected/Previous Behavior

Same execution times as version 1.3.

Steps To Reproduce

Clone Jaffle Shop project
Create a Python environment and install dbt-redshift==1.3.0
Create the profiles.yml file in the demo project to connect to your Redshift database (we've used 4 threads)
Run the dbt commands from your terminal and check the execution times
Repeat the previous steps to test other versions

Relevant log output

No response

Environment

- OS: macOS Sonoma 14.2
- Python: 3.10.13
- dbt-core (working version): 1.3.7
- dbt-redshift (working version): 1.3.0
- dbt-core (regression version): 1.7.10
- dbt-redshift (regression version): 1.7.4

Additional Context

No response

dataders commented 7 months ago

Hi @lucapazz, thanks so much for taking the time to document what you're seeing. That table is a work of art!

Within dbt Core and adapters, we haven't ever invested in performance testing or benchmarking. If I could wave a magic wand and introduce performance gating as part of our CI, I would in a heartbeat. This is certainly something that we could consider as a future investment.

Below I'll try to share as much context as I can, please tell me if this is helpful or where you disagree.

end user impact on performance

Jaffle Shop is exactly the project I'd reach for to do this benchmark. However, my gut instinct is that in "real-world" dbt projects in production, the majority of the time comes from the queries in the data warehouse itself. This thought it shared in recent comments a post in r/dataengineering about a project that is dbt but 30X faster.

Before we get into some of the contributors to performance degradation, I'm curious to understand how you feel the "performance degradation" of upgrading from 1.3 to 1.7 would impact your team. Most importantly, how often do you or your team run the entire project? We've certainly seen teams that run into SLA issues when project build times take longer than an hour or so to run. And would the resulting "slowdown" outweigh the benefits of the new features and bug fixes in >=1.7? Or perhaps you're observing a performance "smell" that makes you hesitant to upgrade?

contributors to performance degradation

dbt-redshift depends on dbt-core which itself depends on many other packages. I can't speak with authority to the performance of dbt-core or other implicit dependencies, though maybe @peterallenwebb can chime in here.

I can however, speak to the changes that have been made in dbt-redshift over the past year or so of minor versions.

migrating to `redshift-connector` from `psycopg2`

This is the largest contributor to performance. Specifically, lack of support in the driver for simple query cycle (https://github.com/aws/amazon-redshift-python-driver/issues/162) requires all queries to be executed in their own transaction.

metadata views

The Redshift team has, as of late, been shipping incredible new Datasharing cross-database and cross-cluster querying features. This is fundamentally not possible in either:

Postgres (from which Redshift was originally forked), or
Redshift dc2 node provisioned clusters

Datasharing requires database-spanning metadata that information schema tables used traditionally within Redshift (pg_views, pg_tables, etc) do not provide. So now, dbt-redshift must now query the Redshift system tables (svv_*) to get the same information. These metadata calls will not be as performant as single-db calls were. But Redshift tells us that Datasharing and ra3 nodes are the future of the product. They're obviously experts on their product so I'm inclined to agree with them.

Should dbt-redshift chose to both support of the following for Redshift?

With infinite engineering capacity, I'd have zero problem doing so. But alas, there is not, so we must make tradeoffs. We hope to work closely with the Redshift team to maintain the user experience

What's next?

There's an argument to be made for dbt-redshift retaining support for of the below scenarios:

"classic" single-database, single-cluster postgres-like scenario
forward-looking cross-database, cross-cluster scenarios?

In this world, users in the former scenario Redshift clusters might go back to the performance characteristics of dbt 1.3, but IMO the negatives hugely outweigh this.

In this world, future changes to the Redshift adapter would have to be implemented in two different ways. This is not only more work, but it incurs the risk that it might be impossible to implement new functionality at parity, leading to weird divergences between the two ostensibly-identical adapters.

Additionally, the number of single-db/single cluster Redshift users has surely peaked. People setting up new clusters today will be using the new experience, and over time more and more existing users will switch to RA3 nodes. (Perhaps even you!) So the number of users who would benefit from this additional work are continuously decreasing, making it a poor return on time invested.

Finally, this assumes that the svv_* tables will never get any more performant. Even though there are inherent penalties to cross-cluster queries vs the classic pg_* tables, they're not set in stone as a 50% drop.

To summarize, a better place to direct attention would be improving performance of the queries that dbt runs as part of a run. If redshift-connector exposes higher-performance ways for us to get what we need, we'd be enthusiastic to adopt them.

lucapazz commented 7 months ago

Hi @dataders, thank you so much for your reply.

Before answering your questions, I let you know that I took some time to do further investigation. I repeated my performance test on the _jaffleshop project using dbt-redshift==1.3.0 and then dbt-redshift==1.7.5, in order to compare the queries that were run on Redshift by the two versions of dbt.

I attach a zip file containing the following files:

commands output - dbt 1.3.txt (it reports the output for each command invoked with dbt 1.3)
redshift queries - dbt 1.3.sql (it contains the SQL commands I used to query the Redshift system views in order to get the queries that were run by dbt 1.3)
redshift queries - dbt 1.3.tsv (it contains the results of the SQL commands in the previous file)
commands output - dbt 1.7.txt
redshift queries - dbt 1.7.sql
redshift queries - dbt 1.7.tsv

Some consideration observing those results:

Query execution times do not contribute to the total duration gap.
Given the dbt run command, the difference between the end time of the last query and the start time of the first query with dbt 1.7 is much bigger than with 1.3.
Given the dbt run command, times between consecutive queries to build a model (e.g. create tmp table -> rename final table to backup -> rename tmp table to final name -> drop backup table) are longer with dbt 1.7.
Given the dbt run command with dbt 1.7, queries to rename all final tables are executed only after all queries that create tmp tables have completed, even though "create table" and "alter table" queries for the same model are in the same transaction; the same thing happens between the two table renaming queries.

It seems as if the execution time increase is due to less efficient query "scheduling”.

Let's put the _jaffleshop project aside and move on to our use cases...

We populate our DWH using many dbt projects running in distinct Prefect flows. Each project contains a limited number of models and tests, so the execution times of an entire project are not so long. However, in some cases, we have time constraints: for example, we have a Prefect flow that orchestrates many dbt projects and it has to be completed in an hour. Unfortunately, at the moment, execution times are more important than the new features from dbt or Redshift. Speaking of Redshift, we are using RA3 nodes.

If you are interested in having more details on our use cases, we are also available for a meeting to show you our projects :)

lucapazz commented 4 months ago

I tested with dbt 1.8 too. I didn't expect to get back to the execution times of version 1.3, nor to get values worse than version 1.7:

dbt-redshift	v1.3.0	v1.7.5	1.8.0
`dbt run` (empty target schema)	23.38s	+49%	+89%
`dbt test`	26.89s	+18%	+38%
`dbt run` (tables and views already present in the target schema)	26.07s	+47%	+73%

What can be the cause?

peterallenwebb commented 3 months ago

@lucapazz Thanks for bringing this to our attention. We have done a lot of work to understand and improve the overall performance of dbt-core in recent releases, but have not spent the same amount of time looking at the performance of specific adapters. I'll try to spend some time looking at Redshift specifically to understand where this extra time is coming from and to see if there is a way we can improve it.

Jaffle shop is not always the best project for judging performance, since it is so small and simple, but in this case the increase in reported query times suggests an underlying problem.

dbt-labs / dbt-redshift