dbt-labs / dbt-redshift

dbt-redshift contains all of the code enabling dbt to work with Amazon Redshift
https://getdbt.com
Apache License 2.0

[Regression] Increased execution times with version 1.7 compared to version 1.3 #741

Open lucapazz opened 7 months ago

lucapazz commented 7 months ago

Is this a regression in a recent version of dbt-redshift?

Current Behavior

Using the Jaffle Shop demo project as an example, it can be observed that execution times increase significantly starting from version 1.5:

| dbt-redshift | v1.3.0 | v1.4.1 | v1.5.8 | v1.6.7 | v1.7.4 | branch 1.7.latest |
|---|---|---|---|---|---|---|
| dbt run (empty target schema) | 23.38s | +4% | +60% | +54% | +49% | +49% |
| dbt test | 26.89s | +0% | +18% | +18% | +20% | +18% |
| dbt run (tables and views already present in the target schema) | 26.07s | +1% | +49% | +54% | +85% | +47% |

1.3.0 is the version we currently use, and 1.7 is the version we would like to update our dbt projects to. 1.5 is the version in which redshift-connector replaced psycopg2. 1.7.4 is the version that uses a separate transaction for each SQL command produced by dbt. 1.7.latest is the branch that fixes issue #693, restoring the sequence of SQL commands and the number of transactions to those of versions < 1.7.

Expected/Previous Behavior

Same execution times as version 1.3.

Steps To Reproduce

  1. Clone Jaffle Shop project
  2. Create a Python environment and install dbt-redshift==1.3.0
  3. Create the profiles.yml file in the demo project to connect to your Redshift database (we've used 4 threads)
  4. Run the dbt commands from your terminal and check the execution times
  5. Repeat the previous steps to test other versions
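
For step 3, a minimal profiles.yml along these lines should work (host, user, database, and schema names below are placeholders for your own cluster; we used 4 threads):

```yaml
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: redshift
      host: my-cluster.abc123.us-east-1.redshift.amazonaws.com  # placeholder
      port: 5439
      user: awsuser                                             # placeholder
      password: "{{ env_var('REDSHIFT_PASSWORD') }}"
      dbname: analytics                                         # placeholder
      schema: dbt_benchmark                                     # placeholder
      threads: 4
```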

Relevant log output

No response

Environment

- OS: macOS Sonoma 14.2
- Python: 3.10.13
- dbt-core (working version): 1.3.7
- dbt-redshift (working version): 1.3.0
- dbt-core (regression version): 1.7.10
- dbt-redshift (regression version): 1.7.4

Additional Context

No response

dataders commented 7 months ago

Hi @lucapazz, thanks so much for taking the time to document what you're seeing. That table is a work of art!

Within dbt Core and adapters, we haven't ever invested in performance testing or benchmarking. If I could wave a magic wand and introduce performance gating as part of our CI, I would in a heartbeat. This is certainly something that we could consider as a future investment.

Below I'll try to share as much context as I can, please tell me if this is helpful or where you disagree.

end user impact on performance

Jaffle Shop is exactly the project I'd reach for to do this benchmark. However, my gut instinct is that in "real-world" dbt projects in production, the majority of the time comes from the queries in the data warehouse itself. This thought is shared in recent comments on a post in r/dataengineering about a project that is "dbt but 30X faster".

Before we get into some of the contributors to performance degradation, I'm curious to understand how you feel the "performance degradation" of upgrading from 1.3 to 1.7 would impact your team. Most importantly, how often do you or your team run the entire project? We've certainly seen teams that run into SLA issues when project build times take longer than an hour or so to run. And would the resulting "slowdown" outweigh the benefits of the new features and bug fixes in >=1.7? Or perhaps you're observing a performance "smell" that makes you hesitant to upgrade?

contributors to performance degradation

dbt-redshift depends on dbt-core which itself depends on many other packages. I can't speak with authority to the performance of dbt-core or other implicit dependencies, though maybe @peterallenwebb can chime in here.

I can, however, speak to the changes that have been made in dbt-redshift over the past year or so of minor versions.

migrating to redshift-connector from psycopg2

This is the largest contributor to the performance difference. Specifically, the driver's lack of support for the simple query cycle (https://github.com/aws/amazon-redshift-python-driver/issues/162) requires every query to be executed in its own transaction.
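
As a rough illustration (the numbers below are hypothetical, not measurements), a fixed per-transaction cost compounds linearly with the number of statements, which is why a project made of many small, fast models feels this change the most:

```python
# Back-of-envelope model of per-statement transaction overhead.
# All numbers here are hypothetical; they only illustrate how a fixed
# per-transaction round trip compounds across a run of many small statements.

def run_time(n_statements: int, query_s: float, txn_overhead_s: float) -> float:
    """Total wall time if every statement pays a fixed BEGIN/COMMIT round trip."""
    return n_statements * (query_s + txn_overhead_s)

# e.g. 200 statements of ~0.1s each:
baseline = run_time(200, 0.10, 0.00)   # statements share a transaction
per_stmt = run_time(200, 0.10, 0.05)   # +50 ms of commit/round trip each

print(f"baseline: {baseline:.0f}s, per-statement transactions: {per_stmt:.0f}s")
# The query time itself is unchanged; the fixed overhead alone adds 10s here.
```

The point is that the slowdown is mostly independent of how expensive the queries are, which matches Jaffle Shop (tiny queries) showing a larger relative regression than a warehouse-bound production project would.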

metadata views

The Redshift team has, as of late, been shipping incredible new Datasharing features for cross-database and cross-cluster querying. Supporting them fundamentally changes how dbt-redshift must fetch metadata:

Datasharing requires database-spanning metadata that the information schema tables traditionally used within Redshift (pg_views, pg_tables, etc.) do not provide. So dbt-redshift must now query the Redshift system tables (svv_*) to get the same information. These metadata calls will not be as performant as the single-db calls were. But Redshift tells us that Datasharing and RA3 nodes are the future of the product. They're obviously the experts on their product, so I'm inclined to agree with them.
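
To make the contrast concrete, here is a hypothetical sketch of the two styles of metadata lookup (these are not the exact queries dbt-redshift emits; the real ones live in the adapter's macros):

```python
# Hypothetical sketch contrasting the two metadata-lookup styles discussed
# above: a single-database pg_* catalog query vs. a Datasharing-aware svv_*
# query. NOT the literal SQL that dbt-redshift runs.

def legacy_relations_sql(schema: str) -> str:
    """Single-db lookup against the classic Postgres-style catalog."""
    return (
        "select tablename from pg_tables "
        f"where schemaname = '{schema}'"
    )

def datasharing_relations_sql(database: str, schema: str) -> str:
    """Cross-database lookup against the Redshift system views."""
    return (
        "select table_name from svv_all_tables "
        f"where database_name = '{database}' and schema_name = '{schema}'"
    )

print(legacy_relations_sql("analytics"))
print(datasharing_relations_sql("prod_db", "analytics"))
```

The pg_* catalog only sees the database you are connected to, so it can be served locally; svv_all_tables spans every database visible to the cluster, which is what Datasharing needs and also why it costs more per call.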

Should dbt-redshift choose to support both of these for Redshift — the legacy single-db metadata path and the new Datasharing path?

With infinite engineering capacity, I'd have zero problem doing so. But alas, there is not, so we must make tradeoffs. We hope to work closely with the Redshift team to maintain the user experience.

What's next?

There's an argument to be made for dbt-redshift retaining support for both of the scenarios above: classic single-db/single-cluster usage via the pg_* tables, and Datasharing usage via the svv_* system views.

In this world, users in the former scenario (single-db/single-cluster Redshift) might get back the performance characteristics of dbt 1.3, but IMO the negatives hugely outweigh this.

In this world, future changes to the Redshift adapter would have to be implemented in two different ways. This is not only more work, but it incurs the risk that it might be impossible to implement new functionality at parity, leading to weird divergences between the two ostensibly-identical adapters.

Additionally, the number of single-db/single-cluster Redshift users has surely peaked. People setting up new clusters today will be using the new experience, and over time more and more existing users will switch to RA3 nodes. (Perhaps even you!) So the number of users who would benefit from this additional work is continuously decreasing, making it a poor return on time invested.

Finally, this assumes that the svv_* tables will never get any more performant. Even though there are inherent penalties to cross-cluster queries vs the classic pg_* tables, they're not set in stone as a 50% drop.

To summarize, a better place to direct attention would be improving performance of the queries that dbt runs as part of a run. If redshift-connector exposes higher-performance ways for us to get what we need, we'd be enthusiastic to adopt them.

lucapazz commented 7 months ago

Hi @dataders, thank you so much for your reply.

Before answering your questions, I should let you know that I took some time to do further investigation. I repeated my performance test on the jaffle_shop project using dbt-redshift==1.3.0 and then dbt-redshift==1.7.5, in order to compare the queries that the two versions of dbt ran on Redshift.

I attach a zip file containing the following files:

Some considerations from observing those results:

It seems as if the execution time increase is due to less efficient query "scheduling".

Let's put the jaffle_shop project aside and move on to our use cases...

We populate our DWH using many dbt projects running in distinct Prefect flows. Each project contains a limited number of models and tests, so the execution times of an entire project are not so long. However, in some cases, we have time constraints: for example, we have a Prefect flow that orchestrates many dbt projects and it has to be completed in an hour. Unfortunately, at the moment, execution times are more important than the new features from dbt or Redshift. Speaking of Redshift, we are using RA3 nodes.

If you are interested in having more details on our use cases, we are also available for a meeting to show you our projects :)

lucapazz commented 4 months ago

I tested with dbt 1.8 too. I didn't expect to get back to the execution times of version 1.3, but nor did I expect values worse than version 1.7:

| dbt-redshift | v1.3.0 | v1.7.5 | v1.8.0 |
|---|---|---|---|
| dbt run (empty target schema) | 23.38s | +49% | +89% |
| dbt test | 26.89s | +18% | +38% |
| dbt run (tables and views already present in the target schema) | 26.07s | +47% | +73% |

What could be the cause?

peterallenwebb commented 3 months ago

@lucapazz Thanks for bringing this to our attention. We have done a lot of work to understand and improve the overall performance of dbt-core in recent releases, but have not spent the same amount of time looking at the performance of specific adapters. I'll try to spend some time looking at Redshift specifically to understand where this extra time is coming from and to see if there is a way we can improve it.

Jaffle shop is not always the best project for judging performance, since it is so small and simple, but in this case the increase in reported query times suggests an underlying problem.