dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

Deadlocks on pg_rewrite when using multiple threads #2801

Closed - sminnee closed this issue 2 years ago

sminnee commented 3 years ago

Describe the bug

Running dbt against Postgres with multiple threads intermittently fails with deadlock errors on pg_rewrite while dbt drops and rebuilds models.

Steps To Reproduce

In a dbt project with a large number of models (~90 in my case), run

dbt run --threads 10

Expected behavior

The run succeeds.

Screenshots and log output

Intermittently (1 in 24 times?), I get errors of the following form instead:

Database Error in model job_stack_mapping (models/platform/job_stack_mapping.sql)
  deadlock detected
  DETAIL:  Process 12331 waits for AccessExclusiveLock on relation 44310843 of database 16390; blocked by process 12328.
  Process 12328 waits for AccessExclusiveLock on object 44310897 of class 2618 of database 16390; blocked by process 12323.
  Process 12323 waits for AccessExclusiveLock on relation 44310843 of database 16390; blocked by process 12331.
  HINT:  See server log for query details.
  compiled SQL at target/run/ss_dbt/models/platform/job_stack_mapping.sql

Class 2618 is the pg_catalog.pg_rewrite catalog, the table that stores view rewrite rules.
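
For reference, the OIDs in the DETAIL lines can be decoded against the same database using the standard Postgres catalogs; a rough sketch (the numeric OIDs are the ones from my log and will differ in other runs):

  -- Decode the deadlock DETAIL (run against the same database as the failing dbt run).
  SELECT 2618::regclass;        -- the "class" OID; resolves to pg_rewrite
  SELECT 44310843::regclass;    -- the "relation" OID; the table/view the drop is waiting on

  -- For "object 44310897 of class 2618", look up the rewrite rule and the view it belongs to.
  SELECT rulename, ev_class::regclass AS view_name
  FROM pg_rewrite
  WHERE oid = 44310897;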

The output of dbt --version:

installed version: 0.18.0
   latest version: 0.18.0

Up to date!

Plugins:
  - bigquery: 0.18.0
  - snowflake: 0.18.0
  - redshift: 0.18.0
  - postgres: 0.18.0

The operating system you're using: CircleCI docker worker

The output of python --version: CircleCI container circleci/python:3.6.5

jtcohen6 commented 3 years ago

Hey @sminnee, thanks for the writeup.

To understand specifically what's happening here, we'd need to know the exact SQL that dbt is running just before the lock is detected. In the ~4% of runs with deadlocks, do you know what query dbt has just run? Is it always the same query? If not, could you check the logs the next time this happens?

dbt already does quite a bit of work to understand Postgres dependencies and the impact of drop ... cascade. If you're running any DML or drop statements outside of dbt materializations (e.g. in hooks), that could be the culprit; I don't know of a code change we would make in that case.

Otherwise, it's plausible that running with more threads on postgres increases the risk of locks. Do you ever see deadlocks when you run with fewer threads, e.g. 5 instead of 10?
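
As a side note, while a run is in flight you can see which sessions are blocked and by whom with a plain Postgres query (nothing dbt-specific); a minimal sketch, assuming Postgres 9.6 or later:

  -- Sessions that are currently waiting, and the PIDs blocking them.
  SELECT pid,
         pg_blocking_pids(pid) AS blocked_by,
         state,
         query
  FROM pg_stat_activity
  WHERE cardinality(pg_blocking_pids(pid)) > 0;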

sminnee commented 3 years ago

Thanks for the response! Some technical debt makes it a bit awkward to get access to my logs right now, but I'll come back to this ticket when I have the log data.

I lowered threads to 5 and the issue was less frequent but still occurs. I'm going to try lowering to 3, then 2, to see when the problem goes away.

j-hulbert commented 3 years ago

I had a similar issue (dbt 0.18.1) on Postgres. Two threads deadlocked on drop ... cascade where both tables being dropped had the same views dependent on them. The failure was intermittent because the two transactions didn't always overlap. Converting those dependent views to be materialized as tables seems to have fixed the issue. Maybe dbt could retry the transaction with backoff when this occurs?
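
For anyone who wants to try the same workaround, the change is just the materialization config on the shared downstream model. A minimal sketch with placeholder model names (config() and ref() are standard dbt):

  -- models/shared_downstream.sql (hypothetical file); previously materialized as a view.
  -- Building it as a table removes the view rewrite rule that both parallel
  -- "drop ... cascade" statements end up competing to lock.
  {{ config(materialized='table') }}

  select *
  from {{ ref('upstream_model_a') }}
  join {{ ref('upstream_model_b') }} using (id)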

github-actions[bot] commented 2 years ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

sheldonkhall commented 9 months ago

I have started to see this issue in the last month. As @j-hulbert describes, dbt runs a drop ... cascade on tables with __dbt_backup appended as part of the run command. My specific situation can be seen in the screenshot below. At random, my Postgres log reports a deadlock:

DETAIL: Process 9906 waits for AccessExclusiveLock on object 422131525 of class 2618 of database 16404; blocked by process 9902.
Process 9902 waits for AccessExclusiveLock on relation 422131522 of database 16404; blocked by process 9906.
Process 9906: /* {"app": "dbt", "dbt_version": "1.7.3", "profile_name": "data_models", "target_name": "prod", "node_id": "model.data_models.accounts_view"} */
drop table if exists "analyticsdb"."mvs__CANDIDATE__3f210b4"."accounts_view__dbt_backup" cascade
Process 9902: /* {"app": "dbt", "dbt_version": "1.7.3", "profile_name": "data_models", "target_name": "prod", "node_id": "model.data_models.ref_institutional_accounts"} */
drop table if exists "analyticsdb"."mvs__CANDIDATE__3f210b4"."ref_institutional_accounts__dbt_backup" cascade

It seems that even though dbt captures the DAG, and therefore could know there is a shared downstream dependency, it still drop ... cascades two tables that share that dependency in parallel. Because of the parallelism there is a chance of a deadlock, and I see this in practice regularly enough to be a problem.
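
The shared dependents can also be confirmed from the Postgres side; a sketch against the standard catalogs, using the backup relation name from the log above as a placeholder:

  -- Views that a "drop table ... cascade" on the given table would also drop,
  -- i.e. the downstream objects both dbt threads end up locking.
  SELECT DISTINCT dependent.relname AS dependent_view
  FROM pg_depend d
  JOIN pg_rewrite r       ON r.oid = d.objid
  JOIN pg_class dependent ON dependent.oid = r.ev_class
  JOIN pg_class source    ON source.oid = d.refobjid
  WHERE d.classid = 'pg_rewrite'::regclass
    AND d.refclassid = 'pg_class'::regclass
    AND source.relname = 'accounts_view__dbt_backup';  -- name from the log above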

[Screenshot attached: 2023-12-07 at 12:05:15]