dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.59k stars 1.59k forks source link

[Bug] dbt-postgres transaction management: extraneous `BEGIN` when listing schemas #9434

Open Michael-cd30 opened 7 months ago

Michael-cd30 commented 7 months ago

Is this a new bug in dbt-core?

Current Behavior

"dbt run" on postgres tries to open transactions while another is already in progress. As a result, PostgreSQL displays warnings.

A run on a single model raises 24 warnings.

Expected Behavior

"dbt run" should not try to run neested transactions on PostgreSQL.

Steps To Reproduce

Relevant log output

2024-01-24 12:09:27 2024-01-24 11:09:27.210 UTC [8955] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.224 UTC [8956] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.224 UTC [8957] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.225 UTC [8954] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.268 UTC [8958] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.282 UTC [8961] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.300 UTC [8959] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.300 UTC [8960] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.327 UTC [8963] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.331 UTC [8962] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.351 UTC [8965] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.362 UTC [8964] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.374 UTC [8966] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.393 UTC [8967] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.448 UTC [8970] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.449 UTC [8968] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.449 UTC [8969] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.499 UTC [8971] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.501 UTC [8972] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.502 UTC [8973] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.549 UTC [8974] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.570 UTC [8974] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.654 UTC [8975] WARNING:  there is already a transaction in progress
2024-01-24 12:09:27 2024-01-24 11:09:27.767 UTC [8976] WARNING:  there is already a transaction in progress

Environment

- OS: docker / debian
- Python: 3.11
- dbt: 1.7.4
- dbt-postgres: 1.7.4

Which database adapter are you using with dbt?

postgres

Additional Context

No response

dbeatty10 commented 7 months ago

Thanks for reaching out @Michael-cd30 !

I can see why you'd want to avoid extraneous warning messages in your PG logs. However, unless it it causing errors while building data sets in dbt this is unlikely to be a priority for us. Accordingly, I'm going to label this as "help wanted" for a motivated community member to solve.

alison985 commented 7 months ago

@Michael-cd30 How many threads do you have your dbt project configured for? If you configured dbt for 24 threads but there's a limit on the db user to 1 thread, I could see this scenario happening.

Michael-cd30 commented 7 months ago

Hi @alison985

The used profile (targetdev in screenshot) is configured to run 4 threads. And, for the test I've run only one model so I don't think the problem deals with concurrency.

image

alison985 commented 7 months ago

@Michael-cd30 Technically a database user account could be running queries outside of dbt simultaneously. Also, technically, in some places you can limit total concurrent threads on the database regardless of user. However, I assume you already accounted for those.

I also agree 1 model creation, with no dependencies or tests, having 24 warning messages is strange. It would be great to see

I would have expected to start seeing an exponential back-off timestamps for something that's trying to get a thread. Other things that stand out to me as differences from my normal work(and therefore may be related) are:

Any chance that model has sub-queries or CTEs in it? Or a loop or macro it's calling?

Though honestly, at this point I'd personally go to the database server logs instead of the dbt logs. Figure out what is happening there and then figure out what dbt must be doing to cause that message.

Michael-cd30 commented 7 months ago

@alison985 - We first saw the problem on our servers : a dedicated PostgreSQL server + a Docker server from which Apache Airflow runs DBT. Both are Linux.

In order to write this issue, I've tested in a simpler context, directly from my Windows workstation. In this simple case, the PostgreSQL service runs locally, so I'm sure that no other command is running in parallel.

The query uses CTE as recommended, but remains very simple and loop-free.

The dbt-postgres connector is maintained by dbt-labs and the code is in the same repository as the core, which probably explains why the version is identical, see : https://github.com/dbt-labs/dbt-core/tree/main/plugins/postgres/dbt/adapters/postgres

I've also looked at PostgreSQL logs activating log_statement = 'all' in order to see commands received by the server. It seems that DBT starts by analyzing the contents of all schemas. This analysis starts with two "BEGIN" for each schema. One would be enough and, because the analysis is a simple SELECT, it doesn't even seem useful to open transactions.

Here's an extract:

2024-02-01 23:21:11.910 CET [25336] LOG: instruction : BEGIN 2024-02-01 23:21:11.910 CET [25336] LOG: instruction : BEGIN 2024-02-01 23:21:11.910 CET [25336] ATTENTION: une transaction est déjà en cours 2024-02-01 23:21:11.912 CET [25336] LOG: instruction : / {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "taradata", "target_name": "localdev", "connection_name": "list_taradata_wrh_administratif"} / select 'taradata' as database, tablename as name, schemaname as schema, 'table' as type from pg_tables where schemaname ilike 'wrh_administratif' union all select 'taradata' as database, viewname as name, schemaname as schema, 'view' as type from pg_views where schemaname ilike 'wrh_administratif' union all select 'taradata' as database, matviewname as name, schemaname as schema, 'materialized_view' as type from pg_matviews where schemaname ilike 'wrh_administratif'

2024-02-01 23:21:11.917 CET [9672] LOG: instruction : BEGIN 2024-02-01 23:21:11.917 CET [9672] LOG: instruction : BEGIN 2024-02-01 23:21:11.917 CET [9672] ATTENTION: une transaction est déjà en cours 2024-02-01 23:21:11.918 CET [23792] LOG: instruction : BEGIN 2024-02-01 23:21:11.919 CET [23792] LOG: instruction : BEGIN 2024-02-01 23:21:11.919 CET [23792] ATTENTION: une transaction est déjà en cours 2024-02-01 23:21:11.919 CET [9672] LOG: instruction : / {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "taradata", "target_name": "localdev", "connection_name": "list_taradata_stg_data_laregion_fr"} / select 'taradata' as database, tablename as name, schemaname as schema, 'table' as type from pg_tables where schemaname ilike 'stg_data_laregion_fr' union all select 'taradata' as database, viewname as name, schemaname as schema, 'view' as type from pg_views where schemaname ilike 'stg_data_laregion_fr' union all select 'taradata' as database, matviewname as name, schemaname as schema, 'materialized_view' as type from pg_matviews where schemaname ilike 'stg_data_laregion_fr'

2024-02-01 23:21:11.919 CET [22488] LOG: instruction : BEGIN 2024-02-01 23:21:11.919 CET [22488] LOG: instruction : BEGIN

and so on...

I haven't tested but I'm quite sure the more "schema" my dbt-project will have, the more warning I'll receive.

Regards,

manticoreroko commented 7 months ago

We have a similar problem in greenplum during dbt execution. WARNING ERROR_TEXT: there is already a transaction in progress. Example dbt execute: dbt build --models folder.chain_folder.*

dbeatty10 commented 7 months ago

Thanks for that info @Michael-cd30 and @manticoreroko !

We're unable to be prioritize this ourselves right now, but we'd be willing to give a look at a PR submitted by a community member.

Here's a couple places to start looking:

Theoretically, the code is already supposed to check if there's already a known transaction that is open or not, but it must not be working somehow.

I'd suggest setting threads: 1 within profiles.yml just to be sure multiple threads aren't somehow related.

Michael-cd30 commented 7 months ago

Same with 1 thread.

image

thisisandreeeee commented 4 months ago

@Michael-cd30 were you able to find a fix for this?