dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[CT-858] [Enhancement] Connection is always closed after each query #5489

Open · joshuataylor opened this issue 2 years ago

joshuataylor commented 2 years ago

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Every time a query is executed, the connection is then closed. This happens regardless of thread count (tested from 1 up to 16 threads).

When using dbt-snowflake, this forces a re-login every time a query is issued. With hundreds of models, that causes a massive slowdown, because authentication to Snowflake is slow, especially over long distances (Perth, AU -> US East 1 is ~250 ms round trip, for example). At that latency, a few hundred re-logins add minutes of pure network wait, since each login takes several round trips.

I have logged an issue on dbt-snowflake here - https://github.com/dbt-labs/dbt-snowflake/issues/201 - but I believe this belongs at the dbt-core level.

Expected Behavior

A single connection is made. It would be even better if the Snowflake login request were made in a single thread and then reused by all of them; that would also fix MFA, I think, but it is out of scope here.

Steps To Reproduce

  1. Run dbt-snowflake
  2. Set the thread count to, say, 2
  3. Have 4 queries and execute them all
  4. Run this query:
    select *
    from table(information_schema.login_history_by_user())
    order by event_timestamp desc;

You can see it in the logs as well:

10:24:59.898232 [debug] [ThreadPool]: Opening a new connection, currently in state closed

Relevant log output

10:24:59.898232 [debug] [ThreadPool]: Opening a new connection, currently in state closed

Environment

- OS: Linux, Mac
- Python: 3.10.4/3.9
- dbt: 1.1.1

What database are you using dbt with?

snowflake

Additional Context

No response

jtcohen6 commented 2 years ago

Thanks for opening here as well @joshuataylor!

I think you're right, the behavior going on here is defined in dbt-core, and could be relevant to other adapters as well. It does seem to yield the trickiest complications on Snowflake.

Today, during a dbt run, dbt opens separate connections for:

  • the caching (metadata) queries it runs at the start of dbt run (connection named 'master')
  • each model it compiles/runs

It then closes each connection as it completes.
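
In rough pseudocode, the per-node lifecycle looks like this (a minimal sketch with simplified, hypothetical names; the real logic lives in the context managers linked under "Relevant code" below):

    # Minimal sketch of today's per-node connection lifecycle.
    # `acquire_connection` / `release_connection` are illustrative names.
    from contextlib import contextmanager

    @contextmanager
    def connection_named(adapter, name):
        adapter.acquire_connection(name)   # open a connection for this node
        try:
            yield
        finally:
            # "Releasing" currently closes the handle outright, so the
            # next node pays the full (slow) login cost again.
            adapter.release_connection()

    # Each compiled/run node then does, roughly:
    #   with connection_named(adapter, "model.my_project.my_model"):
    #       adapter.execute(compiled_sql)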

Risks to be wary of here:

So the goal here would be to reuse / recycle more connections while dbt is running, while still guaranteeing that at the end of a run, dbt always closes all its connections. (In an ideal case, we'd also handle authentication in a single thread, and be able to reuse that auth across multiple threads — but that feels like it might be out of scope for this effort.)
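
To make that concrete, here is a hedged sketch (illustrative names only, not dbt-core's API) of a thread-safe pool that reuses idle connections but still closes everything at the end of the run:

    import queue
    import threading

    class ConnectionPool:
        """Sketch only: reuse idle connections, close all at end of run."""

        def __init__(self, open_fn, max_idle=8):
            self._open = open_fn                 # performs the (slow) login
            self._idle = queue.LifoQueue(max_idle)
            self._lock = threading.Lock()
            self._all = []                       # every connection ever opened

        def acquire(self):
            try:
                return self._idle.get_nowait()   # reuse a warm connection
            except queue.Empty:
                conn = self._open()              # login cost paid only on a miss
                with self._lock:
                    self._all.append(conn)
                return conn

        def release(self, conn):
            try:
                self._idle.put_nowait(conn)      # keep it warm for the next node
            except queue.Full:
                conn.close()

        def close_all(self):
            with self._lock:                     # end-of-run guarantee
                for conn in self._all:
                    conn.close()
                self._all.clear()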

At its very simplest, the idea would be:

I think this is likely to be a big lift, requiring a deep dive into some internals of dbt's adapter and execution classes that we haven't touched in some time. I'm not sure when we'll be able to prioritize it. I agree that it feels important.

Relevant code

The context managers. The default behavior is to release a connection as soon as it's done being used.

https://github.com/dbt-labs/dbt-core/blob/e03d35a9fc606f5b1fc7e94c0bc04dbe88f558cf/core/dbt/adapters/base/impl.py#L209-L226

Called once for each node that compiles/runs:

https://github.com/dbt-labs/dbt-core/blob/e03d35a9fc606f5b1fc7e94c0bc04dbe88f558cf/core/dbt/task/base.py#L312

"Releasing" a connection, which actually means closing it:

https://github.com/dbt-labs/dbt-core/blob/e03d35a9fc606f5b1fc7e94c0bc04dbe88f558cf/core/dbt/adapters/base/impl.py#L182-L189

https://github.com/dbt-labs/dbt-core/blob/e03d35a9fc606f5b1fc7e94c0bc04dbe88f558cf/core/dbt/adapters/base/connections.py#L182-L195

For a connection with conn_name, check whether an existing connection by that name already exists; otherwise, open a new one:

https://github.com/dbt-labs/dbt-core/blob/e03d35a9fc606f5b1fc7e94c0bc04dbe88f558cf/core/dbt/adapters/base/connections.py#L123-L160
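
In spirit, that lookup does the following (simplified sketch with hypothetical structures; the real version is keyed per thread):

    def get_or_open(connections, conn_name, open_fn):
        # Reuse an open connection with this name if one exists,
        # otherwise open (and record) a new one.
        conn = connections.get(conn_name)
        if conn is not None and conn.state == "open":
            return conn
        conn = open_fn(conn_name)
        connections[conn_name] = conn
        return conn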

Cleanup that happens at the very end of runnable tasks:

https://github.com/dbt-labs/dbt-core/blob/e03d35a9fc606f5b1fc7e94c0bc04dbe88f558cf/core/dbt/task/runnable.py#L438-L439

joshuataylor commented 2 years ago

I'll have a dig through and see if I can find an elegant solution that hopefully (🤞) won't impact connections for adapters such as Spark.

As an alternative, in dbt-snowflake we could check whether the connection is closed but the token is still valid, and if so reuse the connection, since the token should still be set on the handle.

As another alternative, if we could solve this at the dbt-snowflake level in the interim, it would be a big speed win: we could store the token on the connection once logged in. This would involve updating the connection contract, though, perhaps by adding a metadata key (or similar) that connectors could use. Roughly what I have in mind is sketched below.
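
All names here are hypothetical - CachedToken, connect_fn, and session_token are not part of the current contract - but the shape would be something like:

    import time
    from dataclasses import dataclass
    from typing import Any, Callable, Optional, Tuple

    @dataclass
    class CachedToken:
        value: str
        expires_at: float                        # epoch seconds; expiry assumed

        def valid(self) -> bool:
            return time.time() < self.expires_at

    def open_handle(connect_fn: Callable[..., Any],
                    cached: Optional[CachedToken]) -> Tuple[Any, CachedToken]:
        # Fast path: a still-valid token lets us skip the slow login round trips.
        if cached is not None and cached.valid():
            return connect_fn(token=cached.value), cached
        handle = connect_fn()                    # full (slow) login
        # `session_token` is an assumed attribute on the driver handle.
        cached = CachedToken(value=handle.session_token,
                             expires_at=time.time() + 3600)
        return handle, cached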

jtcohen6 commented 2 years ago

@joshuataylor Ah - so, rather than reusing connections, just reusing the result of authentication?

I don't have a good sense of whether there are any potential security risks with that approach. If it works, though, and substantially speeds up the process of opening new connections, then our current dbt-core approach (treating connections as a commodity) might pass muster for the foreseeable future.

I'll reopen the dbt-snowflake issue with that scope in mind, since the changes would be specific to that codebase.

joshuataylor commented 2 years ago

Yes, for now, if we can just reuse the token between requests, that should be fine.

We still need to make an HTTP request to Snowflake anyway, but using a keep-alive connection would be faster, since we wouldn't have to handshake again. We can leave that for later; there's an illustration below.
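
Not dbt code, just to illustrate the keep-alive point: a requests.Session pools connections per host, so only the first request pays the TCP/TLS handshake.

    import requests

    with requests.Session() as session:
        for _ in range(3):
            # After the first request, the same kept-alive socket is
            # reused, so no new TCP/TLS handshake is needed.
            session.get("https://example.com/")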

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 1 year ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

jmasonlee commented 1 year ago

We are running into this issue with dbt on Redshift, and it makes building models that are separated out for modularity annoying, since each model has to open its own connection to Redshift when it runs. If I have 5 separate models that each need a connection, that adds a significant time cost compared to a single model that needs only one.

I'm wondering if there are any plans to prioritize this?

joshuataylor commented 1 year ago

Could https://github.com/dbt-labs/dbt-snowflake/pull/428 be used? It has been working out great.

I also have a local dev version that does a few tricks to cache etc, but my hacks should only be used in development :).

ChiQuang98 commented 1 year ago

Hi @jtcohen6, I'm a little confused about this part:

Today, during a dbt run, dbt opens separate connections for:

  • the caching (metadata) queries it runs at the start of dbt run (connection named 'master')
  • each model it compiles/runs

Could you please help me confirm this? If I use dbt run to execute a model that refers to other objects, does it still consume only one connection, even when it retrieves data from those other objects? Thank you in advance.

denised commented 4 months ago

Adding a "we're still interested" note to this issue. Right now we are more or less sidelined by the inability to reuse MFA results across multiple queries in dbt.