Open joshuataylor opened 2 years ago
Thanks for opening here as well @joshuataylor!
I think you're right: the behavior going on here is defined in dbt-core, and could be relevant to other adapters as well. It does seem to yield the trickiest complications on Snowflake.
Today, during a `dbt run`, dbt opens separate connections for:
- the caching (metadata) queries it runs at the start of `dbt run` (connection named `'master'`)
- each model it compiles/runs

It then closes each connection as it completes.
Risks to be wary of here:
- If a `dbt run` completes and leaves a connection open, on data warehouses that support "auto-resume" compute, it can have real $$ implications: https://github.com/dbt-labs/dbt-spark/issues/280, https://github.com/dbt-labs/dbt-spark/pull/285

So the goal here would be to reuse / recycle more connections while dbt is running, while still guaranteeing that at the end of a run, dbt always closes all its connections. (In an ideal case, we'd also handle authentication in a single thread, and be able to reuse that auth across multiple threads, but that feels like it might be out of scope for this effort.)
At its very simplest, the idea would be:
- `set_connection_name` should try to grab a "done" connection from the pool, rename it, and use it

I think this is likely to be a big lift, requiring a deep dive into some internals of dbt's adapter and execution classes that we haven't touched in some time. I'm not sure when we'll be able to prioritize it. I agree that it feels important.
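A minimal sketch of that idea (all names here are hypothetical; dbt's real connection manager differs in detail): when a thread asks for a named connection, first try to recycle a connection whose node has finished, instead of opening a fresh handle and paying the login cost again.

```python
import threading


class ConnectionPool:
    """Toy model of recycling named connections instead of reopening them.

    `open_handle` stands in for the expensive login/handshake step.
    This is an illustrative sketch, not dbt's actual implementation.
    """

    def __init__(self, open_handle):
        self._open_handle = open_handle  # callable performing the real login
        self._lock = threading.Lock()
        self._in_use = {}   # name -> handle
        self._done = []     # handles whose node has finished

    def set_connection_name(self, name):
        with self._lock:
            if name in self._in_use:
                return self._in_use[name]
            if self._done:
                # Recycle: grab a finished connection and "rename" it
                handle = self._done.pop()
            else:
                handle = self._open_handle()
            self._in_use[name] = handle
            return handle

    def release(self, name):
        # Instead of closing, park the handle for reuse
        with self._lock:
            handle = self._in_use.pop(name, None)
            if handle is not None:
                self._done.append(handle)

    def cleanup_all(self):
        # At the very end of a run, really close everything
        with self._lock:
            handles = list(self._in_use.values()) + self._done
            self._in_use.clear()
            self._done.clear()
        return handles  # caller closes each one
```

With this shape, a run over N models performs at most max-concurrency logins rather than N, while `cleanup_all` still guarantees everything is closed at the end.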
The relevant pieces of dbt-core code:
- The context managers. The default behavior is to release a connection as soon as it's done being used.
- The method called once for each node that compiles/runs.
- "Releasing" a connection, which actually means closing it.
- For a connection with `conn_name`, the check for an existing connection by that name, otherwise opening a new one.
- The cleanup that happens at the very end of runnable tasks.
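The default lifecycle described above (release as soon as a node is done, where "release" means "close") can be pictured as a context manager. This is an illustrative sketch with made-up callables, not dbt's actual code:

```python
from contextlib import contextmanager


@contextmanager
def connection_named(open_conn, close_conn, name):
    """Open (or look up) a connection for `name`, and release it on exit.

    Today "release" effectively means "close", which is why each
    compiled/run node pays the login cost again.
    """
    conn = open_conn(name)
    try:
        yield conn
    finally:
        close_conn(conn)  # released -> closed as soon as the node is done
```

Each node that compiles/runs executes inside a scope like this, so hundreds of models means hundreds of open/close cycles under the current behavior.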
I'll have a dig through and see if I can find an elegant solution that hopefully (🤞) won't impact adapters such as Spark.
As an alternative, in dbt-snowflake we could also check whether the connection is closed but the token is still valid, and if so reuse the connection, as it should still be set on the handle.
As another alternative, if we could solve this at the dbt-snowflake level in the interim, it would be a big speed win. We could capture the token at login and attach it to the connection. This would involve updating the connection contract, though; maybe adding a metadata or other key that adapters could use?
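One way to picture that interim idea (all names hypothetical; the real connection contract lives in dbt-core): cache the token from the first successful login, and hand it to subsequent connection opens so they skip the slow authentication round trip until the token expires.

```python
import threading
import time


class TokenCache:
    """Cache an auth token across connection opens (illustrative only).

    `authenticate` stands in for the expensive MFA/SSO login; the TTL is
    an assumed placeholder, not Snowflake's actual token lifetime.
    """

    def __init__(self, authenticate, ttl_seconds=3600):
        self._authenticate = authenticate
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._token = None
        self._expires_at = 0.0

    def get_token(self):
        with self._lock:
            now = time.monotonic()
            if self._token is None or now >= self._expires_at:
                # Only the first caller (or the first after expiry) pays
                # the full authentication cost
                self._token = self._authenticate()
                self._expires_at = now + self._ttl
            return self._token
```

New connections would then be opened with the cached token instead of re-authenticating; whether that is safe depends on the warehouse's token semantics, which is exactly the security question raised above.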
@joshuataylor Ah - so, rather than reusing connections, just reusing the result of authentication?
I don't have a good sense of whether there are any potential security risks with taking that approach. If it works, though, and substantially speeds up the process of opening new connections, then our current dbt-core approach (treating connections as a commodity) might pass muster for the foreseeable future.
I'll reopen the dbt-snowflake issue with that scope in mind, since the changes would be specific to that codebase.
Yes, for now if we can just reuse the token between requests, that should be fine.
We still need to make an HTTP request to Snowflake anyway, but using a keep-alive connection would be faster, as we wouldn't have to handshake again. But we can leave that for later.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
We are running into this issue with dbt on Redshift, and it makes building models that are separated out for modularity annoying, as each model needs to create a connection to Redshift when it runs. If I have 5 separate models that each need a connection, this adds a significant time cost compared to a single model that needs only one.
I'm wondering if there are any plans to prioritize this?
Could https://github.com/dbt-labs/dbt-snowflake/pull/428 be used? This has been working out great
I also have a local dev version that does a few tricks to cache etc, but my hacks should only be used in development :).
Hi @jtcohen6, I'm a little confused about this part:
> Today, during a `dbt run`, dbt opens separate connections for:
> - the caching (metadata) queries it runs at the start of `dbt run` (connection named `'master'`)
> - each model it compiles/runs
Could you please help me confirm this? If I use `dbt run` to execute a model that refers to other objects, does it still consume only one connection, even when it retrieves data from those other objects? Thank you in advance.
Adding a "We're still interested" note to this item. Right now we are more or less sidelined by the inability to reuse MFA results across multiple queries for dbt.
Is there an existing issue for this?
Current Behavior
Every time a query is executed, the connection is then closed. This occurs with 1->XX threads, tested up to 16.
When using dbt-snowflake, this forces you to re-login every time you issue a query. With hundreds of models, this causes a massive slowdown, because authentication to Snowflake is slow, especially when you are a long distance from the server (Perth, AU -> US East 1 is 250ms, for example), so having to reconnect on every query is unpleasant.
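To put rough numbers on that (illustrative arithmetic only; the round-trip count per login is an assumption, and only the 250 ms figure comes from the report above):

```python
# Back-of-envelope cost of reconnecting per model vs. reusing one connection.
RTT_SECONDS = 0.250      # Perth, AU -> US East 1, as quoted above
AUTH_ROUND_TRIPS = 4     # assumed handshake + login exchanges per connection
MODELS = 300             # a "hundreds of models" project

per_connect = RTT_SECONDS * AUTH_ROUND_TRIPS  # seconds per fresh login
reconnect_overhead = per_connect * MODELS     # login once per model
reuse_overhead = per_connect                  # login once, then reuse

print(f"reconnect per model: {reconnect_overhead:.0f}s of pure latency")
print(f"single reused connection: {reuse_overhead:.0f}s of pure latency")
```

Even under these assumed numbers, per-model reconnects add minutes of pure network latency to a run that a reused connection would pay for once.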
I have logged an issue on dbt-snowflake here: https://github.com/dbt-labs/dbt-snowflake/issues/201, but I believe this belongs at the dbt-core level.
Expected Behavior
A single connection is reused. It would be even better if the login request to Snowflake were made in a single thread and then reused for all threads. That would also fix MFA, I think? But that is out of scope.
Steps To Reproduce
You can see it in the logs as well:
Relevant log output
Environment
What database are you using dbt with?
snowflake
Additional Context
No response