dataders opened this issue 1 year ago
Hi @dataders,

Thanks for opening this issue; I appreciate you taking the time to do so. As discussed in dbt-labs/dbt-core#39, moving to support the simple query protocol requires a non-trivial amount of work on the redshift-connector side, and carries some pitfalls related to the text transfer protocol. Given that, I will raise this request with our Redshift driver team and let you know if it can be prioritized/added to our roadmap, as we are currently prioritizing other features.
In the meantime, my best recommendation would be to use a SQL parsing utility like sqlparse to split multi-statement queries into individual statements that can be executed by redshift-connector, to help immediately relieve the pain caused by redshift-connector's sole use of the extended query protocol.
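To illustrate the idea, here is a minimal sketch of statement splitting using a naive stdlib-only function as a stand-in for `sqlparse.split` (the function name is made up; unlike sqlparse, this will mis-split on semicolons inside string literals or comments):

```python
def naive_split(sql: str) -> list[str]:
    # Naive stand-in for sqlparse.split: breaks a multi-statement string
    # into individual statements on ";". sqlparse handles edge cases
    # (semicolons in literals, comments) that this does not.
    return [s.strip() + ";" for s in sql.split(";") if s.strip()]

stmts = naive_split("CREATE TEMP TABLE t1 (a int); SELECT a FROM t1;")
print(stmts)  # ['CREATE TEMP TABLE t1 (a int);', 'SELECT a FROM t1;']
```

Each element of the resulting list can then be passed to `cursor.execute()` one at a time.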
I apologize that we haven't been able to incorporate this functionality into the project earlier; it's something I've always wanted to see come to fruition, since multi-statement execution has been a long-running ask from folks on GitHub. 😢
I will provide an update here after discussing with the team! :)
Hey Brooke and OP,
I'm glad to find this discussion here. I'm switching to redshift_connector for similar reasons as described (IAM auth), and my main concern with multi-statement execution not being possible (and sqlparse being the workaround) is the case of a longer set of SQL commands that generates temporary tables along the way. For instance, in this oversimplified example:
```python
query_tmp = """
CREATE TEMP TABLE t1 AS ( SELECT a, b FROM table1 );
SELECT a FROM t1;
"""
```
My understanding is that sqlparse will split this into two statements and execute the first, but the second will then fail because it doesn't see the temporary table. Similarly, submitting this to redshift_connector directly will lead to the 'cannot insert multiple commands into a prepared statement' error.
Am I missing something here? Is there a way to get this to run with some combination of sqlparse and redshift_connector, without refactoring the query to use non-temporary tables (and drop them at the end)? That could be a time-consuming refactor depending on the size of the codebase.
Appreciate all the work on this!
Hey @merubhanot ,
Here's an example of how sqlparse could be used. Please note I expanded your example a bit to include creation of table1 and insertion of some testing data so it can be run to completion.
This is possible because a temporary table remains in scope for the duration of your session. You can read a bit more about this in the official AWS docs for CREATE TABLE, but the short of it is that these split-up SQL statements are executed in the same session:
> Keyword that creates a temporary table that is visible only within the current session. The table is automatically dropped at the end of the session in which it is created. The temporary table can have the same name as a permanent table. The temporary table is created in a separate, session-specific schema. (You can't specify a name for this schema.) This temporary schema becomes the first schema in the search path, so the temporary table takes precedence over the permanent table unless you qualify the table name with the schema name to access the permanent table. For more information about schemas and precedence, see search_path.
```python
import sqlparse
import redshift_connector

query_tmp = sqlparse.split("""
CREATE TEMP TABLE table1 (a int, b int);
INSERT INTO table1 (a, b) VALUES (1, 2);
CREATE TEMP TABLE t1 AS (
    SELECT a, b
    FROM table1
);
SELECT a FROM t1;
""")

with redshift_connector.connect(...) as conn:
    with conn.cursor() as cursor:
        for stmt in query_tmp:
            cursor.execute(stmt)
        print(cursor.fetchall())
```

Output: `([1],)`
### ask
related to: dbt-labs/dbt-core#39 dbt-labs/dbt-core#46 dbt-labs/dbt-core#82
allow for dbt (and other consumers/users) to make use of this driver in a simple query flow manner. two specific places that introduce friction are:
- `;`-delimited SQL statements cannot be sent to `cursor.execute()`

I love @Brooke-white's summary in https://github.com/aws/amazon-redshift-python-driver/issues/39#issuecomment-915393795. I've been looking for this terminology all week!
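This single-statement restriction is not unique to redshift-connector. A quick illustration using the stdlib `sqlite3` module as a stand-in DB-API driver (redshift_connector raises its own 'cannot insert multiple commands into a prepared statement' error instead):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
try:
    # two ;-delimited statements in a single execute() call
    cur.execute("CREATE TABLE t (a int); INSERT INTO t VALUES (1);")
except (sqlite3.Warning, sqlite3.ProgrammingError) as exc:
    # sqlite3 rejects this with "You can only execute one statement at a
    # time." (Warning on older Pythons, ProgrammingError on newer ones)
    print(f"rejected: {exc}")
```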
### background
dbt is a framework that effectively takes `SELECT` statements and wraps them within boilerplate DDL to create a DAG of tables and views. Only a somewhat auxiliary feature, seeds, makes use of prepared statements, with an `INSERT INTO ... VALUES` statement. In some ways, dbt is a stand-in for the parsing and preparation that a database might do in an extended query flow pattern. Using both complicates things greatly.
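As a rough sketch of that wrapping (the schema and model names here are hypothetical, not dbt's actual generated DDL):

```python
# Hypothetical sketch of the pattern described above: dbt takes a
# user-written SELECT and wraps it in boilerplate DDL to materialize it
# as a table. Names are made up for illustration.
model_sql = "select a, b from source_table"
ddl = f"create table analytics.my_model as (\n    {model_sql}\n);"
print(ddl)
```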
recently, in https://github.com/dbt-labs/dbt-redshift/pull/251, dbt-redshift migrated away from `psycopg2` to `redshift_connector` in order to enable IAM authentication. however, this change has introduced a number of regressions for our users, even though we set `autocommit=True` for all connections (among other mitigations).

### example
Error message (full traceback)
### counter-counter-argument

@Brooke-white I saw you mention somewhere that the spec for `.execute()` in PEP 249 – Python Database API Specification v2.0 lays out that the input may only be a single `operation`, and that `.executemany()` should be used for multi-operation/-statement queries. However, if I'm reading this right, the purpose of `.executemany()` is to execute one operation over a list of parameter sets. The first argument is still a singular `operation`.
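To make that reading concrete, here is the `.executemany()` shape per PEP 249, demonstrated with the stdlib `sqlite3` module as a stand-in driver: one parameterized operation, many parameter sets, never multiple distinct statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (a int, b int)")
# a single operation plus a sequence of parameter sets --
# NOT multiple ;-delimited statements
cur.executemany("INSERT INTO t VALUES (?, ?)", [(1, 2), (3, 4)])
cur.execute("SELECT count(*) FROM t")
print(cur.fetchone()[0])  # 2
```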
### sidebar
I'm very unclear on the exact meanings of the following terms here: `query`, `statement`, `operation`.
### further complications

all of this confusion is making it even harder to determine how transactions are handled by default between the simple and extended query flows.
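For a PEP 249 point of comparison (again using the stdlib `sqlite3` module as a stand-in; redshift_connector's defaults may differ), a common driver-level default is an implicit transaction opened by the first DML statement, with autocommit as an opt-in mode:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # default: implicit transactions for DML
conn.execute("CREATE TABLE t (a int)")
conn.execute("INSERT INTO t VALUES (1)")
print(conn.in_transaction)   # True: the INSERT opened an implicit transaction
conn.commit()
print(conn.in_transaction)   # False: transaction ended
conn.isolation_level = None  # switch to autocommit mode
conn.execute("INSERT INTO t VALUES (2)")
print(conn.in_transaction)   # False: each statement commits immediately
```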