aws / amazon-redshift-python-driver

Redshift Python Connector. It supports Python Database API Specification v2.0.
Apache License 2.0
202 stars 72 forks source link

Programming Error with percentile_cont #188

Closed stegarth closed 5 months ago

stegarth commented 10 months ago

Driver version

2.0.913

Redshift version

PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.54052

Client Operating System

Amazon Linux 2

Python version

3.11.4

Table schema

CREATE TABLE IF NOT EXISTS my_schema.cost
(
    recid BIGINT NOT NULL  ENCODE az64
    ,amount_per_year NUMERIC(38,4)   ENCODE az64
    ,insert_dts TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT getdate() ENCODE az64
    ,PRIMARY KEY (recid)
)
DISTSTYLE KEY
DISTKEY (recid)
SORTKEY (recid);

Problem description

  1. Expected behaviour: Query executes and returns the result of the desired calculation
  2. Actual behaviour: Query fails to execute and reports a KeyError and then that the function percentile_cont does not exist
  3. Error message/stack trace: KeyError:

    Traceback (most recent call last):
       File "/.../python3.11/site-packages/redshift_connector/core.py", line 1667, in execute
         ps = cache["ps"][key]
              ~~~~~~~~~~~^^^^^
    KeyError: ('SELECT percentile_cont(%s) WITHIN GROUP (ORDER BY amount_per_year) AS anon_1 FROM my_schema.cost',
              ((<RedshiftOID.FLOAT: 701>, 1, <built-in method pack of _struct.Struct object at 0x7f76175525c0>)))

    Followed by the programming error:

    {
        'S': 'ERROR',
        'C': '42883',
        'M': 'function percentile_cont(numeric, double precision) does not exist',
        'H': 'No function matches the given name and argument types. You may need to add explicit type casts.',
        'F': '../src/pg/src/backend/parser/parse_func.c',
        'L': '1399', 'R': 'ParseFuncOrColumn'
    }
  4. Any other details that can be helpful: The query worked with the psycopg2 but upon changing my driver to redshift_connector it no longer works.

    Python Driver trace logs

    
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    File .../redshift_connector/core.py:1667, in Connection.execute(self, cursor, operation, vals)
    1666 try:
    -> 1667     ps = cache["ps"][key]
    1668     cursor.ps = ps

KeyError: ('SELECT percentile_cont(%s) WITHIN GROUP (ORDER BY amount_per_year) AS anon_1 \nFROM my_schema.cost', ((<RedshiftOID.FLOAT: 701>, 1, <built-in method pack of _struct.Struct object at 0x7f53ce2403b0>),))

During handling of the above exception, another exception occurred:

ProgrammingError Traceback (most recent call last) File /.../sqlalchemy/engine/base.py:1910, in Connection._execute_context(self, dialect, constructor, statement, parameters, execution_options, *args, **kw) 1909 if not evt_handled: -> 1910 self.dialect.do_execute( 1911 cursor, statement, parameters, context 1912 ) 1914 if self._has_events or self.engine._has_events:

File /.../sqlalchemy/engine/default.py:736, in DefaultDialect.do_execute(self, cursor, statement, parameters, context) 735 def do_execute(self, cursor, statement, parameters, context=None): --> 736 cursor.execute(statement, parameters)

File /.../redshift_connector/cursor.py:241, in Cursor.execute(self, operation, args, stream, merge_socket_read) 240 self._c.merge_socket_read = merge_socket_read --> 241 self._c.execute(self, operation, args) 242 except AttributeError as e:

File /.../redshift_connector/core.py:1737, in Connection.execute(self, cursor, operation, vals) 1735 raise e -> 1737 self.handle_messages(cursor) 1739 # We've got row_desc that allows us to identify what we're 1740 # going to get back from this statement.

File /.../redshift_connector/core.py:2005, in Connection.handle_messages(self, cursor) 2004 if self.error is not None: -> 2005 raise self.error

ProgrammingError: {'S': 'ERROR', 'C': '42883', 'M': 'function percentile_cont(numeric, double precision) does not exist', 'H': 'No function matches the given name and argument types. You may need to add explicit type casts.', 'F': '../src/pg/src/backend/parser/parse_func.c', 'L': '1399', 'R': 'ParseFuncOrColumn'}


## Reproduction code
```python
import sqlalchemy as sql
engine = sql.create_engine("redshift+redshift_connector://[user]:[pass]@[redshift-host]:[port]/[db]")
md = sql.MetaData(schema="myschema")
md.bind = engine
T = sql.Table("cost", md, autoload_with=md.bind)
stmt = sql.select(sql.func.percentile_cont(0.5).within_group(T.c.allowed_per_year))
with engine.connect() as conn:
     exe = conn.execute(stmt)
     res = exe.fetchall()

Not sure why it isn't executing the percentile cont function but figured I'd pose the question here.

I've also posted this on the sqlalchemy-redshift repo as issue 286.

Brooke-white commented 10 months ago

Hi @stegarth , thank you for opening this issue. I've determined the cause of this issue. There is a difference in how redshift-connector and psyopg2 execute statements with bind parameters. While psycopg2 applies bind parameters to the statement on the client side, redshift-connector sends the parameterized statement and bind parameters to Redshift server separately. This can cause differences in behavior when Redshift server does not support the use of bind parameters, which is the case here.

I will reach out to the Redshift server team with a feature request to support the use of bind parameters for percentile_cont. In the meantime, my best suggestions would be to revert to using psycopg2 for the time being. While manually executing the statement using sql.text is a workaround, it is not one I'd recommend due to it not providing protection against sql injection. I'll provide updates on the status of the feature request here, as they become available.

stegarth commented 10 months ago

@Brooke-white Thank you for the information I'll revert back to psycopg2 for now.

Any thoughts on a rough time estimate for that kind of feature to get added?

Brooke-white commented 10 months ago

@stegarth, I will check with the team

reference: RedshiftDP-54083

Brooke-white commented 5 months ago

Hi folks, to request prioritization for this behavior I recommend to reach out to AWS Support and reference RedshiftDP-54083.