aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0
3.94k stars 701 forks source link

Athena read_sql_query provides completely wrong results for qmark style parametrized queries with cache enabled #2956

Closed scimas closed 1 month ago

scimas commented 2 months ago

Describe the bug

Exactly as stated in the title. If you're querying Athena using a qmark style query and use the query cache, and you change the actual parameter values between two query runs, aws-sdk-pandas/awswrangler retrieves the result of the previous query run with incorrect parameter values.

How to Reproduce

import awswrangler as wr

df = wr.athena.read_sql_query(
    sql="SELECT distinct id FROM my_database.my_table WHERE id in (?, ?)",
    database="my_database",
    ctas_approach=False,
    unload_approach=False,
    params=["id1", "id2"],
    paramstyle="qmark",
    workgroup="primary",
    athena_cache_settings={"max_cache_seconds": 300}
)
print(df)

df2 = wr.athena.read_sql_query(
    sql="SELECT distinct id FROM my_database.my_table WHERE id in (?, ?)",
    database="my_database",
    ctas_approach=False,
    unload_approach=False,
    params=["id3", "id4"],
    paramstyle="qmark",
    workgroup="primary",
    athena_cache_settings={"max_cache_seconds": 300}
)
print(df2)

this with a database and table should be enough to reproduce the issue.

Expected behavior

Since the parameters to the query have changed, a fresh query should be run.

Your project

No response

Screenshots

No response

OS

Amazon Linux 2023

Python version

3.12

AWS SDK for pandas version

layer:AWSSDKPandas-Python312-Arm64:12

Additional context

My guess is that this is happening due to aws-sdk-pandas using a hand-rolled cache implementation rather than just letting Athena handle the cache through the ResultReuseConfiguration of StartQueryExecution. As far as I know, Athena does not populate the query parameter values in the GetQueryExecution output. So you get the parametrized query without the parameter values and check the current query against it, which is incorrect.