awslabs / aws-athena-query-federation

The Amazon Athena Query Federation SDK allows you to customize Amazon Athena with your own data sources and code.
Apache License 2.0
556 stars 292 forks source link

[BUG] Redshift 'char' column constraint pushdown does not work #62

Closed abhijeetvg closed 2 months ago

abhijeetvg commented 4 years ago

Execution with constraint on char column returns empty results.

Verified that correct value is set in the prepared statement.

Similar push down works in MySql.

avirtuos commented 4 years ago

Can you provide more details here?

  1. What was the table schema used in the test?
  2. What was the query used to reproduce against Athena?
  3. What was the query that was produced against the underlying data store?
avirtuos commented 4 years ago

Depending on the above answers, can we disable pushdown of fields with this type into the source engine to allow correct results even if it means reduced performance / overscan?

abhijeetvg commented 4 years ago

This is the case of postgres as well.

I observed though that if you add padding to the length of the column, queries with these where clauses work.

following works

select * from <some_test_table> where test_char = 'char2                ';

following does not work:

select * from <some_test_table> where test_char = 'char2';
grundprinzip commented 4 years ago

I think this is a specific to Postgres / Redshift because we treat trailing whitespaces of Char columns different than for example Athena. For these systems trailing whitespace is always significant.

There are two ways to deal with this problem:

  1. Add a trim to every predicate and column during the federated subquery execution.
  2. Pad the column for CHAR columns.

Probably it's important to keep in mind that trailing spaces might occur as well in VARCHAR columns.

egriffith commented 1 year ago

Greetings from 2023, I am fairly certain this bug is still present in the latest Athena Connector. Please let me know if I need to cut a different issue

My Reproduction Steps:

1) Create a new redshift database, check the box to allow it to load the sample data. 2) Setup the Athena connector to connect to Redshift's sample_data_dev database.

3) Run the following query in Redshift:

SELECT count(*) as count, i_category
FROM tpcds.catalog_sales JOIN tpcds.item ON i_item_sk=cs_item_sk
                  JOIN tpcds.date_dim ON d_date_sk=cs_sold_date_sk
                  LEFT JOIN tpcds.catalog_returns ON (cs_order_number=cr_order_number 
                                            AND cs_item_sk=cr_item_sk)
WHERE i_category = 'Men'
GROUP BY i_category
order by 1

You'll get back 141762 results.

Run the same query above in Athena using the connector, and you'll get back zero results. The reason is because i_category is padding 47 extra spaces onto the value and Athena is not trimming them out. A query that does work is...

SELECT count(*) as count, i_category
FROM tpcds.catalog_sales JOIN tpcds.item ON i_item_sk=cs_item_sk
                  JOIN tpcds.date_dim ON d_date_sk=cs_sold_date_sk
                  LEFT JOIN tpcds.catalog_returns ON (cs_order_number=cr_order_number 
                                            AND cs_item_sk=cr_item_sk)
WHERE i_category = 'Men                                               '
GROUP BY i_category
order by 1

or

SELECT count(*) as count, i_category
FROM tpcds.catalog_sales JOIN tpcds.item ON i_item_sk=cs_item_sk
                  JOIN tpcds.date_dim ON d_date_sk=cs_sold_date_sk
                  LEFT JOIN tpcds.catalog_returns ON (cs_order_number=cr_order_number 
                                            AND cs_item_sk=cr_item_sk)
WHERE i_category LIKE 'Men%'
GROUP BY i_category
order by 1

However neither option should have to be used. Athena should be trimming the whitespace natively in order to match Redshift's behavior.

AbdulR3hman commented 2 months ago

This has been address in the latest release.