awslabs / emr-dynamodb-connector

Implementations of open source Apache Hadoop/Hive interfaces which allow for ingesting data from Amazon DynamoDB
Apache License 2.0

Regarding pushdown filter predicate #174

Open ankit11519 opened 1 year ago

ankit11519 commented 1 year ago

We are using Apache Spark to connect to DynamoDB via the emr-dynamodb-connector. Below are my code statements in PySpark:

dynamoDf = spark.read.option('region', 'REGION') \
    .option("tableName", "TABLE_NAME") \
    .format("dynamodb") \
    .load()

dynamoDfFilter = dynamoDf.filter((F.col("colFilter").startswith('ABC')) | (F.col("colFilter").startswith('XYZ')))

print(dynamoDfFilter.count())

So I wanted to know whether these filter conditions will also be pushed down to DynamoDB for server-side filtering, or whether they will be applied only after a full scan has loaded the data into "dynamoDf".

kevnzhao commented 1 year ago

Which connector are you using? Predicate pushdown is only available for Hive tables in the EMR DDB Hive connector package. You can find the supported data types and operators HERE.

@mimaomao feel free to comment if I miss anything.

ankit11519 commented 1 year ago

We are using this connector: [Accessing data in Amazon DynamoDB with Apache Spark]

So we are accessing Amazon DynamoDB with Apache Spark for a scan operation. My question is: if the query is like this:

dynamoDfFilter = dynamoDf.filter((F.col("colFilter").startswith('ABC')) | (F.col("colFilter").startswith('XYZ')))

print(dynamoDfFilter.count())

When the Spark 'filter' is applied to the scan query, does the query sent to DynamoDB carry a 'filterExpression' for server-side filtering, or does the scan load the entire table into a DataFrame first, with the filter applied on that DataFrame afterwards?

kevnzhao commented 1 year ago

Then you are using the Hadoop connector. Predicate push-down is not available there yet, so all data is loaded from DynamoDB and the filter is applied on the Spark side.
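The distinction matters mainly for how much data crosses the network. A minimal sketch (plain Python, no AWS calls; the in-memory table and all names are illustrative, not part of the connector's API) of server-side filtering, as a DynamoDB FilterExpression would provide, versus the client-side filtering the Hadoop connector performs today:

```python
# Hypothetical in-memory stand-in for a DynamoDB table: 9 items,
# 3 per prefix. Names and data are illustrative only.
items = [{"colFilter": prefix + str(i)}
         for prefix in ("ABC", "XYZ", "OTH")
         for i in range(3)]

def matches(item):
    # Same predicate as the PySpark filter above.
    value = item["colFilter"]
    return value.startswith("ABC") or value.startswith("XYZ")

# Server-side filtering (what a FilterExpression would do): DynamoDB still
# reads every item during the scan, but only matching items are returned,
# so only those cross the network.
transferred_server_side = [it for it in items if matches(it)]

# Client-side filtering (the Hadoop connector's current behavior): every
# item crosses the network into the DataFrame, then Spark applies the filter.
transferred_client_side = list(items)
filtered_after_load = [it for it in transferred_client_side if matches(it)]

print(len(transferred_server_side))   # 6 items transferred
print(len(transferred_client_side))   # 9 items transferred
print(filtered_after_load == transferred_server_side)  # True: same result either way
```

Note that even a real DynamoDB FilterExpression is applied after items are read by the scan, so it reduces the data transferred but not the read capacity consumed; only a Query against a key condition avoids reading the full table.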