awslabs / emr-dynamodb-connector

Implementations of open source Apache Hadoop/Hive interfaces which allow for ingesting data from Amazon DynamoDB
Apache License 2.0
217 stars 135 forks

Set RCU/WCU for PROVISIONED tables while maintaining support for autoscaling #178

Closed kevnzhao closed 1 year ago

kevnzhao commented 1 year ago

Issue #, if available:

158

Description of changes:

Currently we do not configure DynamoDBConstants.READ_THROUGHPUT and DynamoDBConstants.WRITE_THROUGHPUT for PROVISIONED tables in our DynamoDB InputFormat. This is because PROVISIONED tables can have auto-scaling enabled on read and write capacity, and every time a new task starts we want to fetch the current capacity from DDB to make sure we are fully utilizing it. However, we also use the read and write throughput values to calculate the number of mappers, so leaving these properties unset at job setup causes only one map task to be launched.
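To illustrate why an unset throughput collapses the job to a single map task, here is a minimal sketch. It is not the connector's actual code: the class `MapperEstimate`, the fallback value, the per-mapper capacity, and the property string are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the connector's real mapper calculation.
final class MapperEstimate {
    // Assumed property name standing in for DynamoDBConstants.READ_THROUGHPUT.
    static final String READ_THROUGHPUT = "dynamodb.throughput.read";
    // Assumed fallback when the property was never set.
    static final long FALLBACK_THROUGHPUT = 1L;
    // Assumed capacity units a single map task can consume per second.
    static final long UNITS_PER_MAPPER = 100L;

    // Estimates the number of map tasks from the configured read throughput,
    // clamped to the range [1, maxMappers].
    static int estimateMappers(Map<String, String> conf, int maxMappers) {
        String raw = conf.get(READ_THROUGHPUT);
        long throughput = (raw == null) ? FALLBACK_THROUGHPUT : Long.parseLong(raw);
        long wanted = Math.max(1L, throughput / UNITS_PER_MAPPER);
        return (int) Math.min(wanted, (long) maxMappers);
    }
}
```

Under these assumed numbers, an empty configuration yields one mapper, while a configuration with read throughput 1000 yields ten (subject to the cluster limit).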

With these changes, we set the read and write throughput at job setup even for PROVISIONED tables, so that we can still estimate the number of mappers needed. In addition, for PROVISIONED tables we pass a new configuration property indicating that throughput should be re-fetched every time a new task starts, to account for auto-scaling.
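The approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `ThroughputSetup`, the `example.throughput.read.refresh` property, and the `LongSupplier` standing in for a DDB capacity lookup are hypothetical names, and only the read side is shown.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical sketch of "seed at job setup, refresh at task start".
final class ThroughputSetup {
    // Assumed property name standing in for DynamoDBConstants.READ_THROUGHPUT.
    static final String READ_THROUGHPUT = "dynamodb.throughput.read";
    // Hypothetical flag name; the real property added by the PR may differ.
    static final String REFRESH_PER_TASK = "example.throughput.read.refresh";

    // Job setup: always seed the throughput so mappers can be estimated,
    // and mark PROVISIONED tables so tasks know to re-fetch capacity.
    static void configureAtJobSetup(Map<String, String> conf,
                                    boolean provisioned,
                                    long currentReadCapacity) {
        conf.put(READ_THROUGHPUT, Long.toString(currentReadCapacity));
        conf.put(REFRESH_PER_TASK, Boolean.toString(provisioned));
    }

    // Task start: re-fetch capacity when the flag is set (auto-scaling may
    // have changed it); otherwise use the value seeded at job setup.
    static long throughputAtTaskStart(Map<String, String> conf,
                                      LongSupplier fetchCurrentCapacity) {
        if (Boolean.parseBoolean(conf.get(REFRESH_PER_TASK))) {
            return fetchCurrentCapacity.getAsLong();
        }
        return Long.parseLong(conf.get(READ_THROUGHPUT));
    }
}
```

Keeping the seeded value and the refresh flag as separate properties means the mapper estimate and the per-task rate can diverge safely: the estimate uses the capacity seen at job setup, while each task uses whatever capacity is current when it starts.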

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.