awslabs / emr-dynamodb-connector

Implementations of open source Apache Hadoop/Hive interfaces which allow for ingesting data from Amazon DynamoDB
Apache License 2.0

Auto refresh of RCU usage #135

Open sivankumar86 opened 4 years ago

sivankumar86 commented 4 years ago

Issue: The DynamoDB export job runs for more than 5 days because of data skew, which causes the Data Pipeline to time out.

Configuration: r5.24xlarge = 20, RCU = 400k, size = ~80 TB, maps = 2000

About 70 TB was exported in around 9 hours, but the rest of the data is scanned at <10k, so the job runs much longer.
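To make the numbers concrete, here is a rough back-of-the-envelope sketch. It assumes the configured read capacity is split evenly across map tasks when the job starts and is never redistributed (an assumption about the behaviour, not something confirmed from the connector source), and that the "<10k" figure refers to consumed RCU. The function names are made up for illustration.

```python
# Assumption: RCU is divided evenly across map tasks at job start
# and each task keeps that fixed share for its whole lifetime.
TOTAL_RCU = 400_000  # from the configuration above
NUM_MAPS = 2_000     # from the configuration above

# Fixed share each map task may consume.
rcu_per_map = TOTAL_RCU / NUM_MAPS


def consumed_rcu(running_maps: int) -> float:
    """Aggregate RCU the job can consume with this many maps still running."""
    return running_maps * rcu_per_map


def max_running_maps(observed_rcu: float) -> int:
    """Largest number of running maps consistent with the observed RCU."""
    return int(observed_rcu // rcu_per_map)


if __name__ == "__main__":
    print(rcu_per_map)               # share per map task
    print(max_running_maps(10_000))  # maps still running at 10k RCU
```

Under these assumptions each map gets 200 RCU, so consumption under 10k would mean roughly 50 or fewer maps are still running, and the remaining capacity sits idle.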

I have also tried increasing the YARN map memory and reducing the number of nodes to increase the RCU per map; however, this is a trial-and-error method that takes time and increases EMR cost.

Solution: This could be mitigated if the RCU allocation were refreshed at a certain interval based on the containers that are still running. Only a few containers run for a long time at the end of the job, but RCU is assigned once at the start of the job.
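A minimal sketch of the idea, in Python for brevity. This is not the connector's code: `rebalance_rcu`, the task IDs, and the optional `per_task_cap` parameter are all hypothetical, and how a running task would actually learn its new share is left out.

```python
from typing import Dict, List, Optional


def rebalance_rcu(
    total_rcu: float,
    running_tasks: List[str],
    per_task_cap: Optional[float] = None,
) -> Dict[str, float]:
    """Split total_rcu evenly across the tasks that are still running.

    Intended to be called periodically, so that capacity released by
    finished tasks is handed to the remaining ones. per_task_cap is a
    hypothetical upper bound on what a single task may be given.
    """
    if not running_tasks:
        return {}
    share = total_rcu / len(running_tasks)
    if per_task_cap is not None:
        share = min(share, per_task_cap)
    return {task: share for task in running_tasks}


if __name__ == "__main__":
    # Start of job: 2000 maps share 400k RCU.
    start = rebalance_rcu(400_000, [f"map_{i}" for i in range(2000)])
    print(start["map_0"])
    # Near the end: only 4 maps remain, so each share grows.
    tail = rebalance_rcu(400_000, ["map_7", "map_42", "map_99", "map_1500"])
    print(tail["map_7"])
```

With a refresh like this, the few long-running containers at the end of the job would be allowed to use the capacity that finished containers no longer need, instead of staying at their initial share.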

Any other suggestions?