audienceproject / spark-dynamodb

Plug-and-play implementation of an Apache Spark custom data source for AWS DynamoDB.
Apache License 2.0

V0.4.0 tanks write performance #22

Closed: colemanja91 closed this issue 5 years ago

colemanja91 commented 5 years ago

Issue: Upgrading to v0.4.0 seems to kill any previous parallelism benefits; we only see ~25 records being written at a time.

[Screenshot: DynamoDB console, consumed write capacity units over time]

The above screenshot (from the DynamoDB console) illustrates the difference. The slight bump in consumed write capacity just before 16:00 is a run on v0.4.0; the spike a few minutes later is the exact same job on v0.3.6.

Expected behavior: Write parallelism should be the same as in v0.3.6.

jacobfi commented 5 years ago

Hi @colemanja91, Can you find out what spark.sparkContext.defaultParallelism is set to in your Spark program? Also, how large (in bytes) is an average row in your dataset? And are you doing writes or updates (flag update=true)?
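
A quick way to check this value (a minimal sketch; spark here is assumed to be the active SparkSession):

// Prints the default number of partitions Spark uses when no
// explicit partition count is given.
println(spark.sparkContext.defaultParallelism)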

I can't seem to reproduce your problem out of the box.

colemanja91 commented 5 years ago

Default parallelism: 72
Average item size: 1,127 bytes

We're strictly doing writes - not setting anything other than the region, actually:

import com.audienceproject.spark.dynamodb.implicits._ // provides .dynamodb(...) on DataFrameWriter

df
  .write
  .option("region", awsRegion)
  .dynamodb(tableName)

jacobfi commented 5 years ago

Hi @colemanja91, I found the bug. I will be publishing a fix in version 0.4.1. Until then, you can work around the issue by repartitioning your dataset into at least 72 partitions before writing. This maximizes parallelism in any case. Thank you for submitting this issue.
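
A minimal sketch of the suggested workaround, reusing df, awsRegion, and tableName from the snippet above; the partition count of 72 matches the reported default parallelism:

import com.audienceproject.spark.dynamodb.implicits._

// Repartitioning before the write ensures at least 72 concurrent
// write tasks, restoring the parallelism lost in v0.4.0.
df.repartition(72)
  .write
  .option("region", awsRegion)
  .dynamodb(tableName)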

colemanja91 commented 5 years ago

@jacobfi Awesome - thanks for digging into it!