Closed: colemanja91 closed this issue 5 years ago
Hi colemanja91,
Can you find out what spark.sparkContext.defaultParallelism is set to in your Spark program?
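For reference, one way to check this value from a Scala job or the spark-shell (the app name here is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch (assumes a Spark environment is available): grab the
// active SparkSession and print the default parallelism it reports.
val spark = SparkSession.builder().appName("check-parallelism").getOrCreate()
println(s"Default parallelism: ${spark.sparkContext.defaultParallelism}")
```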
Also, how large (in bytes) is an average row in your dataset? And are you doing writes or updates (flag update=true)?
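For context, that update flag is a writer option; a minimal sketch of how it might be set, reusing the option names mentioned in this thread (the implicits import path is assumed from the spark-dynamodb project, and df / awsRegion / tableName are placeholders):

```scala
// Import path assumed from the spark-dynamodb project; adjust if it differs.
import com.audienceproject.spark.dynamodb.implicits._

// Illustrative only: enable update mode instead of plain writes.
df.write
  .option("region", awsRegion)
  .option("update", "true")
  .dynamodb(tableName)
```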
I can't seem to reproduce your problem right out of the box.
Default parallelism: 72
Average item size: 1,127 bytes
We're strictly doing writes. We're not setting anything other than region, actually:
df.write
  .option("region", awsRegion)
  .dynamodb(tableName)
Hi colemanja91,
I found the bug. I will be publishing a fix in version 0.4.1. Until then, you can work around the issue by repartitioning your dataset into at least 72 partitions before writing. This would maximize parallelism in any case.
Thank you for submitting this issue.
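A minimal sketch of that workaround, reusing the names from the snippet above (72 matches the reported default parallelism):

```scala
// Workaround for v0.4.0: repartition so at least 72 tasks write in parallel.
df.repartition(72)
  .write
  .option("region", awsRegion)
  .dynamodb(tableName)
```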
@jacobfi Awesome - thanks for digging into it!
Issue: Upgrading to v0.4.0 seems to eliminate the previous parallelism benefits; we're only writing roughly 25 records at a time.
The screenshot above (from the DynamoDB console) illustrates this: the slight bump in consumed write capacity right before 16:00 is this job running v0.4.0, while the spike a few minutes after that is the exact same job running v0.3.6.
Expected behavior: Write parallelism should be the same as with v0.3.6.