awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

DDB write_dynamic_frame_from_options fails with requests throttled error #104

Closed OperationalFallacy closed 2 years ago

OperationalFallacy commented 2 years ago

I'm trying to copy a DDB table using the DynamoDB read and write sinks. This is a small test table with 20 million items and about 500 MB in size.

Avg item size is 28 bytes.

The read side works just fine. The job reads items pretty fast with 3-5 workers, and it takes only a few minutes.

However, the write sink is a disaster. The requests get throttled and it can barely write 100k records a minute. I have already tried all Glue versions.

It actually fails fast with retries exhausted. I had to raise "dynamodb.output.retry" to 30-50 (see the sketch after the code below), because the default of 10 fails the Glue job as soon as it starts writing with: An error occurred while calling o70.pyWriteDynamicFrame. DynamoDB write exceeds max retry 10

This is the sink in Python:

def WriteTable(gc, dyf, tableName):
    # gc: GlueContext, dyf: DynamicFrame to write, tableName: target DDB table
    gc.write_dynamic_frame_from_options(
        frame=dyf,
        connection_type="dynamodb",
        connection_options={
            "dynamodb.output.tableName": tableName,
            # use 100% of the table's write capacity
            "dynamodb.throughput.write.percent": "1"
        }
    )
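
For reference, a minimal sketch of the same sink with the retry workaround mentioned above (the helper name WriteTableWithRetries and the value 30 are only examples from the 30-50 range I tried; the option names come from the Glue DynamoDB connection docs):

def WriteTableWithRetries(gc, dyf, tableName):
    # Same sink as above, but with dynamodb.output.retry raised above the default of 10
    gc.write_dynamic_frame_from_options(
        frame=dyf,
        connection_type="dynamodb",
        connection_options={
            "dynamodb.output.tableName": tableName,
            "dynamodb.throughput.write.percent": "1",
            # default is 10; without raising this, the job fails as soon as writes get throttled
            "dynamodb.output.retry": "30"
        }
    )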

Where could the problem be?! This is what the metrics on the table look like when Glue is trying to write:

[image: table metrics during the Glue write]

The table is pretty simple, records of currency rates like this:

{
 "pair": {
  "S": "AMDANG"
 },
 "date": {
  "N": "20080101"
 },
 "value": {
  "N": "0.005845"
 }
}

Thanks!

taimax13 commented 2 years ago

Hey guys, any updates on this?

moomindani commented 2 years ago

We apologize for the delay.

As described in the document below, you can configure dynamodb.output.retry to allow more retries when writes are throttled. You can increase it to a higher value to avoid job failures due to write throttling. https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-dynamodb
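
For example, the connection options for the write sink could look roughly like this (the value 40 is only illustrative; glueContext, dyf, and table_name are placeholder names):

glueContext.write_dynamic_frame_from_options(
    frame=dyf,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": table_name,
        "dynamodb.throughput.write.percent": "1",
        # allow more retries before the job fails on write throttling (default is 10)
        "dynamodb.output.retry": "40"
    }
)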