mcwqy9 opened this issue 4 years ago
I suspect the problem is when we add delete requests here:
In the retry loop we assume PutRequests, not DeleteRequests, so we get a null back when we call getPutRequest():
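For reference, the pattern I mean looks roughly like this (just an illustrative sketch, not the connector's actual code; the WriteRequest/PutRequest/DeleteRequest types are the real AWS SDK v1 model classes):

import java.util.Map;

import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.DeleteRequest;
import com.amazonaws.services.dynamodbv2.model.PutRequest;
import com.amazonaws.services.dynamodbv2.model.WriteRequest;

public class RetryNpeSketch {
  // What the retry path effectively assumes: every unprocessed WriteRequest
  // wraps a PutRequest. With 'dynamodb.deletion.mode' = 'true' it may wrap a
  // DeleteRequest instead, so getPutRequest() is null and this throws an NPE.
  static Map<String, AttributeValue> itemOf(WriteRequest request) {
    return request.getPutRequest().getItem();
  }

  // One possible null-safe variant (a hypothetical fix, not merged code):
  // use the put item when present, otherwise fall back to the delete key.
  static Map<String, AttributeValue> itemOrKeyOf(WriteRequest request) {
    PutRequest put = request.getPutRequest();
    if (put != null) {
      return put.getItem();
    }
    DeleteRequest delete = request.getDeleteRequest();
    return delete.getKey();
  }
}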
I've found this can be worked around by setting 'dynamodb.max.batch.items' = '1', but that cuts the throughput per mapper by roughly a factor of 10 in my testing.
@Tang8330 Thanks so much for writing https://github.com/awslabs/emr-dynamodb-connector/pull/96; it has been very useful to me. Would you consider coding a fix for this issue?
Do you have a sample script I can use to reproduce this NPE and test a fix? Thanks.
Here is one such example in Hive:
SET hivevar:TABLE_NAME=my_table;
SET hivevar:INPUT_S3_PATH=s3://my_bucket/data/;
SET hivevar:OUTPUT_DDB_TABLE_NAME=my-output-table;
SET hivevar:TABLE_COLS=pkey STRING, skey STRING;
SET hivevar:DDB_COLUMN_MAPPING=pkey:pkey,skey:skey;
SET hivevar:OUTPUT_DDB_REGION=us-east-1;
CREATE EXTERNAL TABLE input_${TABLE_NAME}_s3 ( ${TABLE_COLS} ) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION '${INPUT_S3_PATH}';
CREATE EXTERNAL TABLE output_${TABLE_NAME}_ddb ( ${TABLE_COLS} ) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES(
'dynamodb.table.name' = '${OUTPUT_DDB_TABLE_NAME}',
'dynamodb.region' = '${OUTPUT_DDB_REGION}',
'dynamodb.column.mapping' = '${DDB_COLUMN_MAPPING}',
'dynamodb.deletion.mode' = 'true',
'dynamodb.max.batch.items' = '2' -- change to '1' in order to avoid NPE
);
INSERT OVERWRITE TABLE output_${TABLE_NAME}_ddb SELECT * FROM input_${TABLE_NAME}_s3;
Any run with 'dynamodb.max.batch.items' greater than 1 that results in write throttling on the table should trigger this NPE.
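If it helps while testing a fix, the null itself can also be demonstrated without a cluster, straight from the SDK model objects (a throwaway demo; the class name is made up):

import java.util.Collections;

import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.DeleteRequest;
import com.amazonaws.services.dynamodbv2.model.WriteRequest;

public class DeleteRequestNullDemo {
  public static void main(String[] args) {
    // A delete-mode write, as produced when 'dynamodb.deletion.mode' = 'true'.
    WriteRequest delete = new WriteRequest().withDeleteRequest(
        new DeleteRequest().withKey(
            Collections.singletonMap("pkey", new AttributeValue("some-key"))));

    // Prints "null": this is the value the retry loop dereferences after
    // DynamoDB returns the request as unprocessed (i.e. after throttling).
    System.out.println(delete.getPutRequest());
  }
}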
Hi, is there a fix for this yet? I am having the same issue as @mcwqy9; using the workaround of batch size = 1 makes the Hive job very slow and it times out.
I am using this lib with Hive on EMR 5.27 and dynamodb.deletion.mode = 'true'. If Hive encounters write throttling on the DDB table (even fewer than 10 throttles), the task fails with