awslabs / emr-dynamodb-connector

Implementations of open source Apache Hadoop/Hive interfaces which allow for ingesting data from Amazon DynamoDB
Apache License 2.0

NullPointerException handling retries in delete mode #118

Open · mcwqy9 opened this issue 4 years ago

mcwqy9 commented 4 years ago

I am using this library with Hive on EMR 5.27 and 'dynamodb.deletion.mode' = 'true'. If Hive encounters write throttling on the DynamoDB table (even fewer than 10 throttled requests), the task fails with:


```
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row [REMOVED]
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:169)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:455)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row [REMOVED]
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:565)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:160)
    ... 8 more
Caused by: java.lang.RuntimeException: java.lang.NullPointerException
    at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:120)
    at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:83)
    at org.apache.hadoop.dynamodb.DynamoDBClient.writeBatch(DynamoDBClient.java:251)
    at org.apache.hadoop.dynamodb.DynamoDBClient.putBatch(DynamoDBClient.java:208)
    at org.apache.hadoop.dynamodb.write.AbstractDynamoDBRecordWriter.write(AbstractDynamoDBRecordWriter.java:112)
    at org.apache.hadoop.hive.dynamodb.write.HiveDynamoDBRecordWriter.write(HiveDynamoDBRecordWriter.java:42)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:148)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:550)
    ... 9 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.dynamodb.DynamoDBClient$4.call(DynamoDBClient.java:273)
    at org.apache.hadoop.dynamodb.DynamoDBClient$4.call(DynamoDBClient.java:252)
    at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:80)
    ... 20 more
```

This happens with both MapReduce and Tez.
evizitei commented 4 years ago

I suspect the problem starts where we add delete requests to the batch here:

https://github.com/awslabs/emr-dynamodb-connector/blob/36ce669d052b3cc33d2ee1a993619ec439aa2052/emr-dynamodb-hadoop/src/main/java/org/apache/hadoop/dynamodb/DynamoDBClient.java#L223

In the retry loop we assume PutRequests, not DeleteRequests, so the call to getPutRequest() returns null here:

https://github.com/awslabs/emr-dynamodb-connector/blob/36ce669d052b3cc33d2ee1a993619ec439aa2052/emr-dynamodb-hadoop/src/main/java/org/apache/hadoop/dynamodb/DynamoDBClient.java#L273
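
To make the suspected failure concrete, here is a minimal, self-contained sketch (my own demo code, not the connector's; the class name and the null-safe variant are illustrative) showing how a delete-mode WriteRequest from the AWS SDK v1 yields null from getPutRequest(), and how a fix could read the key from the delete side instead:

```java
import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.DeleteRequest;
import com.amazonaws.services.dynamodbv2.model.WriteRequest;

public class DeleteRetryNpeDemo {
  public static void main(String[] args) {
    Map<String, AttributeValue> key = new HashMap<>();
    key.put("pkey", new AttributeValue("a"));
    key.put("skey", new AttributeValue("b"));

    // In deletion mode the batch carries DeleteRequests, not PutRequests.
    WriteRequest deleteItem = new WriteRequest().withDeleteRequest(new DeleteRequest(key));

    // The retry path effectively does the following, which throws the
    // reported NPE because getPutRequest() is null for a delete request:
    //   Map<String, AttributeValue> item = deleteItem.getPutRequest().getItem();

    // A null-safe variant would size the request from whichever side is set:
    Map<String, AttributeValue> attrs = deleteItem.getPutRequest() != null
        ? deleteItem.getPutRequest().getItem()
        : deleteItem.getDeleteRequest().getKey();
    System.out.println("attributes to size: " + attrs);
  }
}
```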

mcwqy9 commented 4 years ago

I've found this can be worked around by setting 'dynamodb.max.batch.items' = '1', but in my testing that cuts the per-mapper throughput by a factor of 10.
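
For anyone driving the connector's output format from plain MapReduce rather than Hive, the same workaround property can be set on the job configuration. This is an illustrative sketch only (the table name and region are placeholders taken from the examples in this thread):

```java
import org.apache.hadoop.mapred.JobConf;

public class BatchSizeWorkaround {
  public static JobConf configure(JobConf conf) {
    conf.set("dynamodb.table.name", "my-output-table");
    conf.set("dynamodb.region", "us-east-1");
    conf.set("dynamodb.deletion.mode", "true");
    // One item per batch works around the NPE per the report above,
    // at roughly a 10x per-mapper throughput cost.
    conf.set("dynamodb.max.batch.items", "1");
    return conf;
  }
}
```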

mcwqy9 commented 4 years ago

@Tang8330 Thanks so much for writing https://github.com/awslabs/emr-dynamodb-connector/pull/96; it has been very useful to me. Would you consider coding a fix for this issue?

foscraig commented 4 years ago

Do you have a sample script I can use to reproduce this NPE and test a fix? Thanks.

mcwqy9 commented 4 years ago

Here is one such example in Hive:

```sql
SET hivevar:TABLE_NAME=my_table;
SET hivevar:INPUT_S3_PATH=s3://my_bucket/data/;
SET hivevar:OUTPUT_DDB_TABLE_NAME=my-output-table;
SET hivevar:TABLE_COLS=pkey STRING, skey STRING;
SET hivevar:DDB_COLUMN_MAPPING=pkey:pkey,skey:skey;
SET hivevar:OUTPUT_DDB_REGION=us-east-1;

CREATE EXTERNAL TABLE input_${TABLE_NAME}_s3 ( ${TABLE_COLS} ) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION '${INPUT_S3_PATH}';

CREATE EXTERNAL TABLE output_${TABLE_NAME}_ddb ( ${TABLE_COLS} ) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES(
    'dynamodb.table.name' = '${OUTPUT_DDB_TABLE_NAME}',
    'dynamodb.region' = '${OUTPUT_DDB_REGION}',
    'dynamodb.column.mapping' = '${DDB_COLUMN_MAPPING}',
    'dynamodb.deletion.mode' = 'true',
    'dynamodb.max.batch.items' = '2' -- change to '1' in order to avoid NPE
  );

INSERT OVERWRITE TABLE output_${TABLE_NAME}_ddb SELECT * FROM input_${TABLE_NAME}_s3;
```

Any run with 'dynamodb.max.batch.items' greater than 1 that results in write throttling on the table should trigger this NPE.

mav787 commented 1 year ago

Hi, do we have a fix for this? I am having the same issue as @mcwqy9; the batch-size-of-1 workaround makes the Hive job very slow and it times out.