awslabs / emr-dynamodb-connector

Implementations of open source Apache Hadoop/Hive interfaces which allow for ingesting data from Amazon DynamoDB
Apache License 2.0

Read throughput of table is not getting set correctly in 4.16.0 version #158

Open ganeshashree opened 2 years ago

ganeshashree commented 2 years ago

When Hive reads from a DynamoDB-backed Hive table through DynamoDBStorageHandler, the read throughput is set to null even though ReadCapacityUnits is set in the ProvisionedThroughput configured in the table properties. This leads to an incorrect mapper count during split generation. I found this issue in version 4.16.0; it does not exist in version 4.9.0 of the DynamoDB connector.

The following are the relevant Hive logs:

2021-11-25T18:04:12,287 INFO  [f2f33a58-71b4-4bd0-b0e5-6a38d5fe410c main([])]: dynamodb.DynamoDBClient (DynamoDBClient.java:call(136)) - Describe table output: {Table: {AttributeDefinitions: [{AttributeName: id,AttributeType: N}, {AttributeName: version,AttributeType: N}],TableName: volume,KeySchema: [{AttributeName: id,KeyType: HASH}, {AttributeName: version,KeyType: RANGE}],TableStatus: ACTIVE,CreationDateTime: *** ProvisionedThroughput: {LastIncreaseDateTime: ***,NumberOfDecreasesToday: 0,ReadCapacityUnits: 31104,WriteCapacityUnits: 960},TableSizeBytes: 149776006681,ItemCount: 140893823,TableArn: ****,TableId: ******,}}
2021-11-25T18:04:12,287 INFO  [f2f33a58-71b4-4bd0-b0e5-6a38d5fe410c main([])]: dynamodb.DynamoDBStorageHandler (DynamoDBStorageHandler.java:configureTableJobProperties(127)) - Average item size: 1063.04168267902
2021-11-25T18:04:12,287 INFO  [f2f33a58-71b4-4bd0-b0e5-6a38d5fe410c main([])]: dynamodb.DynamoDBStorageHandler (DynamoDBStorageHandler.java:configureTableJobProperties(203)) - Average item size: 1063.04168267902
2021-11-25T18:04:12,287 INFO  [f2f33a58-71b4-4bd0-b0e5-6a38d5fe410c main([])]: dynamodb.DynamoDBStorageHandler (DynamoDBStorageHandler.java:configureTableJobProperties(204)) - Item count: 140893823
2021-11-25T18:04:12,287 INFO  [f2f33a58-71b4-4bd0-b0e5-6a38d5fe410c main([])]: dynamodb.DynamoDBStorageHandler (DynamoDBStorageHandler.java:configureTableJobProperties(205)) - Table size: 149776006681
2021-11-25T18:04:12,287 INFO  [f2f33a58-71b4-4bd0-b0e5-6a38d5fe410c main([])]: dynamodb.DynamoDBStorageHandler (DynamoDBStorageHandler.java:configureTableJobProperties(206)) - Read throughput: null
2021-11-25T18:04:12,287 INFO  [f2f33a58-71b4-4bd0-b0e5-6a38d5fe410c main([])]: dynamodb.DynamoDBStorageHandler (DynamoDBStorageHandler.java:configureTableJobProperties(207)) - Write throughput: null

Split generation log:

2021-11-25 18:04:13,244 [INFO] [InputInitializer {Map 1} #0] |read.AbstractDynamoDBInputFormat|: Read percentage: 0.2
2021-11-25 18:04:13,809 [INFO] [InputInitializer {Map 1} #0] |read.AbstractDynamoDBInputFormat|: Would use 139 segments for size
2021-11-25 18:04:13,809 [INFO] [InputInitializer {Map 1} #0] |read.AbstractDynamoDBInputFormat|: Would use 0 segments for throughput
2021-11-25 18:04:13,809 [INFO] [InputInitializer {Map 1} #0] |read.AbstractDynamoDBInputFormat|: Using computed number of segments: 139
2021-11-25 18:04:13,812 [INFO] [InputInitializer {Map 1} #0] |read.AbstractDynamoDBInputFormat|: Max number of cluster map tasks: 62
2021-11-25 18:04:13,812 [INFO] [InputInitializer {Map 1} #0] |read.AbstractDynamoDBInputFormat|: Configured read throughput: 1
2021-11-25 18:04:13,812 [INFO] [InputInitializer {Map 1} #0] |read.AbstractDynamoDBInputFormat|: Calculated to use 1 mappers
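The logs above are consistent with the missing throughput value falling back to a default of 1, which then collapses the mapper count. Below is a minimal sketch of that effect, not the connector's actual code: the class name, the `resolveThroughput` and `estimateMappers` helpers, the `UNITS_PER_MAPPER` constant, and the formula itself are all assumptions made for illustration. Only the input values (read percentage 0.2, 62 max map tasks, 31104 ReadCapacityUnits) come from the logs.

```java
public class ThroughputFallbackSketch {
    // Hypothetical constant: read capacity units one mapper is assumed to consume.
    // This value is NOT taken from the connector.
    static final double UNITS_PER_MAPPER = 100.0;

    /** Returns the parsed throughput, or the fallback when the property is missing or "null". */
    static long resolveThroughput(String configured, long fallback) {
        if (configured == null) {
            return fallback;
        }
        String value = configured.trim();
        if (value.isEmpty() || value.equals("null")) {
            return fallback;
        }
        return Long.parseLong(value);
    }

    /** Assumed formula: usable throughput divided by the per-mapper budget, clamped to [1, maxMapTasks]. */
    static int estimateMappers(long readThroughput, double readPercent, int maxMapTasks) {
        int byThroughput = (int) Math.floor(readThroughput * readPercent / UNITS_PER_MAPPER);
        return Math.max(1, Math.min(byThroughput, maxMapTasks));
    }

    public static void main(String[] args) {
        double readPercent = 0.2; // "Read percentage: 0.2" in the split generation log
        int maxMapTasks = 62;     // "Max number of cluster map tasks: 62"

        // Throughput missing from the job properties: falls back to 1.
        long missing = resolveThroughput(null, 1L);
        System.out.println(estimateMappers(missing, readPercent, maxMapTasks)); // prints 1

        // Throughput propagated from ProvisionedThroughput.ReadCapacityUnits.
        long expected = resolveThroughput("31104", 1L);
        System.out.println(estimateMappers(expected, readPercent, maxMapTasks)); // prints 62
    }
}
```

Under these assumptions, a throughput of 1 yields a single mapper, while the table's real 31104 ReadCapacityUnits would saturate the 62 available map tasks. That matches the gap between the observed and the expected behavior.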

I suspect this commit, added in version 4.11.0, introduced the problem, but I don't have much context on the change it made, so I need some help fixing this issue.