awslabs / emr-dynamodb-connector

Implementations of open source Apache Hadoop/Hive interfaces which allow for ingesting data from Amazon DynamoDB
Apache License 2.0

Support for arbitrary precision number attributes #167

Open ikonst opened 1 year ago

ikonst commented 1 year ago

In DynamoDB:

Numbers are variable length, with up to 38 significant digits. Leading and trailing zeroes are trimmed. The size of a number is approximately (length of attribute name) + (1 byte per two significant digits) + (1 byte).

The Hive connector chokes on numbers larger than a Java long can hold. Such attributes should probably map to DECIMAL in Hive and BigDecimal in Java.

For example, for the number 11888647184542023637, which lies in (2^63, 2^64), we get:

...
        at org.apache.hadoop.hive.dynamodb.DynamoDBObjectInspector.getColumnData(DynamoDBObjectInspector.java:104)
        at org.apache.hadoop.hive.dynamodb.DynamoDBObjectInspector.getStructFieldData(DynamoDBObjectInspector.java:73)
        at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.convert(ObjectInspectorConverters.java:420)
        at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.readRow(MapOperator.java:133)
        at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.access$200(MapOperator.java:91)
        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:554)
        ... 18 more
Caused by: java.lang.NumberFormatException: For input string: "11888647184542023637"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Long.parseLong(Long.java:592)
        at java.lang.Long.parseLong(Long.java:631)
        at org.apache.hadoop.hive.dynamodb.util.DynamoDBDataParser.getNumberObject(DynamoDBDataParser.java:240)
        at org.apache.hadoop.hive.dynamodb.type.HiveDynamoDBNumberType.getHiveData(HiveDynamoDBNumberType.java:43)
        at org.apache.hadoop.hive.dynamodb.DynamoDBObjectInspector.getColumnData(DynamoDBObjectInspector.java:98)
        ... 23 more
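
For illustration, here is a minimal standalone sketch of the failure and of the BigDecimal alternative. The class and method names (NumberAttributeSketch, parseNumber, fitsInLong) are hypothetical and not part of the connector; the trace above only tells us that DynamoDBDataParser.getNumberObject ends up in Long.parseLong, so an actual fix would presumably live there and in the Hive type mapping.

```java
import java.math.BigDecimal;

public class NumberAttributeSketch {

    // Hypothetical helper (not connector code): parse a DynamoDB number
    // string without assuming that it fits in a long.
    public static BigDecimal parseNumber(String raw) {
        return new BigDecimal(raw);
    }

    // Returns true only if the value is integral and within the long range.
    // longValueExact() throws ArithmeticException otherwise.
    public static boolean fitsInLong(BigDecimal value) {
        try {
            value.longValueExact();
            return true;
        } catch (ArithmeticException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String raw = "11888647184542023637"; // the value from the report

        // Same call that appears in the "Caused by" part of the trace.
        try {
            Long.parseLong(raw);
        } catch (NumberFormatException e) {
            System.out.println("Long.parseLong failed: " + e.getMessage());
        }

        // BigDecimal accepts the same input without loss.
        BigDecimal value = parseNumber(raw);
        System.out.println("BigDecimal parsed: " + value.toPlainString());
        System.out.println("significant digits: " + value.precision());
        System.out.println("fits in long: " + fitsInLong(value));
    }
}
```

A check like fitsInLong could let a reader keep returning a long for columns declared as BIGINT and only fall back to BigDecimal for DECIMAL columns, though that split is an assumption about how the fix might be structured.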

As a side note, since the connector does not support ProjectionExpression, there's also no way to skip this attribute, even when it isn't needed by the query.