awslabs / emr-dynamodb-connector

Implementations of open source Apache Hadoop/Hive interfaces which allow for ingesting data from Amazon DynamoDB
Apache License 2.0

Prototype for how to export with specific keys and sort boundary criteria #159

Open billonahill opened 2 years ago

billonahill commented 2 years ago

Prototype that exports rows matching a given range of rowKey and sortKey criteria using MapReduce. It expects the params to be set in DynamoDBExport. It assumes a row-key bucketing scheme of 0..9999, which can be overridden with dynamodb.row.key.min.value and dynamodb.row.key.max.value.
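To make the bucketing idea concrete, here is a minimal sketch of how an inclusive row-key range such as 0..9999 could be divided into contiguous sub-ranges, one per MR input split. This is not the code in this PR: the class `RowKeySplitter`, the method `split`, and the even-partition strategy are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of emr-dynamodb-connector.
public class RowKeySplitter {

    /**
     * Divides the inclusive range [min, max] into at most numSplits contiguous
     * sub-ranges whose sizes differ by at most one. Each element of the result
     * is a two-element array {rangeMin, rangeMax}, both inclusive.
     */
    public static List<long[]> split(long min, long max, int numSplits) {
        if (max < min || numSplits < 1) {
            throw new IllegalArgumentException("need min <= max and numSplits >= 1");
        }
        long total = max - min + 1;
        // Never create more splits than there are row keys.
        int n = (int) Math.min((long) numSplits, total);
        long base = total / n;
        long remainder = total % n;

        List<long[]> ranges = new ArrayList<>();
        long start = min;
        for (int i = 0; i < n; i++) {
            // The first `remainder` ranges take one extra key each.
            long size = base + (i < remainder ? 1 : 0);
            ranges.add(new long[] {start, start + size - 1});
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Assumed default bucketing of 0..9999 spread over 4 mappers.
        for (long[] r : split(0, 9999, 4)) {
            System.out.println(r[0] + ".." + r[1]);
        }
    }
}
```

With the assumed defaults and four splits, each mapper would own 2,500 consecutive row keys and query them one at a time.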

Configuration like this triggers the new functionality, which queries DynamoDB for explicit rowKeys in parallel via MR:

    jobConf.setInputFormat(MultipleRowKeyExportInputFormat.class); // override defaults
    jobConf.set(DynamoDBConstants.INDEX_NAME, "some-gsi-index");
    jobConf.set(DynamoDBConstants.ROW_KEY_NAME, "some-row-key");
    jobConf.set(DynamoDBConstants.SORT_KEY_NAME, "some-time-field");
    jobConf.setDouble(DynamoDBConstants.ROW_SAMPLE_PERCENT, 0.001);
    jobConf.setLong(DynamoDBConstants.SORT_KEY_MIN_VALUE, 1596170944L);
    jobConf.setLong(DynamoDBConstants.SORT_KEY_MAX_VALUE, 1596310207L);
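For readers unfamiliar with DynamoDB queries, the settings above map naturally onto a key condition of the form "partition key equals X and sort key between A and B". The sketch below shows what the per-row-key query parameters might look like. It is an illustration only: the class `KeyRangeQuerySketch` and its methods are hypothetical, numeric row keys are assumed from the 0..9999 bucketing, and plain maps stand in for the AWS SDK request types.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not part of emr-dynamodb-connector.
public class KeyRangeQuerySketch {

    // DynamoDB key condition: equality on the partition key plus a
    // BETWEEN range (inclusive on both ends) on the sort key.
    public static final String KEY_CONDITION = "#rk = :rk AND #sk BETWEEN :lo AND :hi";

    /** Maps the expression placeholders to the configured attribute names. */
    public static Map<String, String> attributeNames(String rowKeyName, String sortKeyName) {
        Map<String, String> names = new LinkedHashMap<>();
        names.put("#rk", rowKeyName);
        names.put("#sk", sortKeyName);
        return names;
    }

    /** Maps the value placeholders for one row key and the sort-key bounds. */
    public static Map<String, Long> attributeValues(long rowKey, long sortMin, long sortMax) {
        Map<String, Long> values = new LinkedHashMap<>();
        values.put(":rk", rowKey);
        values.put(":lo", sortMin);
        values.put(":hi", sortMax);
        return values;
    }

    public static void main(String[] args) {
        // Values taken from the configuration example; row key 42 is arbitrary.
        System.out.println(KEY_CONDITION);
        System.out.println(attributeNames("some-row-key", "some-time-field"));
        System.out.println(attributeValues(42L, 1596170944L, 1596310207L));
    }
}
```

In a real job these maps would be translated into a query request against the index named by INDEX_NAME, issued once per row key in the mapper's range.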

This is just a prototype, but would this be a desirable contribution?