awslabs / emr-dynamodb-connector

Implementations of open source Apache Hadoop/Hive interfaces which allow for ingesting data from Amazon DynamoDB
Apache License 2.0

Fix reflection logic of aws-java-sdk-v2 credential providers #201

Closed. smadurawe-oss closed this 1 month ago

smadurawe-oss commented 1 month ago

Issue #, if available: N/A

Description of changes: The DynamoDB connector uses reflection to load custom credential providers. When the connector was upgraded to aws-java-sdk-v2, it was only updated to account for the differing class paths. SDK v2 credential providers are no longer instantiated through constructors but through static create() methods, which the previous implementation did not handle. This PR handles that case and keeps a fallback to the original constructor-based logic to ensure backward compatibility.
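
The loading order described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the class `CredentialsProviderLoaderSketch`, the method `load`, and the two nested provider classes are hypothetical stand-ins, so the example runs without the AWS SDK on the classpath.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical sketch: names below are illustrative and not taken from the connector.
final class CredentialsProviderLoaderSketch {

    /** Stand-in for an SDK v2 style provider: private constructor, static create(). */
    static final class FactoryStyleProvider {
        private FactoryStyleProvider() { }

        public static FactoryStyleProvider create() {
            return new FactoryStyleProvider();
        }
    }

    /** Stand-in for a provider that exposes a public no-argument constructor. */
    static final class ConstructorStyleProvider {
        public ConstructorStyleProvider() { }
    }

    /** Tries a public static create() first, then falls back to the no-arg constructor. */
    static Object load(String className) {
        try {
            Class<?> clazz = Class.forName(className);
            try {
                Method create = clazz.getMethod("create");
                if (Modifier.isStatic(create.getModifiers())) {
                    return create.invoke(null);
                }
            } catch (NoSuchMethodException e) {
                // No create() method: fall through to the constructor-based path.
            }
            return clazz.getConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("Could not load credentials provider " + className, e);
        }
    }

    public static void main(String[] args) {
        Object viaFactory = load(FactoryStyleProvider.class.getName());
        Object viaConstructor = load(ConstructorStyleProvider.class.getName());
        System.out.println(viaFactory.getClass().getSimpleName());
        System.out.println(viaConstructor.getClass().getSimpleName());
    }
}
```

Trying the factory method first means a provider with only a private constructor never reaches the constructor lookup, while classes written against the older convention still load through the fallback.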

Testing done: mvn clean install

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for EMRDynamoDBConnector 5.5.0-SNAPSHOT:
[INFO]
[INFO] EMRDynamoDBConnector ............................... SUCCESS [  0.258 s]
[INFO] EMRDynamoDBHadoop .................................. SUCCESS [01:01 min]
[INFO] EMRDynamoDBConnectorShims .......................... SUCCESS [  0.003 s]
[INFO] ShimsCommon ........................................ SUCCESS [  0.394 s]
[INFO] Hive2Shims ......................................... SUCCESS [  0.253 s]
[INFO] Hive3Shims ......................................... SUCCESS [  0.137 s]
[INFO] ShimsLoader ........................................ SUCCESS [  0.142 s]
[INFO] EMRDynamoDBHive .................................... SUCCESS [  2.158 s]
[INFO] EMRDynamoDBTools ................................... SUCCESS [  1.086 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:06 min
[INFO] Finished at: 2024-07-25T16:12:29-07:00
[INFO] ------------------------------------------------------------------------

Manual testing: old behavior, with the following property set:

  <property>
    <name>dynamodb.customAWSCredentialsProvider</name>
    <value>org.apache.hadoop.emr.ddb.shaded.software.amazon.awssdk.auth.credentials.InstanceProfileCredentialsProvider</value>
  </property>

Running a Hive query through the DynamoDB connector then fails with:

hive> CREATE EXTERNAL TABLE ddb_features
    >     (feature_id   BIGINT,
    >     feature_name  STRING,
    >     feature_class STRING,
    >     state_alpha   STRING,
    >     prim_lat_dec  DOUBLE,
    >     prim_long_dec DOUBLE,
    >     elev_in_ft    BIGINT)
    > STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    > TBLPROPERTIES(
    >     "dynamodb.table.name" = "msugath-features",
    >     "dynamodb.column.mapping"="feature_id:Id,feature_name:Name,feature_class:Class,state_alpha:State,prim_lat_dec:Latitude,prim_long_dec:Longitude,elev_in_ft:Elevation"
    > );
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: java.lang.NoSuchMethodException: org.apache.hadoop.emr.ddb.shaded.software.amazon.awssdk.auth.credentials.InstanceProfileCredentialsProvider.<init>()
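
The `<init>()` in the exception above refers to a public no-argument constructor, which a constructor-only reflective lookup cannot find on a class that exposes only a static factory. A minimal sketch that reproduces the same kind of failure, using a hypothetical `FactoryOnlyProvider` rather than the shaded SDK class:

```java
// Hypothetical sketch: FactoryOnlyProvider is an illustrative stand-in, not an SDK class.
final class NoArgConstructorFailureSketch {

    /** Provider that can only be obtained through its static create() method. */
    static final class FactoryOnlyProvider {
        private FactoryOnlyProvider() { }

        public static FactoryOnlyProvider create() {
            return new FactoryOnlyProvider();
        }
    }

    /** Returns the exception message from a constructor-only lookup, or null if it succeeds. */
    static String tryConstructorLookup(Class<?> clazz) {
        try {
            clazz.getConstructor(); // only finds public constructors
            return null;
        } catch (NoSuchMethodException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryConstructorLookup(FactoryOnlyProvider.class));
    }
}
```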

New behavior, with the same property set:

hive> CREATE EXTERNAL TABLE ddb_features
    >     (feature_id   BIGINT,
    >     feature_name  STRING,
    >     feature_class STRING,
    >     state_alpha   STRING,
    >     prim_lat_dec  DOUBLE,
    >     prim_long_dec DOUBLE,
    >     elev_in_ft    BIGINT)
    > STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    > TBLPROPERTIES(
    >     "dynamodb.table.name" = "msugath-features",
    >     "dynamodb.column.mapping"="feature_id:Id,feature_name:Name,feature_class:Class,state_alpha:State,prim_lat_dec:Latitude,prim_long_dec:Longitude,elev_in_ft:Elevation"
    > );
WARNING: Configured write throughput of the dynamodb table msugath-features is less than the cluster map capacity. ClusterMapCapacity: 10 WriteThroughput: 1
WARNING: Writes to this table might result in a write outage on the table.
OK
Time taken: 2.039 seconds
hive> SELECT DISTINCT feature_class
    > FROM ddb_features
    > ORDER BY feature_class;
Query ID = hadoop_20240725231537_249cb0f1-a82e-4cc6-961e-3ae5383cc15b
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1721942632345_0002)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
Reducer 2 ...... container     SUCCEEDED      2          2        0        0       0       0
Reducer 3 ...... container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 8.28 s
----------------------------------------------------------------------------------------------
OK
Arch
Bar
Basin
Bay
Beach
Bend
Cape
Cliff
Crossing
Falls
Flat
Forest
Gap
Glacier
Island
Lake
Lava
Levee
Range
Ridge
Slope
Spring
Stream
Summit
Swamp
Trail
Valley
Time taken: 11.248 seconds, Fetched: 27 row(s)

New behavior (fallback logic, exercised by configuring a credential provider that has a default constructor):

hive> SELECT DISTINCT feature_class
    > FROM ddb_features
    > ORDER BY feature_class;
WARNING: Configured write throughput of the dynamodb table msugath-features is less than the cluster map capacity. ClusterMapCapacity: 10 WriteThroughput: 1
WARNING: Writes to this table might result in a write outage on the table.
Query ID = hadoop_20240725232003_663627ff-1d03-47c2-9d3e-1ab047672086
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1721942632345_0003)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
Reducer 2 ...... container     SUCCEEDED      2          2        0        0       0       0
Reducer 3 ...... container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 8.27 s
----------------------------------------------------------------------------------------------
OK
Arch
Bar
Basin
Bay
Beach
Bend
Cape
Cliff
Crossing
Falls
Flat
Forest
Gap
Glacier
Island
Lake
Lava
Levee
Range
Ridge
Slope
Spring
Stream
Summit
Swamp
Trail
Valley
Time taken: 12.618 seconds, Fetched: 27 row(s)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.