aws / aws-sdk-java

The official AWS SDK for Java 1.x. The AWS SDK for Java 2.x is available here: https://github.com/aws/aws-sdk-java-v2/
https://aws.amazon.com/sdkforjava
Apache License 2.0

Writing on AWS S3 bucket not working as expected with Java Spark Dataframe #2983

Closed kharsh032 closed 1 year ago

kharsh032 commented 1 year ago

Describe the bug

I am trying to write to an S3 bucket that lives in another (cross-account) AWS account. A transit gateway is already in place between the two accounts, and all the required access has been granted; we have cross-checked the permissions from both sides. We are able to list the objects in the S3 bucket, but when we try to write a Spark DataFrame to it, the correct AWS account and role are not being picked up. I can't share account numbers since this is a company-internal issue, but I can share a code snippet here. The approach is to assume the role, override the Spark configuration with the resulting access key, secret key, and session token, and then write the DataFrame to the bucket.

Expected Behavior

Ideally, it should assume the specified role and write the Spark DataFrame as a CSV file to the required S3 bucket.

Current Behavior

Currently, I am getting this exception while writing a DataFrame to the S3 bucket:

java.nio.file.AccessDeniedException: s3a://xxxx-temp-bucket/expert-schedule-temp/2023-05-26: getFileStatus on s3a://xxxx-temp-bucket/expert-schedule-temp/2023-05-26: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: AG070SP0KABBK9QK; S3 Extended Request ID: mqAKgNjgQMX+uCXWI0MWDwfZfaZ+/F5GOA7IxtwPGNd4n+p1dRnXHxaOx0sdzD7BqCsxoYZ8B7U=; Proxy: null), S3 Extended Request ID: mqAKgNjgQMX+uCXWI0MWDwfZfaZ+/F5GOA7IxtwPGNd4n+p1dRnXHxaOx0sdzD7BqCsxoYZ8B7U=:403 Forbidden

Reproduction Steps

I can't share the repo or a link since it is internal to the company, but here is the relevant code snippet:

public static void writeToFile(SparkSession spark, Properties properties, Dataset scheduleDataDf, String fileFormat) {
        String path = "";
        String roleArn = properties.getProperty("s3.roleArn");
        String bucketName = properties.getProperty("s3.bucket");

        String clientRegion = Regions.US_WEST_2.getName();
        String roleSessionName = "AssumeRoleSession1";

        AWSSecurityTokenService stsClient = AWSSecurityTokenServiceClientBuilder.standard()
                .withRegion(clientRegion)
                .build();

        AssumeRoleRequest roleRequest = new AssumeRoleRequest()
                .withRoleArn(roleArn)
                .withRoleSessionName(roleSessionName);
        AssumeRoleResult roleResponse = stsClient.assumeRole(roleRequest);
        Credentials sessionCredentials = roleResponse.getCredentials();
        BasicSessionCredentials awsCredentials = new BasicSessionCredentials(
                sessionCredentials.getAccessKeyId(),
                sessionCredentials.getSecretAccessKey(),
                sessionCredentials.getSessionToken());

        spark.sparkContext().hadoopConfiguration().set("fs.s3a.access.key", sessionCredentials.getAccessKeyId());
        spark.sparkContext().hadoopConfiguration().set("fs.s3a.secret.key", sessionCredentials.getSecretAccessKey());
        spark.sparkContext().hadoopConfiguration().set("fs.s3a.session.token", sessionCredentials.getSessionToken());
        spark.sparkContext().hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");

        AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
                .withCredentials(new AWSStaticCredentialsProvider(awsCredentials))
                .withRegion(clientRegion)
                .build();

        String current_date = ZonedDateTime.now().format(DateTimeFormatter.ofPattern("yyyy-MM-dd"));
        String temp_key_prefix = "xxxx-temp-bucket/" + current_date + "/";
        path = "s3a://" + bucketName + "/" + temp_key_prefix;
        scheduleDataDf.coalesce(1).write().format(fileFormat).option("header", "true").mode("overwrite").save(path);
    log.info("File written successfully to path: {}", path);
}

Possible Solution

I don't know. I already talked to the AWS support team, but they asked me to reach out to the aws-sdk-java team.

Additional Information/Context

No response

AWS Java SDK version used

1.1.0

JDK version used

Java 11

Operating System and version

Mac OS

debora-ito commented 1 year ago

If the permissions for the role to be assumed are correct, the one thing I would check is that the credentials used in the API call are actually the credentials you are expecting. You can check this by calling sts get-caller-identity.
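For example, you can do the same check in code with the Java SDK v1. This is a minimal sketch, assuming the same BasicSessionCredentials and clientRegion from your snippet above:

    // Build an STS client on top of the assumed-role session credentials
    AWSSecurityTokenService sts = AWSSecurityTokenServiceClientBuilder.standard()
            .withCredentials(new AWSStaticCredentialsProvider(awsCredentials))
            .withRegion(clientRegion)
            .build();

    // GetCallerIdentity reports which account and ARN these credentials resolve to
    GetCallerIdentityResult identity = sts.getCallerIdentity(new GetCallerIdentityRequest());
    System.out.println("Account: " + identity.getAccount());
    System.out.println("ARN:     " + identity.getArn());

If the printed account and ARN are the assumed role in the target account, the credentials themselves are fine and the problem is elsewhere (for example in how Spark resolves them).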

I'd also triple-check that the assumed role really has the right permissions. Speaking from experience with previous access-denied cases, oftentimes the role didn't have all the necessary permissions.

kharsh032 commented 1 year ago

Hi team, I have already tried calling get-caller-identity and have cross-verified all the permissions on both the service side and the AWS account side; everything looks good. The issue is that when we write a file via Spark with scheduleDataDf.coalesce(1).write().format(fileFormat).option("header", "true").mode("overwrite").save(path);, the exception is thrown at that exact line. Spark is not picking up the correct AWS account, which has all the necessary permissions, even after we override the configuration in code as shown in the snippet above. In other words, whatever credentials we set via the aws-java-sdk, Spark is not using that AWS account and configuration. Can you please help us understand why? Let me know if we can connect to discuss the issue in more detail. Thanks.
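One variation worth trying here, as a sketch only: it assumes the Hadoop version in use ships org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider and that S3A is otherwise falling back to its default credential chain instead of the session token set in the snippet. It tells S3A explicitly to treat the three configured values as one set of temporary STS credentials:

    // Assumption: without this, S3A may resolve credentials from its default
    // provider chain (env vars, instance profile, etc.) rather than the
    // assumed-role session credentials set above.
    spark.sparkContext().hadoopConfiguration().set(
            "fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider");
    // With this provider, S3A reads fs.s3a.access.key, fs.s3a.secret.key and
    // fs.s3a.session.token together as temporary credentials.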