aws / aws-msk-iam-auth

Enables developers to use AWS Identity and Access Management (IAM) to connect to their Amazon Managed Streaming for Apache Kafka (Amazon MSK) clusters.
Apache License 2.0

IRSA (IAM roles for Service Accounts) in EKS pod isn't respected #55

Open maxkochubey opened 2 years ago

maxkochubey commented 2 years ago

I am trying to run a Kafka consumer in an AWS-managed Kubernetes cluster (EKS) with the IAM roles for service accounts feature enabled, so far without success.

The EKS cluster runs in the AWS account with ID 111111111111. The consumer should connect from there to an AWS-managed MSK cluster with IAM authentication. The MSK cluster is located in the AWS account with ID 222222222222. I am using the generic Kafka 2.8.1 binaries and "aws-msk-iam-auth" version 1.1.2. Inside the Kubernetes pod container, the library JAR is located in "/opt/kafka-libs" and the environment variable "CLASSPATH=/opt/kafka-libs/*" is exported.

When I export the AWS credentials of an IAM user created in account 222222222222 that has access to the MSK cluster topics, everything works fine and messages are received:

export AWS_ACCESS_KEY_ID=AKIAEXAMPLEEXAMPLE
export AWS_SECRET_ACCESS_KEY=moo6ouPeih5viDay9ei7eugaejeeHaes0eephe5a
# cat client-iam.properties
security.protocol = SASL_SSL
sasl.mechanism = AWS_MSK_IAM
sasl.jaas.config = software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class = software.amazon.msk.auth.iam.IAMClientCallbackHandler

To make my setup more secure and get rid of credentials stored in Kubernetes secrets, I decided to set up IRSA and use it for MSK authentication. In account 111111111111 I created an IAM role mapped to the serviceAccount used by the Kubernetes pod (arn:aws:iam::111111111111:role/test-eks-assumer). This role is allowed to assume an IAM role in account 222222222222 that has all the required policies attached (the same policies as the IAM user whose credentials were used previously).

So now, when the pod is started in EKS, it has the following environment vars defined by EKS:

bash-4.2# printenv | grep AWS
AWS_ROLE_ARN=arn:aws:iam::111111111111:role/test-eks-assumer
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token

Assuming the role works as well: I checked it in a Kubernetes pod started from the amazon/aws-cli image with the same serviceAccount/IAM role mapping. aws sts get-caller-identity returns the role arn:aws:iam::111111111111:role/test-eks-assumer, and aws sts assume-role --role-arn "arn:aws:iam::222222222222:role/test-msk-consumer" --role-session-name "test-cli" works fine.
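A quick way to confirm what the injected variables actually point to is to read the token file and look at its claims. The sketch below is only an illustration: the function names are made up for this example, it assumes the projected service account token is a JWT (as IRSA tokens are), and it decodes the payload without verifying the signature, so it is a diagnostic aid and nothing more.

```python
import base64
import json
import os


def decode_jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT without verifying its signature."""
    payload = token.strip().split(".")[1]
    # JWT segments are base64url-encoded without padding; restore it first.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def check_irsa_environment(env=os.environ) -> dict:
    """Summarise the IRSA settings found in the given environment mapping.

    Hypothetical helper: it only reads AWS_ROLE_ARN and
    AWS_WEB_IDENTITY_TOKEN_FILE, the two variables shown in the pod above.
    """
    role_arn = env.get("AWS_ROLE_ARN")
    token_file = env.get("AWS_WEB_IDENTITY_TOKEN_FILE")
    if not role_arn or not token_file:
        raise RuntimeError("AWS_ROLE_ARN or AWS_WEB_IDENTITY_TOKEN_FILE is not set")
    with open(token_file, encoding="utf-8") as handle:
        claims = decode_jwt_claims(handle.read())
    return {
        "role_arn": role_arn,
        "subject": claims.get("sub"),   # expected: system:serviceaccount:<namespace>:<name>
        "audience": claims.get("aud"),  # expected to include sts.amazonaws.com
    }
```

Calling check_irsa_environment() inside the pod and printing the result should show the role ARN from AWS_ROLE_ARN together with the service account named in the token's subject claim. It says nothing about which credentials a given client library ends up using.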

With the static AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables removed, I tried the following client config:

security.protocol = SASL_SSL
sasl.mechanism = AWS_MSK_IAM
sasl.jaas.config = software.amazon.msk.auth.iam.IAMLoginModule required awsRoleArn="arn:aws:iam::222222222222:role/test-msk-consumer" awsRoleSessionName="test-msk" awsDebugCreds=true;
sasl.client.callback.handler.class = software.amazon.msk.auth.iam.IAMClientCallbackHandler

Here, arn:aws:iam::222222222222:role/test-msk-consumer is the role that the IAM role arn:aws:iam::111111111111:role/test-eks-assumer (mapped to the pod's serviceAccount) is allowed to assume.

However, the library returns the following:

[2022-02-02 20:59:39,215] INFO [Consumer clientId=consumer-1] Failed authentication with b-1.test-msk-iam.j79efy.c4.kafka.ap-northeast-1.amazonaws.com/10.62.42.73 (An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: Failed to find AWS IAM Credentials [Caused by aws_msk_iam_auth_shadow.com.amazonaws.services.securitytoken.model.AWSSecurityTokenServiceException: User: arn:aws:sts::111111111111:assumed-role/AmazonEksWorkerNode/i-01355dc3a5ed5d2e3 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::222222222222:role/test-msk-consumer (Service: AWSSecurityTokenService; Status Code: 403; Error Code: AccessDenied; Request ID: 86adb7fe-1bcb-410f-84db-de2157e5bcf5; Proxy: null)]) occurred when evaluating SASL token received from the Kafka Broker. Kafka Client will go to AUTHENTICATION_FAILED state.) (org.apache.kafka.common.network.Selector)
[2022-02-02 20:59:39,219] ERROR [Consumer clientId=consumer-1] Connection to node -1 (b-1.test-msk-iam.j79efy.c4.kafka.ap-northeast-1.amazonaws.com/10.62.42.73:9098) failed authentication due to: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: Failed to find AWS IAM Credentials [Caused by aws_msk_iam_auth_shadow.com.amazonaws.services.securitytoken.model.AWSSecurityTokenServiceException: User: arn:aws:sts::111111111111:assumed-role/AmazonEksWorkerNode/i-01355dc3a5ed5d2e3 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::222222222222:role/test-msk-consumer (Service: AWSSecurityTokenService; Status Code: 403; Error Code: AccessDenied; Request ID: 86adb7fe-1bcb-410f-84db-de2157e5bcf5; Proxy: null)]) occurred when evaluating SASL token received from the Kafka Broker. Kafka Client will go to AUTHENTICATION_FAILED state. (org.apache.kafka.clients.NetworkClient)

As you can see, aws-msk-iam-auth tries to use the IAM role of the EKS worker node instance and does not take into account the role defined by IRSA in the AWS_ROLE_ARN environment variable. To me, it looks very similar to https://github.com/aws/aws-sdk-java-v2/issues/1470.
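The same comparison can be made mechanically when scanning client logs. The following is a rough sketch, not part of the library: the regular expression is written against the AccessDenied message format shown in the log above, and the helper names are invented for illustration.

```python
import re

# Pattern based on the STS AccessDenied text in the log above (an assumption
# about the message format, not a documented contract).
ACCESS_DENIED = re.compile(
    r"User: arn:aws:sts::(?P<account>\d{12}):assumed-role/"
    r"(?P<role>[^/\s]+)/(?P<session>\S+) is not authorized to perform: "
    r"(?P<action>\S+) on resource: (?P<resource>\S+)"
)


def caller_from_error(message: str):
    """Return the caller details from an AccessDenied message, or None if absent."""
    match = ACCESS_DENIED.search(message)
    return match.groupdict() if match else None


def role_name(role_arn: str) -> str:
    """Last path segment of an IAM role ARN, i.e. the role name."""
    return role_arn.rsplit("/", 1)[-1]


def used_irsa_role(message: str, irsa_role_arn: str) -> bool:
    """True when the role that called STS matches the role in AWS_ROLE_ARN."""
    caller = caller_from_error(message)
    return caller is not None and caller["role"] == role_name(irsa_role_arn)
```

For the log lines above, the extracted role is AmazonEksWorkerNode while AWS_ROLE_ARN names test-eks-assumer, so used_irsa_role returns False.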

P.S. I could not get the awsDebugCreds option to work: it does not seem to have any effect :(

sayantacC commented 2 years ago

@maxkochubey Thanks for the detailed report. I will look into it. In the meantime, for awsDebugCreds=true to print its debug output, the client-side log level also needs to be set to DEBUG. We did this to make sure that credential debugging, which can be sensitive, is not turned on by mistake.

sayantacC commented 2 years ago

Would it be possible for you to attach DEBUG-level logs from the client? It will make the issue far easier to debug.

maxkochubey commented 2 years ago

Hi @sayantacC, sure!

The consumer process was started in the following environment:

$ printenv | grep -E 'AWS|KAFKA' | sort
AWS_REGION=ap-northeast-1
AWS_ROLE_ARN=arn:aws:iam::111111111111:role/test-eks-assumer
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
KAFKA_HEAP_OPTS=-Xms512m -Xmx2048m
KAFKA_OPTS=-Dlog4j.configuration=file:/opt/kafka-mm/log4j.properties
$ cat /opt/kafka-mm/log4j.properties
log4j.rootLogger=DEBUG, stderr
log4j.appender.stderr=org.apache.log4j.ConsoleAppender
log4j.appender.stderr.layout=org.apache.log4j.PatternLayout
log4j.appender.stderr.layout.ConversionPattern=[%d] %p %m (%c)%n
log4j.appender.stderr.Target=System.err
$ cat /opt/kafka-mm/client-iam.properties
security.protocol = SASL_SSL
sasl.mechanism = AWS_MSK_IAM
sasl.jaas.config = software.amazon.msk.auth.iam.IAMLoginModule required awsRoleArn="arn:aws:iam::222222222222:role/test-msk-consumer" awsRoleSessionName="msk-test" awsDebugCreds=true;
sasl.client.callback.handler.class = software.amazon.msk.auth.iam.IAMClientCallbackHandler

Here is the consumer debug log: kafka-console-consumer.log

I also ran an aws-cli container in the same Kubernetes pod and checked that assuming the role works on its own:

bash-4.2# printenv | grep AWS
AWS_ROLE_ARN=arn:aws:iam::111111111111:role/test-eks-assumer
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token

bash-4.2# aws sts get-caller-identity
{
    "UserId": "AROA55BHPCBA3F246W7I7:botocore-session-1643875227",
    "Account": "111111111111",
    "Arn": "arn:aws:sts::111111111111:assumed-role/test-eks-assumer/botocore-session-1643875227"
}

bash-4.2# aws sts assume-role --role-arn arn:aws:iam::222222222222:role/test-msk-consumer --role-session-name aws-cli-pod
{
    "Credentials": {
        "AccessKeyId": "ASIAEXAMPLETNSVZ7HV2",
        "SecretAccessKey": "PcLSb5dCu+CGZuAQ7f6a3WmZcMmfgFPNFmy/L3GC",
        "SessionToken": "IQoJb3JpZ2luX2VjEDgaDmFwLXNvdXRoZWFzdC0xIkgwRgIhAPOlEs0qrr4cRVgFWFcONPyGR6HA4Nhf45cdU1zhMS4AAiEA6uNZS9dyqHU1FxCZUsOgRY+YuQWFWFD5YxprF8P/IbEqmAIIcRAAGgw1NjgwMzI5NDQ4NzAiDDqOIb3BN4nMipxU4Cr1AT3b07gAqxmZT/zfvO1XtvFuVVBqBSH2kQVZcPMe/KXiLHkFVmLNTt4JV/O8aRB/OXXjM+m5+OtQ9cDkNaY3mO6+C9JiJf+tO926oXkEy1e3yLTuti8m281AZ37Wa4Lc9ZRNtmSsnlkWuuWJIg/AwYqf2MkI1UgDliWlC+boORau7JxMFjb427JsKrfzbbNht7tO0SJ5xn0uX9ZuU4Dh68K853Kv5AKFFb1gQ1LT3uSmQaILFH0PQHqIPz8V/wplmOuMeNKYCXhysRjkqmgReTRjzw5Q6XwndT1XNLXZzDzEEAUYnUeS/5JA2Fm7j2Ygh4qDEqnbMO+X7o8GOpwBFqf9RGaCYK+VNxW8Egv1HBl4eEtLrNqtBYMBfQaWdlmGBqyIaIdrj1dpDFxlVdeo8z4tYaznyE4OzXAKl+imfkuADljVw/JFb1sKxBlJXBFfmUUg9SXflk5QxmERyP8o/fPcd0MGEv8Z8BGGkeiG4Hf+RLugXIYnMwjRsc00/CgiPW7MuFq0xHeFj42sKK9Km6mXfajoLWlU4Qk8",
        "Expiration": "2022-02-03T09:01:51+00:00"
    },
    "AssumedRoleUser": {
        "AssumedRoleId": "AROAYIQLG4LTBFJG3LJOW:aws-cli-pod",
        "Arn": "arn:aws:sts::222222222222:assumed-role/test-msk-consumer/aws-cli-pod"
    }
}

Thank you!

sayantacC commented 2 years ago

@maxkochubey Thanks for the debug logs.

You have surmised correctly that the role in account 222222222222 is being assumed using the credentials of the EKS worker node role rather than those passed in by IRSA. The role chaining is not doing what the AWS CLI does.
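To make the difference concrete, here is a toy model of the two chains. It does not call STS and is not how the library or the AWS CLI is implemented; the trust relationships are assumptions taken from the description in this thread, and the worker node role ARN is inferred from the assumed-role name in the log.

```python
EKS_ROLE = "arn:aws:iam::111111111111:role/test-eks-assumer"
MSK_ROLE = "arn:aws:iam::222222222222:role/test-msk-consumer"
# Assumed ARN, derived from the assumed-role name in the error message.
WORKER_NODE_ROLE = "arn:aws:iam::111111111111:role/AmazonEksWorkerNode"
IRSA_TOKEN = "irsa-web-identity-token"  # stand-in for the pod's OIDC token

# Who each role's trust policy allows, as described in this thread (assumption).
TRUST = {
    EKS_ROLE: {IRSA_TOKEN},
    MSK_ROLE: {EKS_ROLE},
}


class AccessDenied(Exception):
    """Raised when a caller is not trusted by the target role."""


def assume(caller: str, target_role: str) -> str:
    """Return the target role as the new identity if its trust policy allows the caller."""
    if caller not in TRUST.get(target_role, set()):
        raise AccessDenied(f"{caller} is not authorized to assume {target_role}")
    return target_role


def resolve(base_identity: str, chain: list) -> str:
    """Walk a chain of roles; each hop uses the identity obtained by the previous one."""
    identity = base_identity
    for role in chain:
        identity = assume(identity, role)
    return identity
```

In this model, resolve(IRSA_TOKEN, [EKS_ROLE, MSK_ROLE]) succeeds, which corresponds to what the CLI pod does. resolve(WORKER_NODE_ROLE, [MSK_ROLE]) raises AccessDenied, which corresponds to the error reported by the Kafka client.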

I will try to look into solving this problem, but it is likely to take me some time to make the required change, which will most likely involve switching to the AWS SDK v2 credential providers.

In the meantime, is it possible for you to change the IRSA role to be the one that has cross-account access? The procedure described here avoids the additional indirection of having the IRSA role assume the cross-account role.

maxkochubey commented 2 years ago

Thanks, @sayantacC - will try it and get back with result.

sayantacC commented 2 years ago

Marking this as an enhancement.

TheRhino04 commented 2 years ago

Any progress on updating this to use aws sdk v2 credential providers? We are trying to accomplish the same thing as addressed above.

sayantacC commented 2 years ago

@TheRhino04 Sorry, I have not made progress on this yet. I will try to make some progress over the next few weeks.

In the meantime, is it possible for you to try the suggestion mentioned earlier and change the IRSA role to be the one that has cross-account access? The procedure described here avoids the additional indirection of having the IRSA role assume the cross-account role.

eligithubacc commented 2 years ago

Hi @sayantacC, thank you for taking a look at this. We are also interested in being able to use IRSA. Currently only the worker node role is used, which is a blocker from a security perspective.

Miscreancy commented 2 years ago

Hi @sayantacC - we're also running into this on EKS.

We went a step further and tried to implement the suggestion you mentioned (having the service account bind to a cross-account role and using MSK IAM without the additional assume). We confirmed that the role was respected by calling aws sts get-caller-identity from the CLI, which returned the cross-account role. When we attempted to use MSK, the library did not respect the Web Identity Token role at all and still returned errors about missing permissions for the EC2 node identity.

We're going to be chasing this via our AWS account manager to see if we can get some movement on this. I note you stated 24 days ago that you were going to try to make some progress on it - has any progress been made at all?

sayantacC commented 2 years ago

@Miscreancy, @TheRhino04, @eligithubacc

I have had a chance to make some progress. I have been working on this change in the migrate_to_v2 branch. I have verified that all existing functionality works with this change. However, I have not yet had the chance to set up an EKS cluster with IRSA to test it on. I will try to get the test setup ready soon and then work on the release.

In the meantime, if you wish to give that branch a try, I would love to learn whether it solves your problem.

aidan-melen commented 1 year ago

We got IRSA to work by letting the pod use the default role chain i.e. not specifying awsRoleArn.

Please see https://github.com/aidanmelen/terraform-kubernetes-confluent-platform/blob/main/examples/hybrid_aws_msk/confluent_platform_sasl_iam_secure/main.tf#L40-L57 for more information.

stalbot15 commented 1 year ago

Any update here? The given workarounds are not sufficient for my use case. My pod reads from one MSK cluster cross-account and writes to another MSK cluster in the same account as the EKS cluster.

It's possible the federated access works, but due to security restrictions within my organization, I am not able to give federated access to OIDC providers cross-account. Therefore, sts:AssumeRole is the preferred method of cross-account access.
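For reference, the per-client configuration this use case needs might look like the sketch below, which only builds the property values. The option names (awsRoleArn, awsRoleSessionName) are the ones used earlier in this thread; the helper functions and the session name are hypothetical. Whether the awsRoleArn hop starts from the IRSA identity is precisely what this issue is about, so this shows the intended configuration and is not a confirmed fix.

```python
def jaas_config(role_arn=None, session_name=None) -> str:
    """Build the sasl.jaas.config value; role options are added only when given."""
    options = []
    if role_arn:
        options.append(f'awsRoleArn="{role_arn}"')
    if session_name:
        options.append(f'awsRoleSessionName="{session_name}"')
    suffix = (" " + " ".join(options)) if options else ""
    return f"software.amazon.msk.auth.iam.IAMLoginModule required{suffix};"


def client_properties(role_arn=None, session_name=None) -> dict:
    """Client properties for a single Kafka client, as a plain dict."""
    return {
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "AWS_MSK_IAM",
        "sasl.jaas.config": jaas_config(role_arn, session_name),
        "sasl.client.callback.handler.class":
            "software.amazon.msk.auth.iam.IAMClientCallbackHandler",
    }


# Consumer: cross-account cluster, reached through an assumed role
# (ARN reused from this thread as a placeholder; session name is made up).
consumer = client_properties(
    role_arn="arn:aws:iam::222222222222:role/test-msk-consumer",
    session_name="msk-reader",
)

# Producer: same-account cluster, no awsRoleArn, so default credentials apply.
producer = client_properties()
```

Each dict can be written out as a separate properties file or passed to the respective client, so the consumer and the producer do not have to share one identity.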

d-t-w commented 1 year ago

Hi @stalbot15, in my case the issue was directly related to this ticket that we raised in the aws-sdk-java-v2 project.

https://github.com/aws/aws-sdk-java-v2/issues/3555

If you are experiencing the same dependency-related, IRSA-killing issue as us, you may be able to circumvent it with some of the approaches listed on that ticket.

Basically if you include IAM Auth and AWS Glue libraries in a project that uses IRSA you will have a bad time unless you take further action.

wildtapir commented 1 year ago

Thank you @sayantacC for the updates. I would be happy to test your code on EKS. How should I configure pom.xml for testing your changes?

wildtapir commented 1 year ago

@sayantacC I am trying to build the migrate_to_v2 branch locally with Java 17 and I get:

> Task :compileJava FAILED
./aws-msk-iam-auth/src/main/java/software/amazon/msk/auth/iam/internals/AuthenticationResponse.java:28: error: cannot find symbol
@Getter(onMethod = @__(@JsonIgnore))
                    ^
  symbol: class __
1 error

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':compileJava'.
> java.lang.IllegalAccessError: class lombok.javac.apt.LombokProcessor (in unnamed module @0x2e26bd7f) cannot access class com.sun.tools.javac.processing.JavacProcessingEnvironment (in module jdk.compiler) because module jdk.compiler does not export com.sun.tools.javac.processing to unnamed module @0x2e26bd7f

gradle:

------------------------------------------------------------
Gradle 7.6.1
------------------------------------------------------------

Build time:   2023-02-24 13:54:42 UTC
Revision:     3905fe8ac072bbd925c70ddbddddf4463341f4b4

Kotlin:       1.7.10
Groovy:       3.0.13
Ant:          Apache Ant(TM) version 1.10.11 compiled on July 10 2021
JVM:          17.0.6 (Azul Systems, Inc. 17.0.6+10-LTS)
OS:           Mac OS X 13.3.1 aarch64

rajarshp commented 8 months ago

@sayantacC We are facing the same issue. I raised a new ticket for it, as there has been no update for the last year:

https://github.com/aws/aws-msk-iam-auth/issues/159