aws / aws-msk-iam-auth

Enables developers to use AWS Identity and Access Management (IAM) to connect to their Amazon Managed Streaming for Apache Kafka (Amazon MSK) clusters.
Apache License 2.0
138 stars 65 forks source link

Re-authentication fails with OAUTHBEARER when awsRoleArn is specified #143

Closed jamielwhite closed 6 months ago

jamielwhite commented 7 months ago

I have a Kafka consumer which is failing to re-authenticate. The consumer works for the first 15 minutes, but it fails once the credentials expire despite the logs indicating it has refreshed the credentials. I've replicated this issue with a Java application as well as the kafka-console-consumer.

Here are the logs indicating the login is refreshed at the 10 minute mark, but the consumer fails to re-authenticate at 15 minutes:

2023-11-28T16:26:46.206730136Z [main] INFO org.apache.kafka.common.security.oauthbearer.internals.expiring.ExpiringCredentialRefreshingLogin - Successfully logged in.
2023-11-28T16:26:46.210633781Z [kafka-expiring-relogin-thread-kafka-cluster] INFO org.apache.kafka.common.security.oauthbearer.internals.expiring.ExpiringCredentialRefreshingLogin - [Principal=:kafka-cluster]: Expiring credential re-login thread started.
2023-11-28T16:26:46.227207550Z [kafka-expiring-relogin-thread-kafka-cluster] INFO org.apache.kafka.common.security.oauthbearer.internals.expiring.ExpiringCredentialRefreshingLogin - [Principal=kafka-cluster]: Expiring credential valid from Tue Nov 28 16:26:46 GMT 2023 to Tue Nov 28 16:41:45 GMT 2023
2023-11-28T16:26:46.227312837Z [kafka-expiring-relogin-thread-kafka-cluster] INFO org.apache.kafka.common.security.oauthbearer.internals.expiring.ExpiringCredentialRefreshingLogin - [Principal=kafka-cluster]: Proposed refresh time of Tue Nov 28 16:39:17 GMT 2023 extends into the desired buffer time of 300 seconds before expiration, so refresh it at the desired buffer begin point, at Tue Nov 28 16:36:45 GMT 2023
2023-11-28T16:26:46.227352451Z [kafka-expiring-relogin-thread-kafka-cluster] INFO org.apache.kafka.common.security.oauthbearer.internals.expiring.ExpiringCredentialRefreshingLogin - [Principal=:kafka-cluster]: Expiring credential re-login sleeping until: Tue Nov 28 16:36:45 GMT 2023
2023-11-28T16:26:46.309291641Z [main] INFO org.apache.kafka.common.utils.AppInfoParser - Kafka version: 3.6.0
...
2023-11-28T16:36:45.017686979Z [kafka-expiring-relogin-thread-kafka-cluster] INFO org.apache.kafka.common.security.oauthbearer.internals.expiring.ExpiringCredentialRefreshingLogin - Initiating re-login for kafka-cluster, logout() still needs to be called on a previous login = true
2023-11-28T16:36:45.020997012Z [kafka-expiring-relogin-thread-kafka-cluster] INFO org.apache.kafka.common.security.oauthbearer.internals.expiring.ExpiringCredentialRefreshingLogin - [Principal=kafka-cluster]: Expiring credential valid from Tue Nov 28 16:36:45 GMT 2023 to Tue Nov 28 16:51:45 GMT 2023
2023-11-28T16:36:45.021021626Z [kafka-expiring-relogin-thread-kafka-cluster] INFO org.apache.kafka.common.security.oauthbearer.internals.expiring.ExpiringCredentialRefreshingLogin - [Principal=kafka-cluster]: Proposed refresh time of Tue Nov 28 16:49:18 GMT 2023 extends into the desired buffer time of 300 seconds before expiration, so refresh it at the desired buffer begin point, at Tue Nov 28 16:46:45 GMT 2023
2023-11-28T16:36:45.021094277Z [kafka-expiring-relogin-thread-kafka-cluster] INFO org.apache.kafka.common.security.oauthbearer.internals.expiring.ExpiringCredentialRefreshingLogin - [Principal=:kafka-cluster]: Expiring credential re-login sleeping until: Tue Nov 28 16:46:45 GMT 2023
2023-11-28T16:41:45.975747819Z [Thread-1] INFO org.apache.kafka.common.network.Selector - [Consumer clientId=consumer-test-group-2-1, groupId=test-group-2] Failed re-authentication with <cluster endpoint> (channelId=2147483645) (Unexpected handshake request with client mechanism OAUTHBEARER, enabled mechanisms are [])
2023-11-28T16:41:45.976673530Z [Thread-1] ERROR org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-test-group-2-1, groupId=test-group-2] Connection to node 2147483645 (<cluster endpoint>) failed authentication due to: Unexpected handshake request with client mechanism OAUTHBEARER, enabled mechanisms are []
2023-11-28T16:41:45.977670864Z org.apache.kafka.common.errors.IllegalSaslStateException: Unexpected handshake request with client mechanism OAUTHBEARER, enabled mechanisms are []
2023-11-28T16:41:46.776948991Z [Thread-1] ERROR org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-test-group-2-1, groupId=test-group-2] Connection to node 1 (<cluster endpoint>) failed authentication due to: [7f0b6b38-62f2-41a7-87df-f542194eb8b7]: Session too short

This issue does not occur if I remove awsRoleArn from sasl.jaas.config, but the re-authentication fails if I include it. Here's what the properties file looks like:

security.protocol=SASL_SSL
sasl.mechanism=OAUTHBEARER
sasl.jaas.config=org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule required awsRoleArn="<role arn>" awsStsRegion="<region>";
sasl.login.callback.handler.class=software.amazon.msk.auth.iam.IAMOAuthBearerLoginCallbackHandler

Kafka version: 3.6.0 aws-msk-iam-auth version: 2.0.0

jamielwhite commented 7 months ago

I enabled DEBUG logging, and I see the application re-authenticates successfully but with a decreasing session timeout each time, until it's too short:

2023-11-28T23:33:16.653079751Z [Thread-1] DEBUG org.apache.kafka.common.security.authenticator.SaslClientAuthenticator - [Consumer clientId=consumer-test-group-2-1, groupId=test-group-2] Finished authentication with session expiration in 898349 ms and session re-authentication on or after 836777 ms
2023-11-28T23:33:16.733016243Z [Thread-1] DEBUG org.apache.kafka.common.security.authenticator.SaslClientAuthenticator - [Consumer clientId=consumer-test-group-2-1, groupId=test-group-2] Finished authentication with session expiration in 898268 ms and session re-authentication on or after 796444 ms
2023-11-28T23:33:19.903645191Z [Thread-1] DEBUG org.apache.kafka.common.security.authenticator.SaslClientAuthenticator - [Consumer clientId=consumer-test-group-2-1, groupId=test-group-2] Finished authentication with session expiration in 895098 ms and session re-authentication on or after 763044 ms
2023-11-28T23:46:03.117576552Z [Thread-1] DEBUG org.apache.kafka.common.security.authenticator.SaslClientAuthenticator - [Consumer clientId=consumer-test-group-2-1, groupId=test-group-2] Finished re-authentication with session expiration in 131883 ms and session re-authentication on or after 113503 ms
2023-11-28T23:46:33.305042537Z [Thread-1] DEBUG org.apache.kafka.common.security.authenticator.SaslClientAuthenticator - [Consumer clientId=consumer-test-group-2-1, groupId=test-group-2] Finished re-authentication with session expiration in 101696 ms and session re-authentication on or after 89661 ms
2023-11-28T23:47:56.915128578Z [Thread-1] DEBUG org.apache.kafka.common.security.authenticator.SaslClientAuthenticator - [Consumer clientId=consumer-test-group-2-1, groupId=test-group-2] Finished re-authentication with session expiration in 18086 ms and session re-authentication on or after 16443 ms
2023-11-28T23:48:03.451621377Z [Thread-1] DEBUG org.apache.kafka.common.security.authenticator.SaslClientAuthenticator - [Consumer clientId=consumer-test-group-2-1, groupId=test-group-2] Finished re-authentication with session expiration in 11549 ms and session re-authentication on or after 9974 ms
2023-11-28T23:48:13.552575825Z [Thread-1] DEBUG org.apache.kafka.common.security.authenticator.SaslClientAuthenticator - [Consumer clientId=consumer-test-group-2-1, groupId=test-group-2] Finished re-authentication with session expiration in 1449 ms and session re-authentication on or after 1329 ms
2023-11-28T23:48:13.591081907Z [Thread-1] DEBUG org.apache.kafka.common.security.authenticator.SaslClientAuthenticator - [Consumer clientId=consumer-test-group-2-1, groupId=test-group-2] Finished re-authentication with session expiration in 1410 ms and session re-authentication on or after 1302 ms
...
2023-11-28T23:48:15.382982876Z [Thread-1] ERROR org.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-test-group-2-1, groupId=test-group-2] Connection to node 2147483645 (<cluster endpoint>) failed authentication due to: [18de9fc3-f444-4302-8487-1c38b629702c]: Session too short
sankalpbhatia commented 7 months ago

Thank you for raising this. Great details!

We will deep dive on this asap and get back.

jamielwhite commented 7 months ago

I updated our application to leave off awsRoleArn, and it shows a much longer expiration time (1 hour vs 15 minutes). So it's possible I didn't wait long enough to see if the issue occurs when awsRoleArn is not included in sasl.jaas.config. I'll keep it running with this configuration to see if there are any reductions in the session expiration.

2023-11-29T00:53:09.024953992Z [Thread-1] DEBUG org.apache.kafka.common.security.authenticator.SaslClientAuthenticator - [Consumer clientId=consumer-test-group-2-1, groupId=test-group-2] Finished authentication with session expiration in 3598977 ms and session re-authentication on or after 3228285 ms
2023-11-29T00:53:09.113288090Z [Thread-1] DEBUG org.apache.kafka.common.security.authenticator.SaslClientAuthenticator - [Consumer clientId=consumer-test-group-2-1, groupId=test-group-2] Finished authentication with session expiration in 3598888 ms and session re-authentication on or after 3194801 ms
2023-11-29T00:53:12.238079044Z [Thread-1] DEBUG org.apache.kafka.common.security.authenticator.SaslClientAuthenticator - [Consumer clientId=consumer-test-group-2-1, groupId=test-group-2] Finished authentication with session expiration in 3595763 ms and session re-authentication on or after 3284138 ms
sankalpbhatia commented 7 months ago

I was able to repro the issue locally, and I have a suspicion it might only be isolated to cases where you provide an awsRoleArn to the jaas config.

My hypothesis is that since STSAssumeRoleSessionCredentialsProvider used in https://github.com/aws/aws-msk-iam-auth/blob/main/src/main/java/software/amazon/msk/auth/iam/internals/MSKCredentialProvider.java holds onto aws credentials until sometime before expiry, it returns the same credentials when the ExpiringCredentialRefreshingLogin class looks for new credentials after 10 mins.

@jamielwhite were you able to get results for your last run (without passing awsRoleArn) ? If not, I will try to run a similar consumer app and observe results tomorrow.

jamielwhite commented 7 months ago

Yes, the same issue eventually happened without passing awsRoleArn.

sankalpbhatia commented 7 months ago

@jamielwhite we are trying a couple of things for this. Will share an update soon.

sankalpbhatia commented 7 months ago

@jamielwhite can you share the setup which led to this issue without using awsRoleArn? Would be helpful to know the client properties you used in that case, and what credentials you ended up using at that point? Also, can you also share the session expiration in those cases?

jamielwhite commented 7 months ago

The setup was the same with awsRoleArn excluded, and the credentials we used for that case had full kafka-cluster:* permissions. The session expiration was 1 hour (we're running Kubernetes pods and getting our credentials through a service account).

security.protocol=SASL_SSL
sasl.mechanism=OAUTHBEARER
sasl.jaas.config=org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule required;
sasl.login.callback.handler.class=software.amazon.msk.auth.iam.IAMOAuthBearerLoginCallbackHandler
sankalpbhatia commented 6 months ago

Hi @jamielwhite, I just pushed a fix for this issue. Would you be able to build a jar locally and verify if this works for you? We will have a release for this in the next few days.

jamielwhite commented 6 months ago

Thanks @sankalpbhatia! We've made it a few minutes past the 15 minute mark where it initially failed, so it looks like this is working. I also have a consumer running without awsRoleArn, so I'll let you know if we run into any issues there after an hour.

jamielwhite commented 6 months ago

This is unrelated to the issue, but does AWS have any plans to release a SASL signer library in Ruby now that it's supported in other languages like Python and Go?

sankalpbhatia commented 6 months ago

There are no plans to release a signer library in Ruby right now.

sankalpbhatia commented 6 months ago

Fixed in https://github.com/aws/aws-msk-iam-auth/releases/tag/v2.0.2

github-actions[bot] commented 6 months ago

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

jjatinggoyal commented 2 months ago

@sankalpbhatia someone wrote the ruby signer library: https://github.com/bruce-szalwinski-he/aws-msk-iam-sasl-signer-ruby. Please consider moving this to aws org.

More details are present in https://github.com/aws/aws-sdk-ruby/discussions/2985