awslabs / aws-c-auth

C99 library implementation of AWS client-side authentication: standard credentials providers and signing.
Apache License 2.0

`aws_credentials_provider_new_chain_default` hangs without ever calling credentialsCallBack #185

Closed daniel347x closed 1 year ago

daniel347x commented 1 year ago

Issue:

The function aws_credentials_provider_new_chain_default hangs (its callback is never invoked) in the code sample that is shown below in a Kubernetes/EKS context.

Background:

The Kubernetes/EKS setup for obtaining credentials using IRSA requires STS::AssumeRoleWithWebIdentity functionality in conjunction with a proper EKS service account setup.

Additionally, EKS uses the same JWT method that web browsers do, so TLS with the credential authority needs to be set up as well, as the code below does, "as though" the EKS node were an end-user website, so to speak. For EKS, the JWT is stored not in a web browser's cache but in the running pods, at the following path: /var/run/secrets/eks.amazonaws.com/serviceaccount/token

(You don't need to do this for an EC2 instance that has been granted the same service account role outside of EKS.)

Additionally, notice the documentation for aws_credentials_provider_new_chain_default (from the source code at /include/aws/auth/credentials.h):

(Documentation for aws_credentials_provider_new_chain_default)

Creates the default provider chain used by most AWS SDKs. Generally: (1) Environment (2) Profile (3) STS web identity

...So this function claims to support STS web identity (3rd option). Even more so in that case, it should not hang.

In the code below, unfortunately this function IS hanging (in the sense that the credentialsCallback is never called).

I suspect the function is hanging for one of two reasons (I haven't done testing to determine which):

  1. The TLS data structure is not set up (it is just left as nullptr)
  2. The function does not actually support the STS::AssumeRoleWithWebIdentity functionality in this EKS context (despite its claim above)

In either scenario, the function should not hang. At the very least, it should invoke the callback with some kind of error message.

In particular, because the function supports the STS web identity method: if the problem is that the TLS structure is not set, the function should notice that the EKS JWT token is available on the system, detect that the TLS settings are not initialized properly, and invoke the callback with an error message such as "when EKS JWT tokens are in use, the TLS context in the provider chain options must be initialized."
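As a rough illustration of the kind of pre-flight check described above, here is a minimal sketch that can be run before building any provider. It assumes the pod has the environment variables that IRSA normally injects (AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_ARN). The helper name check_web_identity_env is hypothetical and is not part of aws-c-auth. It only inspects the environment and the token file, and says nothing about the TLS context.

```cpp
#include <cassert>
#include <cstdlib>
#include <fstream>
#include <string>

// Hypothetical helper (NOT an aws-c-auth API): returns an empty string when
// the web identity prerequisites look present, otherwise a short diagnostic.
// Assumption: the pod uses the IRSA-injected environment variables below.
std::string check_web_identity_env() {
    const char *token_file = std::getenv("AWS_WEB_IDENTITY_TOKEN_FILE");
    const char *role_arn = std::getenv("AWS_ROLE_ARN");

    if (token_file == nullptr || *token_file == '\0') {
        return "AWS_WEB_IDENTITY_TOKEN_FILE is not set";
    }
    if (role_arn == nullptr || *role_arn == '\0') {
        return "AWS_ROLE_ARN is not set";
    }

    // The token must be readable by the process that requests credentials.
    std::ifstream token(token_file);
    if (!token.good()) {
        return std::string("cannot open token file: ") + token_file;
    }
    return "";
}
```

Calling this at startup and printing any non-empty result to stderr would at least distinguish "the pod is not configured for web identity" from "the provider never called back".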

Here is a short code snippet showing both the bug and a working version (see the #if IS_USING_KUBERNETES block):

int EnclaveCommander::readCredentials()
{
    // ...
    struct aws_credentials_provider *provider_chain = nullptr;
    // ...
    auto cleanup = [&](){/* ... */};
    // ...
    struct aws_client_bootstrap_options bootstrap_options = {/* ... */};
    struct aws_client_bootstrap *bootstrap = aws_client_bootstrap_new(m_app_ctx.allocator, &bootstrap_options);
    // ...
    struct aws_credentials_provider_sts_web_identity_options chain_options = {.bootstrap=bootstrap /* ... */};
#if IS_USING_KUBERNETES == 1
    fprintf(stderr, "EKS context.\n");
    // This code block works. There are two changes:
    // 1. The TLS structure is set up.
    // 2. `aws_credentials_provider_new_sts_web_identity` is called
    //    rather than `aws_credentials_provider_new_chain_default`,
    //    because I know that `STS::AssumeRoleWithWebIdentity` is required,
    //    so there is no need to call the generic function (as is done below).
    struct aws_tls_ctx_options tls_options;
    aws_tls_ctx_options_init_default_client(&tls_options, m_app_ctx.allocator);
    chain_options.tls_ctx = aws_tls_client_ctx_new(m_app_ctx.allocator, &tls_options);
    chain_options.function_table = nullptr; // For mocking the http layer in tests, leave NULL otherwise
    provider_chain = aws_credentials_provider_new_sts_web_identity(m_app_ctx.allocator, &chain_options);
#else
    fprintf(stderr, "Human-at-the-keyboard context.\n");
    // This code exhibits the problem.
    // With `aws_credentials_provider_new_chain_default` called here,
    // the following call to `aws_credentials_provider_get_credentials` hangs
    // (i.e., its callback is never invoked).
    provider_chain = aws_credentials_provider_new_chain_default(m_app_ctx.allocator, &chain_options);
#endif
    // ...
    // This is the call that passes the callback that is never invoked.
    rc = aws_credentials_provider_get_credentials(provider_chain, credentialsCallBack, &m_app_ctx);
    // ...
    aws_mutex_lock(&m_app_ctx.mutex);
    // This is the call that hangs.
    // It hangs because the callback above is never invoked, so the wait condition is never satisfied.
    aws_condition_variable_wait_pred(&m_app_ctx.c_var, &m_app_ctx.mutex, credentialsPredicate, &m_app_ctx);
    aws_mutex_unlock(&m_app_ctx.mutex);
    // ...
    return cleanup();
}
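
The wait above blocks indefinitely if the callback never fires. One way to turn that hang into a diagnosable error is to bound the wait. The following is a minimal standard C++ sketch of that pattern, not the aws-c-common API. The CredentialsWaiter type and its member names are hypothetical. aws-c-common appears to offer a timed counterpart, aws_condition_variable_wait_for_pred, but its exact signature and time unit should be checked against the headers before relying on it.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the mutex / condition variable / flag kept in
// m_app_ctx. It illustrates a bounded wait; it is not aws-c-common code.
struct CredentialsWaiter {
    std::mutex mutex;
    std::condition_variable c_var;
    bool done = false;

    // Would be called from the credentials callback once it runs.
    void notify() {
        {
            std::lock_guard<std::mutex> lock(mutex);
            done = true;
        }
        c_var.notify_one();
    }

    // Returns true if the callback fired within the timeout, false otherwise,
    // so the caller can log an error instead of blocking forever.
    bool wait_for(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex);
        return c_var.wait_for(lock, timeout, [this] { return done; });
    }
};
```

With this shape, readCredentials could report something like "credentials callback not invoked within N seconds" and return an error code, which makes the failure visible even when no logs are available.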

TingDaoK commented 1 year ago

Can you please share some logs? You can enable logging by adding:

#include <aws/io/logging.h>

struct aws_logger_standard_options options = {
    .level = AWS_LL_TRACE,
    .file = stderr,
};
struct aws_logger logger;
aws_logger_init_standard(&logger, allocator, &options);
aws_logger_set(&logger);

Also, will it hang if you set up tls_ctx and still call aws_credentials_provider_new_chain_default instead?

daniel347x commented 1 year ago

Thanks again for looking into this!

(1) I attempted to enable logging, as you've indicated.

Unfortunately, though the code you provided to enable logging successfully builds, at runtime it segfaults. (Also, my version of the C++ compiler did not allow initialization of structs in the way your code does it, so it's a bit modified.)

Here is my logger initialization:

[image: logger initialization code, not preserved]

As noted, the above code builds, but at runtime I see the "Logging for AWS set up" message in the console, and then it segfaults.

Do you have any suggestions to prevent the segfault?

(2) I modified the code as you suggested (and as I knew you might ask). Here is my current code:

[image: current code, not preserved]

As you can see, I have (a) initialized a TLS struct; and (b) called 'aws_credentials_provider_new_chain_default', passing it the TLS struct.

Exactly the same hang occurs. The program simply hangs forever, at the same place I indicated previously.

Please let me know if there is anything else I can do to assist.

Thanks!

TingDaoK commented 1 year ago

Can you share the runtime segfault info, so that we can help find out why you get the segfault?

For your info, @waahm7 has set up the environment as you did, but we cannot reproduce the issue on our end. We verified that aws_credentials_provider_new_chain_default works well with sts_web_identity.

Since we cannot reproduce the issue on our end, and we don't have any logs to trace what is happening for you, we cannot do anything yet. So please provide the segfault info so that we can try to help you get the logs.

Here is my guess:

github-actions[bot] commented 1 year ago

Greetings! It looks like this issue hasn’t been active in longer than a week. We encourage you to check if this is still an issue in the latest release. Because it has been longer than a week since the last update on this, and in the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or add an upvote to prevent automatic closure, or if the issue is already closed, please feel free to open a new one.