Azure / azure-sdk-for-go

This repository is for active development of the Azure SDK for Go. For consumers of the SDK we recommend visiting our public developer docs at:
https://docs.microsoft.com/azure/developer/go/
MIT License
1.64k stars 844 forks source link

Improve DefaultAzureCredential diagnosability #15923

Closed chlowell closed 2 years ago

chlowell commented 3 years ago

DefaultAzureCredential only adds to its chain the credentials it can instantiate, and surfaces instantiation failures only when it can't instantiate any credential. That never happens because AzureCLICredential is at the end of the chain. This means that DefaultAzureCredential never surfaces problems with environment configuration or managed identity. Its constructor can always instantiate AzureCLICredential, so it never returns an error, and errors from DefaultAzureCredential.GetToken() never mention problems with environment configuration or managed identity because the corresponding credentials aren't part of the chain when such problems exist.

sadasant commented 2 years ago

I think it should be enough to return errors if there are errors at the end of NewDefaultAzureCredential: https://github.com/Azure/azure-sdk-for-go/blob/main/sdk/azidentity/default_azure_credential.go#L76-L85

How does that sound? I could be missing something šŸ¤”

chlowell commented 2 years ago

I worry that would imply the credential isn't useable. I expect applications to do this:

cred, err := NewDefaultAzureCredential(nil)
if err != nil {
    // assume cred is invalid

What I'm looking for here is more detailed log output when DefaultAzureCredential.GetToken fails. Every component of the default chain should be represented so a user who expected e.g. EnvironmentCredential to work gets some hint about why it didn't.

sadasant commented 2 years ago

Iā€™ll explore some ideas next week before I make a PR! Iā€™ll continue to comment here if I find anything interesting.

sadasant commented 2 years ago

I do believe that it is natural in Go to have good results besides errors. This is a fundamental part of the language that should be used as an advantage.

sadasant commented 2 years ago

AWS has custom error types with partial results: https://github.com/aws/aws-sdk-go/blob/main/service/s3/s3manager/upload.go#L44-L52

I donā€™t see a clear guide stating that valid results should not be sent in case of partial errors.

I will focus my efforts on logging for now in any case.

chlowell commented 2 years ago

The error message returned from GetToken is also important, probably more so than logging.

I still think we shouldn't return a non-nil error when the credential might work. Doing so would burden developers with figuring out whether the error matters. I imagine that would involve logic like "if the application is deployed to a VM, ignore the error unless it's about managed identity", which erodes the value of DefaultAzureCredential.

sadasant commented 2 years ago

Letā€™s say DefaultAzureCredential is deployed in a VM and it fails, checking the right error should be as easy as if errors.Is(err, ManagedIdentityCredentialError) {.

Moreover, without a concrete error for success cases, users wonā€™t be able to deploy programs reacting to specific authentication methods if they deploy to environments where more than one authentication is available.

Letā€™s say they deploy a DefaultAzureCredential to a VM where they also configured the environment variables used for EnvironmentCredential. They might not have intended to do this, but they will be left with a credential potentially authenticated with the wrong account. In some cases, I believe they might want to keep this setup; then their only recourse is to switch to the individual credentials.

If thatā€™s where we want to go (forcing developers to use individual credentials if DefaultAzureCredential is not enough for them), then thatā€™s ok, of course.

chlowell commented 2 years ago

Letā€™s say DefaultAzureCredential is deployed in a VM and it fails, checking the right error should be as easy as if errors.Is(err, ManagedIdentityCredentialError) {.

Well, there's no ManagedIdentityCredentialError. But if there were, I think expecting customers to handle spurious constructor errors makes DefaultAzureCredential too painful to use. This:

cred, err := NewDefaultAzureCredential(nil)
if err != nil {
if deployedToVM {
if errors.Is(err, ManagedIdentityCredentialError) {
// do something (what should an application do?)
}
if !errors.Is(err, EnvironmentCredentialError) {
log.Println("someone mistakenly set environment variables")
cred, err = NewManagedIdentityCredential(nil)
...
} else {
// do something else (what?)
}
}
}

is tantamount to this, which I would recommend instead:

if deployedToVM {
cred, err := NewManagedIdentityCredential(nil)
if err != nil {
// do something
}
} else {
cred, err := NewEnvironmentCredential(nil)
if err != nil {
// do something else
}
}

And we're begging the question of whether recovery is possible, though it's moot at construction anyway because constructors rarely return errors. Usually you learn your credential has a problem when one of your clients asks for a token.

If thatā€™s where we want to go (forcing developers to use individual credentials if DefaultAzureCredential is not enough for them), then thatā€™s ok, of course.

I want to go there. DefaultAzureCredential doesn't serve all purposes. If it doesn't serve yours, that's okay, please use something else šŸ˜„

sadasant commented 2 years ago

Iā€™m always torn by DefaultAzureCredential šŸ˜….. Iā€™ll focus my PR on logging šŸ‘ (I thought the failure error message was ok but Iā€™ll also review it again)

chlowell commented 2 years ago

Environment variable configuration illustrates my complaint. When DefaultAzureCredential.GetToken fails, I want the error message to say something about each inner credential, to help developers debug the ones they care about. However, when environment configuration is incomplete, NewEnvironmentCredential() returns a nil pointer which doesn't go on the chain, so ChainedTokenCredential's error message says nothing about environment variables.

JeffreyRichter commented 2 years ago

DefaultAzureCredential (DAC) exists for users to more easily get started with our SDKs. Customers should not use it in production environments for the reason you mention above: in a deployment, the app should know exactly what credential it is using and whether it is successful at creating/using it.

During development, the "contract" of DAC is that it is successfully created if it can latch on to any credential so an error should only be surfaced if no valid credential can be found. If a credential is found and then a problem arises, then an error can/should be surfaced but there is practically nothing a customer can do with it at this point. They'd probably have to reconfigure something and then re-run the app. For debugging purposes, it would be nice to indicate what's wrong to help the customer but there is no runtime altered code path that makes sense.

sadasant commented 2 years ago

Customers should not use it in production environments

While I agree with that, I am not sure if thatā€™s the intention across languages. Itā€™s showcased as if it should be used on production environments, I believe.

sadasant commented 2 years ago

Thank you both for your help! I understand I should focus on logging. Iā€™ll try to have a PR out this week!

sadasant commented 2 years ago

So, what about something like this? https://github.com/sadasant/azure-sdk-for-go/commit/68a1d0bd47a87e435899c299c5e749685a94438c

If the idea looks good, I would then polish it, add some tests and post a draft PR.

Please let me know if Iā€™m missing something.

chlowell commented 2 years ago

That gets "DefaultAzureCredential" into the message, which helps but doesn't address "missing" credentials. This is the error message I get today from DefaultAzureCredential.GetToken:

Chained Token Credential: Managed Identity Credential: IMDS token request timed outAzure CLI Credential: ERROR: Please run 'az login' to setup account.

If I believed I had set environment configuration, I'd wonder how managed identity and the CLI enter into it. With this issue I'm after something similar to the Python SDK's error, which looks like this:

DefaultAzureCredential failed to retrieve a token from the included credentials.
Attempted credentials:
        EnvironmentCredential: Environment variables are not fully configured.
        ManagedIdentityCredential: No managed identity endpoint found.
        AzureCliCredential: Please run 'az login' to set up an account
JeffreyRichter commented 2 years ago

Can the error for EnvironmentCredential indicate exactly what env var name it is looking for? Anything to help customers self-diagnose any issues is a significant improvement in customer satisfaction.

chlowell commented 2 years ago

It could, but such a message is unwieldy when no variables are set because the credential requires one of 3 sets of overlapping variables. When we last reviewed errors across languages we decided to link to the troubleshooting guide instead (my Python example is pre-troubleshooting guide).

JeffreyRichter commented 2 years ago

The link is fine with me. Will you add it to the error message?

sadasant commented 2 years ago

(Thank you so much for the feedback! Iā€™m still catching up after resuming from the PTO. Iā€™ll give a more detailed answer soon)

sadasant commented 2 years ago

@JeffreyRichter yes, the idea is to use aka links to point people to the troubleshooting guide! I can make the troubleshooting guide for go as part of this effort (copying other languages) šŸ‘

sadasant commented 2 years ago

So, I think linking to the troubleshooting guidelines is something I can do right after resolving this specific issue. In that regard, I have a question:

Besides that, I have improved my previous attempt taking in consideration the example that @chlowell shared earlier in this thread: https://github.com/Azure/azure-sdk-for-go/compare/main...sadasant:fix15923?expand=1 How does this look? Iā€™ll ping Charles over teams as well.

chlowell commented 2 years ago

Is there an issue regrading whatā€™s missing to align Go with other languages on the usage of the troubleshooting guideline for Identity?

16796 tracks any content revisions to help support agents onboard. There's no issue tracking linking to the guide in error messages.

Besides that, I have improved my previous attempt taking in consideration the example that @chlowell shared earlier in this thread: https://github.com/Azure/azure-sdk-for-go/compare/main...sadasant:fix15923?expand=1 How does this look?

Looks reasonable but let's not add any public API, e.g. ChainedTokenCredentialOptions.ChainedCredentialName

sadasant commented 2 years ago

There's no issue tracking linking to the guide in error messages.

I have made a new issue! https://github.com/Azure/azure-sdk-for-go/issues/16843 If that helps šŸ™‚

sadasant commented 2 years ago

@chlowell I have made a PR šŸ™‚ https://github.com/Azure/azure-sdk-for-go/pull/16844