Azure / azure-sdk-for-go

This repository is for active development of the Azure SDK for Go. For consumers of the SDK we recommend visiting our public developer docs at:
https://docs.microsoft.com/azure/developer/go/
MIT License
1.63k stars 832 forks source link

azkeys: each operation always emits 2 traces, the first one being a 401 failure #23480

Open ItalyPaleAle opened 3 weeks ago

ItalyPaleAle commented 3 weeks ago

Bug Report

I have an app that makes calls to Azure Key Vault's data plane using the azkeys SDK. These calls include wrapping and unwrapping keys using key material stored in the AKV.

Calls to AKV are authenticated using tokens issued by Azure AD. AKV is configured to use Azure RBAC for authz.

I have enabled tracing using OpenTelemetry, through the official "tracing/azotel" SDK.

When I look at the traces, I see that every time my app makes a request to AKV, there are 2 traces:

The operation overall completes successfully, and I understand by reading from issues such as https://github.com/Azure/azure-sdk-for-net/issues/4914 that the 401 failures are expected and are due to how data plane authentication for AKV works.

However, while the SDK itself hides the exceptions so they are not logged, it does not hide the traces containing the failed requests. Because of that, when I look at the traces for my app, I see a lot of failures which are "false positives" and cause noise.

Screenshot 2024-09-23 at 07 56 52
jhendrixMSFT commented 3 weeks ago

@lmolkova should we be suppressing the intermediate HTTP traces that are part of challenge auth?

lmolkova commented 3 weeks ago

Traces show the reality - the first request fails and that's why the second one is made. If it happens on every request, it's a good indication that SDK does not pre-populate cached token on the Authentication header.

Instead of hiding it from users (including additional source of latency and possible SDK bugs) we should optimize the SDK and make it reuse tokens. We'd reduce app latency and load on the backend service.

ItalyPaleAle commented 3 weeks ago

If it happens on every request

It does appear on every request, at least in my app.

I agree making it re-use tokens would be good.

I agree also that hiding the trace may be confusing. However, if the error is expected, could there be a way to not make it look like a failed request? If there's a failure in one trace, tools could highlight the error as a false positive, causing noise (and possibly incorrect alerts)

lmolkova commented 3 weeks ago

could there be a way to not make it look like a failed request?

yes, that should be possible. Core tracing policy needs some extra context from the upstream that would help it classify something as not an error.

E.g. bearer token policy can express that 401 (on the first attempt) is not an error. Methods like createIfNotExists in the client libs can express that 409 or similar is not an error.

However this is a tricky place. Application developers should look into the logical spans/metrics to get the failure rate (after all auth/tries) - there are always tons of transient issues in the high-scale network applications and if you look at them you see too much noise.

lmolkova commented 3 weeks ago

That's the .NET API that helps with classification that's reused between tracing and logs https://learn.microsoft.com/en-us/dotnet/api/azure.core.responseclassifier?view=azure-dotnet