Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API

Endless GET traces on ApplicationInsights #2919

Open Ettores88 opened 2 years ago

Ettores88 commented 2 years ago

We are experiencing strange behavior in our different environments even though they are configured identically. Thanks to ApplicationInsights we noticed traces like the following (screenshot of nested GET dependency traces):

The services create the CosmosClient as follows (DatabaseClientHelper is used to create a singleton of the Database client):

public static class DatabaseClientHelper
{
    public static Database CreateDatabaseClient(string uri, string authKey, string databaseName)
    {
        var cosmosClient = new CosmosClientBuilder(uri, authKey)
            .WithConnectionModeDirect()
            .WithRequestTimeout(TimeSpan.FromSeconds(10))
            .WithThrottlingRetryOptions(
                maxRetryWaitTimeOnThrottledRequests:TimeSpan.FromSeconds(10),
                maxRetryAttemptsOnThrottledRequests: 5)
            .WithCustomSerializer(new P0NewtonsoftSerializer())
            .Build();
        return cosmosClient.GetDatabase(databaseName);
    }
}

public class P0NewtonsoftSerializer : CosmosSerializer
{
    private readonly JsonSerializer _serializer;

    public P0NewtonsoftSerializer()
    {
        _serializer = JsonSerializer.Create(
            new JsonSerializerSettings()
            {
                Converters = new List<JsonConverter>() { new StringEnumConverter(), new MultiAggregateRootConverter() }
            });
    }

    public override T FromStream<T>(Stream stream)
    {
        using StreamReader reader = new StreamReader(stream);
        using JsonTextReader jsonReader = new JsonTextReader(reader);
        return _serializer.Deserialize<T>(jsonReader);
    }

    public override Stream ToStream<T>(T input)
    {
        MemoryStream stream = new MemoryStream();
        using StreamWriter writer = new StreamWriter(stream, leaveOpen: true);
        using JsonTextWriter jsonWriter = new JsonTextWriter(writer);
        _serializer.Serialize(jsonWriter, input);
        jsonWriter.Flush();
        stream.Seek(0, SeekOrigin.Begin);
        return stream;
    }
}

We extracted from AI how many traces of that GET we have in our environments. This is the Kusto query we used:

union isfuzzy=true dependencies
| where timestamp > ago(365d)
| where type in ("Azure DocumentDB")
| where name in ("GET")
| order by timestamp desc
| summarize count() by bin(timestamp, 1d)
| render timechart 

And these graphs are the results from our three development environments: (three timechart screenshots)

Our stage environment, where we deploy every two weeks; the last deploy was on 02/12, and only today did we start seeing so many GET traces: (screenshot)

This is from the production environment: (screenshot)

Could you help us understand the root cause of these different traces and how to configure our environments to prevent these endless GET traces?

j82w commented 2 years ago

Is the SDK configured to use gateway mode?

Ettores88 commented 2 years ago

Sorry, but I didn't get your question. We are in Direct mode; we initialize the client with:

var cosmosClient = new CosmosClientBuilder(uri, authKey)
    .WithConnectionModeDirect() // <---
    .WithRequestTimeout(TimeSpan.FromSeconds(10))
    .WithThrottlingRetryOptions(
        maxRetryWaitTimeOnThrottledRequests: TimeSpan.FromSeconds(10),
        maxRetryAttemptsOnThrottledRequests: 5)
    .WithCustomSerializer(new P0NewtonsoftSerializer())
    .Build();

Isn't this enough? Do other configurations need to be set?

j82w commented 2 years ago

Sorry, I missed that it was set to direct mode in the sample.

The Cosmos DB SDK should only make HTTP calls to initialize and update its caches. Are you by any chance doing a database or container operation, like a read or create-if-not-exists, on every item operation?

Ettores88 commented 2 years ago

No, we decided to have manual control over these resources. This helper is called at startup and registers the Database client as a singleton. The base class used to operate on the container obtains its Container reference in this way:

protected MyBaseClassRepository(
    Database db,
    string collectionName)
{
    _container = db.GetContainer(collectionName);
}

The repositories are InstancePerLifetimeScope with Autofac.

ealsur commented 2 years ago

Your code is creating one Client per database name, are you handling multiple databases in your application? Ideally you should have 1 client per account, the client can be used to interact with any database and container within the account.

Also, how frequently is the GET you are seeing in the screenshot? If it's every 5 minutes, the SDK has a background Task that refreshes the account information periodically. I don't know why App Insights renders it in a nested way, we don't have any App Insights specific code or references, this is completely how App Insights interprets the requests.

Ettores88 commented 2 years ago

I confirm that they are 5 minutes apart from each other. But why are they linked to the same operationId as my operation, and why does this behavior differ between environments?

Your code is creating one Client per database name, are you handling multiple databases in your application? Ideally you should have 1 client per account, the client can be used to interact with any database and container within the account.

Each microservice will manage only one database; this is the reason why we initialize a Database client at startup and register it as a singleton. From that Database client, each "Repository" gets its reference to the Container.

ealsur commented 2 years ago

I confirm that they are 5 minutes apart from each other. But why are they linked to the same operationId as my operation, and why does this behavior differ between environments?

This seems to be a behavior of App Insights. The background refresh is not related to any user operation; it is an independent Task that runs in the background and performs 1 HTTP GET every 5 minutes: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/Microsoft.Azure.Cosmos/src/Routing/GlobalEndpointManager.cs#L462

Each microservice will manage only one database; this is the reason why we initialize a Database client at startup and register it as a singleton. From that Database client, each "Repository" gets its reference to the Container.

Ideally you'd want the CosmosClient as the singleton. Your model works well because the application accesses one database, but if at some point it accesses two or more, you would be creating more clients than needed. If you keep the CosmosClient as the singleton, you can always do client.GetContainer(databaseName, containerName).
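For illustration, here is a minimal sketch of that shape, assuming Microsoft.Extensions.DependencyInjection rather than Autofac; the database and container names ("orders-db", "orders") are placeholders, not values from this thread:

using Microsoft.Azure.Cosmos;
using Microsoft.Azure.Cosmos.Fluent;
using Microsoft.Extensions.DependencyInjection;

public static class CosmosRegistrationSketch
{
    // Sketch only: one CosmosClient per account, registered as a singleton.
    public static IServiceCollection AddCosmos(this IServiceCollection services, string uri, string authKey)
    {
        services.AddSingleton(_ =>
            new CosmosClientBuilder(uri, authKey)
                .WithConnectionModeDirect()
                .WithCustomSerializer(new P0NewtonsoftSerializer())
                .Build());

        // GetContainer makes no network call; it returns a lightweight proxy, so
        // repositories can resolve Container instances cheaply from the shared client.
        services.AddScoped(sp =>
            sp.GetRequiredService<CosmosClient>().GetContainer("orders-db", "orders"));

        return services;
    }
}

With this shape, accessing a second database later only means another GetContainer call against the same client, not another client instance.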

j82w commented 2 years ago

Do all of the requests have the same activity id?

Ettores88 commented 2 years ago

Ideally you'd want the CosmosClient as the singleton. Your model works well because the application accesses one database, but if at some point it accesses two or more, you would be creating more clients than needed. If you keep the CosmosClient as the singleton, you can always do client.GetContainer(databaseName, containerName).

I know; I made this simplification because of an architectural choice. I wanted to hide the concept of the Cosmos account, on top of the Database, from the dev teams that will implement the repositories. But thanks for letting me know the best way to use it.

This seems to be a behavior of App Insights. The background refresh is not related to any user operation; it is an independent Task that runs in the background and performs 1 HTTP GET every 5 minutes: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/Microsoft.Azure.Cosmos/src/Routing/GlobalEndpointManager.cs#L462

From a client-side perspective, nothing has changed; we didn't update the ApplicationInsights SDK. And I don't think this could be related to a specific change on the server side. So my doubts remain:

Do you suggest opening an issue on the AI SDK to investigate it on their side?

Ettores88 commented 2 years ago

Do all of the requests have the same activity id?

No. Some operations have these loops inside, which end without any explicit reason. I've tried a new query to get a count grouped by operation id.

(screenshot: counts grouped by operation id)

ealsur commented 2 years ago

From a client-side perspective, nothing has changed; we didn't update the ApplicationInsights SDK

Did you update the Cosmos DB SDK? If not, then the Cosmos SDK behavior did not change either; this background refresh has been there since the beginning. If nothing on the client application changed and this suddenly started to appear in the AI UI, then something changed on the AI side that is now identifying this activity as related to user requests, when it's not.

Ettores88 commented 2 years ago

We updated our Cosmos SDK on 2021-11-30 (but from the traces the problem was present way before that) and the ApplicationInsights SDK on 2020-12-18 (I know, it is time to update). So the updates aren't the problem. All our environments have almost the same creation date and the same configuration.

It seems that there is no obvious reason why we experience these different behaviors.

I could simply filter out these diagnostics with an ITelemetryProcessor, but it worries me that I haven't understood the root cause of this.

ealsur commented 2 years ago

When the SDK initializes (this happens on the very first request), that's when the background worker starts. It is easy to identify these requests because they are HTTP GETs to the / of the account. So if the account is myaccount.documents.azure.com, you'd see HTTP GET to https://myaccount.documents.azure.com:443/ (no other path in the URL).

The only logical explanation is that AI is, in their engine, correlating the first ever request of a new client instance with the kickoff of the background worker and then it correlates all future executions of the worker with that initial first request.

One way to rule that out would be to do a dummy request when you create the client (outside of your hot path), or to use the client optimization that warms up the client before any requests (https://devblogs.microsoft.com/cosmosdb/improve-net-sdk-initialization/).

If doing any of those things gets rid of the ever-growing nested requests, then it confirms AI is correlating the background worker work with the first operation.
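For reference, a rough sketch of both options; the endpoint, key, and the database/container names below are placeholders, not values from this thread:

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Microsoft.Azure.Cosmos.Fluent;

public static class WarmupSketch
{
    // Option 1: issue a dummy request right after building the client, outside the hot path,
    // so the background account refresh is not started inside a tracked user request.
    public static async Task<CosmosClient> BuildAndPingAsync(string uri, string authKey)
    {
        CosmosClient client = new CosmosClientBuilder(uri, authKey)
            .WithConnectionModeDirect()
            .WithCustomSerializer(new P0NewtonsoftSerializer())
            .Build();

        await client.GetDatabase("my-database").ReadAsync(); // placeholder database name
        return client;
    }

    // Option 2: let the SDK pre-warm caches and connections for a known list of containers.
    public static Task<CosmosClient> BuildInitializedAsync(string uri, string authKey)
    {
        return CosmosClient.CreateAndInitializeAsync(
            uri,
            authKey,
            new List<(string databaseId, string containerId)> { ("my-database", "my-container") },
            new CosmosClientOptions
            {
                ConnectionMode = ConnectionMode.Direct,
                Serializer = new P0NewtonsoftSerializer()
            });
    }
}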

Ettores88 commented 2 years ago

I've experimented a bit with one of our microservices in a DEV environment. The dummy request seems to work. I still don't understand why production is different. The instances are always created by a real request.

But if this could solve the problem, I will have to study how we could implement CreateAndInitializeAsync. This API wants the list of containers, but we create the client much earlier and the containers used aren't known a priori. Our central component, the one provided to the different teams, is what registers the Database client; they use that instance to get a reference to the Container in their repositories. So a good amount of work is needed.

CreateAndInitializeAsync is also not available via the fluent CosmosClientBuilder, so I would have to drop the fluent initialization, which is much more elegant.

I need more time to be sure that this is really the solution.

ealsur commented 2 years ago

The instances are always created by a real request.

I think that is the culprit. The instance is created on demand (maybe you are using DI?) during a real request (maybe one coming into your API/web project). App Insights automatically tracks all incoming requests to your app, along with their time/performance/metrics/dependencies. That real request is instantiating your CosmosClient, performing the initialization of the client, starting the background refresh thread, and then performing whatever operation is needed to serve that real request. Since AI tracks all activity automatically, it seems to be detecting the start of the background refresh as part of the real request, and since this background refresh never really ends (it keeps happening every 5 minutes), it ties all future refreshes to the original real request's telemetry. That is why it is tracking them this way.

Ettores88 commented 2 years ago

Sorry for the long wait. After some time I have more information. I confirm that, as proposed, creating a dummy request detaches all of these "GET"s from the functional requests served by the microservices. This doesn't mean that these dependencies aren't tracked: ApplicationInsights will still trace the "GET" even if it is performed outside of a request context; each one will simply have a new operation id.

Moving from version 3.15 to 3.23 we experienced a drastic increase in "GET" dependencies, even in PRODUCTION. Before the new release we had seen fewer traces. I haven't deepened my understanding of the linked section of code, but I can see that a lot of work was done in that class between 3.15 and the current 3.23.

Right now everything is registered with DI and the resources are created only when a request comes in, and using CreateAndInitializeAsync would require a lot of rework, so we will probably introduce a filter to get rid of these traces manually (a sketch follows below). The fact that the traces end up in the logs anyway, even with different operation ids, could make analysis harder by creating background noise.

But I understand that this is necessary behavior for the Cosmos SDK, and for some reason someone might want these traces when debugging unexpected scenarios (so it is correct that the ApplicationInsights SDK traces them too). So I think this is a problem specific to us. From my point of view the issue could be closed, unless you want to investigate how the experience could be improved for the end user.
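For completeness, here is a rough sketch of such a filter. The type/name matching mirrors the Kusto query earlier in the thread ("Azure DocumentDB" / "GET"); which telemetry field carries the URL (Data vs. Target) should be verified against your own data before relying on something like this:

using System;
using Microsoft.ApplicationInsights.Channel;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.ApplicationInsights.Extensibility;

public class CosmosAccountRefreshFilter : ITelemetryProcessor
{
    private readonly ITelemetryProcessor _next;

    public CosmosAccountRefreshFilter(ITelemetryProcessor next) => _next = next;

    public void Process(ITelemetry item)
    {
        // Drop only the periodic account-refresh GETs, i.e. GET dependencies whose URL
        // has no path beyond the account root (https://myaccount.documents.azure.com:443/).
        if (item is DependencyTelemetry dependency
            && dependency.Type == "Azure DocumentDB"
            && dependency.Name == "GET"
            && Uri.TryCreate(dependency.Data, UriKind.Absolute, out Uri uri)
            && uri.AbsolutePath == "/")
        {
            return; // swallow this telemetry item
        }

        _next.Process(item);
    }
}

In ASP.NET Core this can typically be registered with services.AddApplicationInsightsTelemetryProcessor<CosmosAccountRefreshFilter>().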

YohanSciubukgian commented 2 years ago

Having all those logs has cost implications. Could we have a way to limit the logs (a LogLevel applied to the Cosmos DB SDK via configuration?) so that end users only get the valuable ones?

ealsur commented 2 years ago

@YohanSciubukgian which logs are you referring to? The SDK produces Event Traces with different log levels; users can configure their Event Listeners to listen for a particular level. AI tracing/logging HTTP GET operations might be a different thing: those are not logs, they are background requests that are expected to happen, and if AI is logging them, I don't know whether it allows filtering them out. My guess is that this is by-design behavior to log all HTTP requests.
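As a very rough illustration of the Event Listener side (this is not a documented Cosmos-specific API; the name check below is only a heuristic placeholder, so inspect the actual EventSource names in your own process before using anything like this):

using System;
using System.Diagnostics.Tracing;

public sealed class WarningLevelListenerSketch : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        // Heuristic placeholder: enable only sources whose name mentions "Cosmos",
        // and only at Warning level and above.
        if (eventSource.Name.IndexOf("Cosmos", StringComparison.OrdinalIgnoreCase) >= 0)
        {
            EnableEvents(eventSource, EventLevel.Warning);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        Console.WriteLine($"[{eventData.Level}] {eventData.EventSource.Name}: {eventData.Message}");
    }
}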

YohanSciubukgian commented 2 years ago

@ealsur Thank you for the detailed answer. You answered my question. I was wondering if we could configure the log level for event traces. Thanks 😊

Barsonax commented 2 months ago

I am still seeing this. I think the problem is that the parent id and operation id are being captured by the Cosmos DB refresh check, which causes the logs to be correlated forever. This really messes up the transaction view in App Insights: things that took only a few minutes are now displayed as if they took hours or days because of this correlation, with a lot of additional noise.

I think the fix should be for the refresh to start its own operation instead of capturing the context it is in, since the refresh calls (especially after the first one) are not logically part of whatever started the loop.

I also don't think this issue has anything to do with App Insights, as I see it locally in the Aspire dashboard as well (we are using OpenTelemetry).

andrasg commented 2 months ago

I ran into this issue as well, using the OpenTelemetry instrumentation. My CosmosClient is injected using DI when needed (typically on the first inbound API call). This makes all the 5-minute refresh activities inherit the original trace-id (correlation id) and skews the telemetry very much.

Perhaps the SDK would need to disconnect the Activity for a refresh cycle from its parent by calling activity.SetParentId()?

andrasg commented 2 months ago

Adding a potential master issue: https://github.com/open-telemetry/opentelemetry-dotnet/issues/984

The OpenTelemetry dotnet SDK has some example code working around the issue:

https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/trace/links-creation-with-new-activities
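Roughly, the workaround in those docs starts a new root Activity for the background work and keeps only a link back to whatever Activity happened to be current. A sketch of that pattern (the source name "MyApp.Background" is illustrative):

using System;
using System.Diagnostics;

public static class DetachedActivitySketch
{
    private static readonly ActivitySource Source = new ActivitySource("MyApp.Background");

    // Runs `work` under a new root activity that is linked to, but not parented by,
    // the ambient activity, so recurring background work does not inherit the trace-id
    // of whichever request happened to start it.
    public static void RunDetached(string name, Action work)
    {
        Activity ambient = Activity.Current;
        ActivityLink[] links = ambient is null
            ? Array.Empty<ActivityLink>()
            : new[] { new ActivityLink(ambient.Context) };

        Activity.Current = null; // detach so the new activity becomes a root
        using (Activity activity = Source.StartActivity(
            name, ActivityKind.Internal, parentContext: default, links: links))
        {
            work();
        }

        Activity.Current = ambient; // restore the caller's ambient context
    }
}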