Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API
MIT License
740 stars 494 forks source link

Hooks for detecting client side latency / high CPU #1520

Open arekpalinski opened 4 years ago

arekpalinski commented 4 years ago

We'd like to investigate the ability to introduce detection of client side latency to our profiling tool.

The idea is that we could check the latency before an operation called by a user (e.g. UpsertItemAsync) and then when the task is actually starting to execute.

If the client machine is busy / has high CPU then we could provide an alert to a user. So they would be aware that issue is on the client side.

I already asked about IClientSideRequestStatistics.IsCpuOverloaded in https://github.com/Azure/azure-cosmos-dotnet-v3/issues/1450 but it doesn't seem to be sufficient.

I'd like to have ability to subscribe to actions performed via the cosmos client. It doesn't have to be public - it could be internal hooks. I'd be happy to create a PR but first I want to check the possibility of introducing something like that. Also would be nice to get some guidance to make the PR to be accepted.

arekpalinski commented 4 years ago

The hooks would be also helpful when providing to a user the stacktrace of a method calling the cosmos db operation:

image

If an action is scheduled just using Task.Run():

https://github.com/Azure/azure-cosmos-dotnet-v3/blob/46e8acabf6d7ec85ea630008a68c4856a2924938/Microsoft.Azure.Cosmos/src/TaskHelper.cs#L96

Then we can't provide the stacktrace of original caller and the actual line where the operation was called.

j82w commented 4 years ago

Hi @arekpalinski ,

Have you looked into using handlers? All user operations will go through the handler, but some caches still haven't been updated to use the new pipeline. This should provide you with the capability to log all the different operations. Does this work for your scenario?

arekpalinski commented 4 years ago

Handlers look very nice. We'd need to add a custom handler on the fly to an already initialized instance of CosmosClient since the profiling is initialized as follow:

HibernatingRhinos.Profiler.Appender.CosmosDB.CosmosDBProfiler.Initialize(cosmosClient);

where cosmosClient is CosmosClient instance. But that doesn't seem to be an issue I think. So far so good.

Although that wouldn't still work with Task.Run() calls that are called in TaskHelper e.g. when running on classing ASP.NET where SynchronizationContext.Current is AspNetSynchronizationContext:

https://github.com/Azure/azure-cosmos-dotnet-v3/blob/46e8acabf6d7ec85ea630008a68c4856a2924938/Microsoft.Azure.Cosmos/src/TaskHelper.cs#L90-L96

Those calls can be still waiting in the thread pool queue.

j82w commented 4 years ago

That's a good point. The CosmosDiagnostics.GetClientElapsedTime should include the thread pool queue time. I'll create a backlog item to fix this.

arekpalinski commented 4 years ago

Thanks! Would the thread pool queue time be an additional scope in CosmosDiagnosticsContext?

Do you have any idea / suggestion how to gather the original user's stacktrace then? Once Task.Run() is called we loose the ability to show to user his code which spawned a Cosmos operation.

j82w commented 4 years ago

One option is with PR #1489 a custom property can be added to correlate to request to the handler call.

arekpalinski commented 4 years ago

My problem is that the profiler usage should be transparent to a user and should not force the user to make any changes in this code. In other words I don't see a place I could gather the stacktrace before Task.Run() is called here: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/46e8acabf6d7ec85ea630008a68c4856a2924938/Microsoft.Azure.Cosmos/src/TaskHelper.cs#L96

Although I see that you're introducing ClientContextCore.OperationHelperAsync in https://github.com/Azure/azure-cosmos-dotnet-v3/pull/1555

Is there any chance to add some event there that I could subscribe to: https://github.com/Azure/azure-cosmos-dotnet-v3/pull/1555/files#diff-b0b1db9824d109ee3f143ff104894c19R190

I think is would be exactly what I'm looking for - just before TaskHelper.RunInlineIfNeededAsyn so before calling Task.Run() under the covers.

Note, I doesn't have to be part of the public API - I'll manage to access it once it's internal with the magic of IgnoresAccessChecksToAttribute and compile it dynamic on the fly.

arekpalinski commented 4 years ago

@j82w Did you have a chance to look at my last comment?

j82w commented 4 years ago

I did some testing, and the Task.Run should only be used in places where there are is a custom synchronization context which we've mainly seen in older ASP.NET applications.

The only way solution I found around this is do a custom wrapper around Container like Encryption SDK is doing.

j82w commented 4 years ago

In 3.10.1 release CPU usage should now be included all the diagnostics, but it is a best effort scenario. If getting the CPU fails it will stop trying to avoid impacting users.