Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API
MIT License

[Internal] Adding Client Telemetry #2323

Closed. sourabh1007 closed this 2 years ago

sourabh1007 commented 3 years ago

Is your feature request related to a problem? Please describe.
A client requested a request charge API for CosmosException (ref. https://github.com/Azure/azure-sdk-for-java/issues/13488).

Describe the solution you'd like
Develop a telemetry module that periodically collects and saves certain kinds of metrics. It has already been developed for the Java SDK (ref. https://github.com/Azure/azure-sdk-for-java/pull/16822) and needs to be implemented for the .NET SDK as well.

Describe alternatives you've considered
There is no alternative for this.

Additional context
Client telemetry capture is off by default (the user can turn it on via the builder API).

It contains the aggregations below.

The SDK will collect and store the following information:

Field | Description
timeStamp | Time of capture
clientId | Unique client identifier
processId | Host machine identifier
userAgent | User agent containing the SDK and machine OS version
connectionMode | Direct vs. Gateway
globalDatabaseAccountName | User account name
applicationRegion | Region of the user's host, if running in Azure
hostEnvInfo | Host machine information
acceleratedNetworking | Whether the host is an Azure VM with accelerated networking
systemInfo | CPU/memory/RNTBD connection info
cacheRefreshInfo | Client cache refresh information
operationInfo | CRUD/query operation aggregation information

The JSON would look like this: (four images, not reproduced here)
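
As a rough illustration only, the top-level fields listed above could be modeled as a single record and serialized to JSON. This is a minimal sketch, not the SDK's actual schema: the field names come from the list above, while the class name `ClientTelemetryRecord`, the helper `to_json`, every type, and every sample value are assumptions made for the example.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any, Dict, List, Optional


@dataclass
class ClientTelemetryRecord:
    """Hypothetical container for the fields named in the issue.

    Field names follow the issue text; all types are assumptions.
    """
    timeStamp: str
    clientId: str
    processId: str
    userAgent: str
    connectionMode: str                      # "Direct" or "Gateway"
    globalDatabaseAccountName: str
    applicationRegion: Optional[str] = None  # only known when hosted in Azure
    hostEnvInfo: Optional[str] = None
    acceleratedNetworking: Optional[bool] = None
    # The three aggregate sections are kept as free-form lists here,
    # because their inner layout is not spelled out in the text.
    systemInfo: List[Dict[str, Any]] = field(default_factory=list)
    cacheRefreshInfo: List[Dict[str, Any]] = field(default_factory=list)
    operationInfo: List[Dict[str, Any]] = field(default_factory=list)


def to_json(record: ClientTelemetryRecord) -> str:
    """Serialize the record to an indented JSON string."""
    return json.dumps(asdict(record), indent=2)


# Placeholder values, invented purely for illustration.
example = ClientTelemetryRecord(
    timeStamp="2021-03-17T10:00:00Z",
    clientId="client-0001",
    processId="process-0001",
    userAgent="example-user-agent",
    connectionMode="Direct",
    globalDatabaseAccountName="example-account",
)
print(to_json(example))
```

Keeping `systemInfo`, `cacheRefreshInfo`, and `operationInfo` as lists leaves room for one entry per metric or per operation, which is one plausible way to hold the aggregations.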

j82w commented 3 years ago

Client requested for request charge API for CosmosException (ref. Azure/azure-sdk-for-java#13488)

This is already done in the .NET SDK: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/f16153b98627890f3328b3c4b3d2d3c2265f22cf/Microsoft.Azure.Cosmos/src/Resource/CosmosExceptions/CosmosException.cs#L97

Can you provide an example with values for the information being sent back?

HostEnvInfo // What information is included in this?
AcceleratedNetworking // How can this be detected?
GreaterThan1Kb // Is this a percentage? Count? Bool? Is it for both requests and responses?
Consistency // Is this the SDK consistency? Is it null if they just use the account default?

// Is this all for latency?
MetricsName
UnitName
Count
Mean
Min
Max
Percentile50
Percentile 90
Percentile 95
Percentile 99
Percentile 999
JsonColumn

sourabh1007 commented 3 years ago

I have updated the schema in the main issue along with sample JSON.

kirankumarkolli commented 3 years ago

A few clarifications on the wire format:

  1. Nesting percentiles: Why are they nested?
  2. Greaterthan1KB: wouldn't making it a text histogram be more useful? Or, in other words, which scenarios would a boolean address?
  3. RegionsContacted: would a per-region breakdown be a better alternative? (For example, what % of requests were routed to other regions.)

Eventually, is the client expected to post metrics to the current one in the configured PreferredList?
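
To make the alternatives raised in points 2 and 3 concrete, here is a small sketch of what a size histogram and a per-region share could look like next to a plain boolean. Everything in it is hypothetical: the bucket boundaries, the labels, and the function names `size_bucket`, `size_histogram`, and `region_share` are assumptions for illustration, not part of any SDK.

```python
from collections import Counter

# Assumed bucket boundaries (upper bound in bytes, label); not from the issue.
SIZE_BUCKETS = [
    (1024, "<=1KB"),
    (10 * 1024, "<=10KB"),
    (100 * 1024, "<=100KB"),
]


def size_bucket(size_bytes):
    """Return the label of the first bucket whose upper bound covers the size."""
    for upper, label in SIZE_BUCKETS:
        if size_bytes <= upper:
            return label
    return ">100KB"


def size_histogram(sizes):
    """Count payloads per bucket, instead of a single greater-than-1KB flag."""
    return dict(Counter(size_bucket(s) for s in sizes))


def region_share(regions):
    """Percentage of requests served by each contacted region."""
    counts = Counter(regions)
    total = sum(counts.values())
    return {region: 100.0 * n / total for region, n in counts.items()}


sizes = [100, 2048, 500]
regions = ["East US", "East US", "West US", "East US"]
print(size_histogram(sizes))
print(region_share(regions))
```

A boolean only answers "was this above 1 KB?", whereas the histogram keeps the distribution and the per-region share shows how much traffic left the preferred region.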

sourabh1007 commented 3 years ago
  1. It represents, out of 100 requests, how many completed within a given time. For example: 99.9: 63.93, 90.0: 39.18, 95.0: 63.93, 99.0: 63.93. This means that 99.9% of requests took less than 63.93 sec to be processed, 90.0% of requests took less than 39.18 sec, and so on. That's why they are nested.
  2. This particular field was requested by the "Monitoring Team". They just wanted to know whether the response size was less than 1 KB or not.
  3. Here we will just provide the information about the regions contacted. Any kind of analytics can be done in Kusto, as this information will be available there.
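
The nested percentile map described in point 1 can be sketched as follows. This is an assumption-laden illustration, not the SDK's implementation: it uses the nearest-rank method, and the names `percentile` and `summarize` and the output layout are invented for the example.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies):
    """Aggregate raw latencies into count/min/max/mean plus a nested
    percentile map keyed like "99.9", mirroring the example in point 1."""
    ordered = sorted(latencies)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": sum(ordered) / len(ordered),
        "percentiles": {
            str(p): percentile(ordered, p)
            for p in (50.0, 90.0, 95.0, 99.0, 99.9)
        },
    }


# 100 made-up latency samples: 1, 2, ..., 100
summary = summarize(range(1, 101))
print(summary["percentiles"])
```

Grouping the percentiles under one key keeps them together as a single distribution summary, separate from the scalar fields such as count and mean.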