Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API
MIT License
743 stars 496 forks source link

AI Integration: OTel CosmosDB Attributes to Collect #3058

Closed sourabh1007 closed 9 months ago

sourabh1007 commented 2 years ago

Status: Experimental Target SDK: .Net SDK

Operations To Cover

Database / Container Operations : Cosmos.CreateDatabaseAsync / Cosmos.CreateContainerAsync / Cosmos.DeleteStreamAsync Point Operations : Cosmos.CreateItemAsync / Cosmos.ReadItemAsync / Cosmos.UpsertItemAsync / Cosmos.ReplaceItemAsync / Cosmos.PatchItemAsync / Cosmos.DeleteItemAsync Stream Operations : Cosmos.CreateItemStreamAsync / Cosmos.UpsertItemStreamAsync / Cosmos.ReadItemStreamAsync / Cosmos.ReplaceItemStreamAsync / Cosmos.PatchItemStreamAsync / Cosmos.DeleteItemStreamAsync Batch Operations : TBD Bulk Operation : TBD Query Operations : Cosmos.Typed FeedIterator ReadNextAsync (for each page)

Attributes

AI Default Attributes:

Attributes Value
Event time 3/2/2022, 11:58:04.967 PM (Local time)  
Duration 278.0 ms  
Name Cosmos.DeleteStreamAsync  

Other default attributes are there e.g device details

Proposal For .Net Cosmos DB SDK

Custom Attributes:

Azure Specific Attributes
Attribute Value Comment
kind client IGNORE, By Default, setting them as part of diagnostic scope
az.namespace Microsoft.DocumentDB IGNORE, By Default, setting them as part of diagnostic scope
Common Database Attributes
Attribute Value Comment
db.system cosmosdb Open Telemetry Convention To identify type of Db ref. https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/database.md#notes-and-well-known-identifiers-for-dbsystem
db.name < Database Name >
db.operation ReadItemAsync, DeleteItemStreamAsync etc Database Operation ~Type~ Name
net.peer.name e.g. sourabhjaintemp Account Name + Cloud
Cosmos DB Specific
Account Level Information: Attribute Value Comment
db.cosmosdb.client_id Unique Client Id Combination of client id and machine id can tell us, if customer is following best practices to create singleton client
db.cosmosdb.machine_id Unique Machine Id
user_agent.original < User Agent With SDK version> Useful to identify the SDK version
db.cosmosdb.connection_mode Direct/Gateway go through

Request Level Information:

Attribute Value Comment
db.cosmosdb.container Container
db.cosmosdb.request_content_length_bytes Size of request payload
db.cosmosdb.response_content_length_bytes Size of response Payload
db.cosmosdb.status_code 201/200/204 Cosmos Db Http Status Code, it tells if particular cosmosdb call/request is passed/failed with which HttpStatusCode
db.cosmosdb.sub_status_code 1000/1002 Cosmos Db SubStatus Code
db.cosmosdb.request_charge < double type number > RU consumed for that operation
db.cosmosdb.regions_contacted Region Cosmos Db
db.cosmosdb.retry_count Number of retries
db.cosmosdb.operation_type Query/Read/Create
db.cosmosdb.item_count < int number> Number of items returned by the operation, only Feed Operation
db.cosmosdb.request_diagnostics < JSON String> ~Open Question: What if this string is out of the limit size? will appinsight will break it and divide into different attributes?~ Generating it as event
db.cosmosdb.activity_id Guid Unique Id for the operation and can be helpful to debug particular operation in backend logs
db.cosmosdb.correlated_activity_id Guid It will be populated only in case of query operation to allow correlating query pages retrieved for the same multi-page or cross-partition query.
db.cosmosdb.batch_operations string Comma separated list of operation type and count, only for batch operations
Open Telemetry Standard for any Exception
Attribute Value Comment
exception.type java.net.ConnectException; ``OSError ref. https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/exceptions.md
exception.message Division by zero; Can't convert 'int' object to str implicitly ref. https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/exceptions.md
exception.stacktrace Exception in thread "main" java.lang.RuntimeException: Test exception\n at com.example.GenerateTrace.methodB(GenerateTrace.java:13)\n at com.example.GenerateTrace.methodA(GenerateTrace.java:9)\n at com.example.GenerateTrace.main(GenerateTrace.java:5) ref. https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/exceptions.md

Following Open Telemetry Conventions (Note: Status of below conventions is Experimental): 1. https://github.com/Azure/azure-sdk/blob/main/docs/tracing/distributed-tracing-conventions.yml

  1. https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/database.md
  2. https://opentelemetry.io/docs/reference/specification/common/attribute-naming/
FabianMeiswinkel commented 2 years ago

Few comments:

sourabh1007 commented 2 years ago

Connection Mode - Db URL is actually just an Account Name, So you can't actually identify if it is direct or gateway mode by this url. User Agent : user agent string can be large but it can not be very large (as we are limiting the size due to other constraint we have). It is something already generated at the time of client instance creation So overhead is just to put it into dictionary.

Added list of Default Information

lmolkova commented 2 years ago

Thank you for summarizing it!

I have a couple of comments:

  1. Http attributes: we should not use HTTP attributes on DB calls. If there are nested HTTP calls, they will be traced automatically and will have all required attributes. So I'd suggest removing below attributes. If it's really helpful to have them and they apply to DB call, we should populate them with db prefix and pick a right name

    • http.status_code
    • http.user_agent
    • http.request_content_length
    • http.response_content_length
  2. db.url - based on conversation with AzMon team, we should use Microsoft Common Schema attributes which means we should populate net.peer.name and net.peer.port (optional if default) instead of url. Path fragments (if important) should appear as attributes (db.name is already here, should we also add collection name as db.cosmosdb.collection?)

sourabh1007 commented 2 years ago

I think we can make both the changes so replacing http. with db.cosmosdb. @kirankumarkolli @FabianMeiswinkel do you guys have any question or concern?

FabianMeiswinkel commented 2 years ago

I disagree on replacing db.url with net.peer.name and net.peer.port - the url contains also critical information like partitionId and replicaId = the dns-hostname/ip + port is only pointing to the transport endpoint which can be considered a load pbalancer in front of the pods used for the replica So min bar we also add PartitionId and ReplicaId - but even then the semantic meaning of the ip+port is probably very idfferent than what the common schema assumes (the link points to a non-existing repo - so couldn't check)

lmolkova commented 2 years ago

the link points to a non-existing repo - so couldn't check

@FabianMeiswinkel you need to join Microsoft org to see this repo: https://repos.opensource.microsoft.com/orgs/microsoft

I disagree on replacing db.url with net.peer.name and net.peer.port - the url contains also critical information like partitionId and replicaId = the dns-hostname/ip + port is only pointing to the transport endpoint which can be considered a load pbalancer in front of the pods used for the replica

The critical information should go to dedicated attributes - let's list what's critical. We could also keep db.url, burt have to populate net.peer.name and net.peer.port too.

The reason is that APM tools don't know what db.url is and even if we convince AzMon to understand it, other tools like Splunk, Dynatrace, AppDynamics, NewRelic, Lightstep, Honeycomb, you name it would not be able to properly show Cosmos. Following conventions is essential to make sure all backends show Cosmos reasonably well.

If we put critical info into dedicated attributes instead, we'd allow users to query by partitionId or replicaId. As long as those are part of url, we won't have this capability with an arbitrary tracing tool.

sourabh1007 commented 2 years ago

@FabianMeiswinkel This db.url won't have replicaid and partition id information. This will be just account name/url.

sourabh1007 commented 2 years ago

@lmolkova do you mean db.cosmosdb.container by db.cosmosdb.collection? Container Name we are already appending with database name.

lmolkova commented 2 years ago

@lmolkova do you mean db.cosmosdb.container by db.cosmosdb.collection? Container Name we are already appending with database name.

container vs collection - let's follow the terminology you have - I have no opinion.

db.name | < Database Name and Container Name >

why so? I assume it could be useful to write queries per container or per db and also it contradicts OTel conventions - check out what Mongo does there.

Any reason why we want to keep them in a single attribute?

sourabh1007 commented 2 years ago

Let me summarize the changes here:

  1. net.* vs db.url : Cosmos Db Url is just a cosmosdb account name, where it goes to get the metadata information only once during initialization (and while refreshing this metadata information). After that all the "real calls" happen at different host/port. So using net.* attribute, will give wrong impression that calls are going to that particular URL. That's why we want to have it as separate property. I have renamed is to db.cosmosdb.account
  2. Added new attribute db.cosmosdb.partitionId, as we think that it is useful to have this information to debug certain scenarios.
  3. Separate out db name and container name, added new attribute to have container name db.cosmosdb.container

@kirankumarkolli @FabianMeiswinkel @lmolkova

lmolkova commented 2 years ago

Thank you @sourabh1007, it looks good!

I few minor things

  1. az.namespace must be a registered Azure resource provider namespace - AzMon will be able to use it to navigate to the resource in Azure i.e. it should be Microsoft.DocumentDB instead of Microsoft.Azure.Cosmos
  2. db.cosmosdb.account:
    • should the example be sourabhjaintemp instead of https://sourabhjaintemp.documents.azure.com/?
    • Or perhaps db.cosmosdb.host if you want to have domain name in it sourabhjaintemp.documents.azure.com?
    • The best option is to call it net.peer.name - you have to populate this standard attribute anyway so AppMap can work properly in the Portal
  3. Please populate net.peer.name - it's essential for Azure Monitor AppMap and any other 3P tracing tool. If you provide custom attribute for it, visualization tools won't be able to show the cosmos destination correctly.
  4. Do you believe content length (db.cosmosdb.request_content_length, db.cosmosdb.response_content_length) attributes are helpful? If the confidence is low, I suggest to defer adding them - it's never too late to add, but impossible to remove
  5. db.cosmosdb.status_code appears twice. Please don't include HTTP status code on CosmosDB spans - HTTP requests will be traced anywhay and will have status on them.
  6. db.cosmosdb.error - there is a span/scope status to indicate success/failure - can we remove this one?
sourabh1007 commented 2 years ago
  1. Makes Sense
  2. db.cosmosdb.account yes example should be sourabhjaintemp.
  3. Cosmosdb doesn't make any network call to a particular URL (it makes call to different URL on the basis of metadata information) , I don't think net.peer.name makes sense here. If it is something which is mandatory, then may be we can decide some workaround (or most close value to have).
  4. Content Length are helpful to understand the payload size, there are some error prone scenarios which are related to this and most importantly, our SLA is dependent on this.
  5. Working on it.
  6. Yes we can remove it as of now.
lmolkova commented 2 years ago

Thanks @sourabh1007

Cosmosdb doesn't make any network call to a particular URL (it makes call to different URL on the basis of metadata information) , I don't think net.peer.name makes sense here. If it is something which is mandatory, then may be we can decide some workaround (or most close value to have).

Information that appears on net.peer.name /ip/port represent remote destination. Application Map will show cosmos node based on it and would allow customers to see which CosmosDB (account) they connect to. They don't have to be precise network destinations, it's enough to have net.peer.name set to sourabhjaintemp.documents.azure.com (or regional endpoint if you want users to distinguish them). I hear your concerns about the name, but if you use a different attribute, we'll have a hard time with both 1st and 3rd party tools visualizing it properly.

lmolkova commented 2 years ago

net.peer.name

Can we use a full hostname like sourabhjaintemp.documents.azure.com? Could be important for non-global, non-public clouds.

Also, is there ever a case where non-default port is used on client side? If so, we should also add net.peer.port when it happens.

FabianMeiswinkel commented 2 years ago

net.peer.name

Can we use a full hostname like sourabhjaintemp.documents.azure.com? Could be important for non-global, non-public clouds.

Also, is there ever a case where non-default port is used on client side? If so, we should also add net.peer.port when it happens.

@lmolkova we intentionally used an identifier which can't be resolved into a real hostname/port. The traces above are logical requests (they might end-up hitting x endpoints). net.peer.name and net.peer.port form the name (and semantic usage of how other databases are using it) indicates that this would be a physical network endpoint. You mentioned that an identifier that can be used to identify a resource for Cosmos is required. So we used an identifier that identifies the account - but to avoid confusion used one that clearly is not a physical networking endpoint. Reason why this seems important is that the vast majority of physical communication is not going to this DNS hostname + port (but the actual backend replica/shards)

lmolkova commented 2 years ago

@FabianMeiswinkel, thanks for the context and it makes total sense. My only worry is that sourabhjaintemp does not include any identifier of the cloud. We can leave it there for now and later come up with a cross-service attribute for the cloud that can be omitted for a global one to save some bandwidth.

FabianMeiswinkel commented 2 years ago

I am totally fine with adding a prefix/suffix for cloud - just want to avoid that anyone would think this is a real network endpoint

lmolkova commented 2 years ago

Been playing with attributes, a few more points:

sourabh1007 commented 2 years ago

1.db.cosmosdb.machine_id: we can remove it..if we are able to filter and identify if failures are coming from same machine or different machine.

  1. db.cosmosdb.regions_contacted : CosmosDb call can failover to mutiple regions. This value can help us to identify if problem is in some particular region. So cover those scenarios where we are expecting/not expecting failovers. We would need an ability to filter on this attribute in order to this analysis.
  2. db.operation : in CosmosDb we don't command, we just have function name which gets call and Operation type.
FabianMeiswinkel commented 2 years ago

Been playing with attributes, a few more points:

  • db.cosmosdb.machine_id - all APM tolls usually have some unique instance id. AzureMonitor has it too, there is no need to define your own attribute.
  • db.cosmosdb.service_endpoints_contacted - should it be an event attribute? Do you expect any special treatment for this property?
  • db.cosmosdb.regions_contacted - what data you expect to be there?
  • db.operation (Create/Read/Upsert/Replace/Patch/Delete/Query) seems to be inconsistent with OTel conventions, in case of Mongo, it's command, something like findAndModify. i.e. having DeleteStreamAsync instead of Delete there would make more sence
  • db.cosmosdb.request_diagnostics - do we want to stamp it as an attribute? If it's an event, let's describe it as an event, not attribute.

db.cosmosdb.machine_id - the captured machine Id if available will be the "Azure VM Unique ID" (https://azure.microsoft.com/en-us/blog/accessing-and-using-azure-vm-unique-id/) - main purpose to try to capture this identifier is to simplify getting the containerId and/or nodeId which are required to start any Network Investigation (via aka.ms/netvma or when you try to engage the CloudNet teams). If Azure monitor has a machine identifier for which there is a reaosnably simplye way to get containerId/nodeId we are good - otherwise I think we should keep it here. Because getting customers to identify the containerId/nodeId in netwrok/connectivity incidents has been really painful in the past.

lmolkova commented 2 years ago

main purpose to try to capture this identifier is to simplify getting the containerId

Azure Monitor supports reading Azure-service specific identifiers (on Functions, AppServices, Azure VMs) and falls back to machine name, which in container environments, I believe, is container id. There is no need to replicate this complicated logic.

FabianMeiswinkel commented 2 years ago

main purpose to try to capture this identifier is to simplify getting the containerId

Azure Monitor supports reading Azure-service specific identifiers (on Functions, AppServices, Azure VMs) and falls back to machine name, which in container environments, I believe, is container id. There is no need to replicate this complicated logic.

Agreed as long as there is really an easy way to get containerId/nodeId from what is getting logged. I can just tell form past experience that neither on Azure Functions (automatic logging) nor App Services (manual logging) there has been an easy way for customers to find this info when they logged to AppInsights. I am happy to test this if someone from AzureMonitor can provide the Kust query or similar that should reliably resolve nodeId/containerId from the attributes in the base schema. If that works, we can definitely drop the machine_Id attribute here.

sourabh1007 commented 9 months ago

Closing as it is already implemented.