Closed sourabh1007 closed 9 months ago
Few comments:
Connection Mode - Db URL is actually just an Account Name, So you can't actually identify if it is direct or gateway mode by this url. User Agent : user agent string can be large but it can not be very large (as we are limiting the size due to other constraint we have). It is something already generated at the time of client instance creation So overhead is just to put it into dictionary.
Added list of Default Information
Thank you for summarizing it!
I have a couple of comments:
Http attributes: we should not use HTTP attributes on DB calls. If there are nested HTTP calls, they will be traced automatically and will have all required attributes. So I'd suggest removing below attributes. If it's really helpful to have them and they apply to DB call, we should populate them with db prefix and pick a right name
http.status_code
http.user_agent
http.request_content_length
http.response_content_length
db.url
- based on conversation with AzMon team, we should use Microsoft Common Schema attributes which means we should populate net.peer.name
and net.peer.port
(optional if default) instead of url. Path fragments (if important) should appear as attributes (db.name
is already here, should we also add collection name as db.cosmosdb.collection
?)
I think we can make both the changes so replacing http
. with db.cosmosdb
. @kirankumarkolli @FabianMeiswinkel do you guys have any question or concern?
I disagree on replacing db.url with net.peer.name and net.peer.port - the url contains also critical information like partitionId and replicaId = the dns-hostname/ip + port is only pointing to the transport endpoint which can be considered a load pbalancer in front of the pods used for the replica So min bar we also add PartitionId and ReplicaId - but even then the semantic meaning of the ip+port is probably very idfferent than what the common schema assumes (the link points to a non-existing repo - so couldn't check)
the link points to a non-existing repo - so couldn't check
@FabianMeiswinkel you need to join Microsoft org to see this repo: https://repos.opensource.microsoft.com/orgs/microsoft
I disagree on replacing db.url with net.peer.name and net.peer.port - the url contains also critical information like partitionId and replicaId = the dns-hostname/ip + port is only pointing to the transport endpoint which can be considered a load pbalancer in front of the pods used for the replica
The critical information should go to dedicated attributes - let's list what's critical. We could also keep db.url
, burt have to populate net.peer.name
and net.peer.port
too.
The reason is that APM tools don't know what db.url
is and even if we convince AzMon to understand it, other tools like Splunk, Dynatrace, AppDynamics, NewRelic, Lightstep, Honeycomb, you name it would not be able to properly show Cosmos. Following conventions is essential to make sure all backends show Cosmos reasonably well.
If we put critical info into dedicated attributes instead, we'd allow users to query by partitionId or replicaId. As long as those are part of url, we won't have this capability with an arbitrary tracing tool.
@FabianMeiswinkel This db.url
won't have replicaid and partition id information. This will be just account name/url.
@lmolkova do you mean db.cosmosdb.container
by db.cosmosdb.collection
? Container Name we are already appending with database name.
@lmolkova do you mean
db.cosmosdb.container
bydb.cosmosdb.collection
? Container Name we are already appending with database name.
container
vs collection
- let's follow the terminology you have - I have no opinion.
db.name | < Database Name and Container Name >
why so? I assume it could be useful to write queries per container or per db and also it contradicts OTel conventions - check out what Mongo does there.
Any reason why we want to keep them in a single attribute?
Let me summarize the changes here:
net.*
vs db.url
: Cosmos Db Url is just a cosmosdb account name, where it goes to get the metadata information only once during initialization (and while refreshing this metadata information). After that all the "real calls" happen at different host/port. So using net.*
attribute, will give wrong impression that calls are going to that particular URL. That's why we want to have it as separate property. I have renamed is to db.cosmosdb.account
db.cosmosdb.partitionId
, as we think that it is useful to have this information to debug certain scenarios.db.cosmosdb.container
@kirankumarkolli @FabianMeiswinkel @lmolkova
Thank you @sourabh1007, it looks good!
I few minor things
az.namespace
must be a registered Azure resource provider namespace - AzMon will be able to use it to navigate to the resource in Azure i.e. it should be Microsoft.DocumentDB
instead of Microsoft.Azure.Cosmos
db.cosmosdb.account
:
sourabhjaintemp
instead of https://sourabhjaintemp.documents.azure.com/
?db.cosmosdb.host
if you want to have domain name in it sourabhjaintemp.documents.azure.com
? net.peer.name
- you have to populate this standard attribute anyway so AppMap can work properly in the Portalnet.peer.name
- it's essential for Azure Monitor AppMap and any other 3P tracing tool. If you provide custom attribute for it, visualization tools won't be able to show the cosmos destination correctly.db.cosmosdb.request_content_length
, db.cosmosdb.response_content_length
) attributes are helpful? If the confidence is low, I suggest to defer adding them - it's never too late to add, but impossible to remove db.cosmosdb.status_code
appears twice. Please don't include HTTP status code on CosmosDB spans - HTTP requests will be traced anywhay and will have status on them.db.cosmosdb.error
- there is a span/scope status to indicate success/failure - can we remove this one?db.cosmosdb.account
yes example should be sourabhjaintemp
.net.peer.name
makes sense here. If it is something which is mandatory, then may be we can decide some workaround (or most close value to have).Thanks @sourabh1007
Cosmosdb doesn't make any network call to a particular URL (it makes call to different URL on the basis of metadata information) , I don't think net.peer.name makes sense here. If it is something which is mandatory, then may be we can decide some workaround (or most close value to have).
Information that appears on net.peer.name
/ip/port represent remote destination. Application Map will show cosmos node based on it and would allow customers to see which CosmosDB (account) they connect to. They don't have to be precise network destinations, it's enough to have net.peer.name
set to sourabhjaintemp.documents.azure.com
(or regional endpoint if you want users to distinguish them).
I hear your concerns about the name, but if you use a different attribute, we'll have a hard time with both 1st and 3rd party tools visualizing it properly.
net.peer.name
Can we use a full hostname like sourabhjaintemp.documents.azure.com
? Could be important for non-global, non-public clouds.
Also, is there ever a case where non-default port is used on client side? If so, we should also add net.peer.port
when it happens.
net.peer.name
Can we use a full hostname like
sourabhjaintemp.documents.azure.com
? Could be important for non-global, non-public clouds.Also, is there ever a case where non-default port is used on client side? If so, we should also add
net.peer.port
when it happens.
@lmolkova we intentionally used an identifier which can't be resolved into a real hostname/port. The traces above are logical requests (they might end-up hitting x endpoints). net.peer.name and net.peer.port form the name (and semantic usage of how other databases are using it) indicates that this would be a physical network endpoint. You mentioned that an identifier that can be used to identify a resource for Cosmos is required. So we used an identifier that identifies the account - but to avoid confusion used one that clearly is not a physical networking endpoint. Reason why this seems important is that the vast majority of physical communication is not going to this DNS hostname + port (but the actual backend replica/shards)
@FabianMeiswinkel, thanks for the context and it makes total sense. My only worry is that sourabhjaintemp
does not include any identifier of the cloud. We can leave it there for now and later come up with a cross-service attribute for the cloud that can be omitted for a global one to save some bandwidth.
I am totally fine with adding a prefix/suffix for cloud - just want to avoid that anyone would think this is a real network endpoint
Been playing with attributes, a few more points:
db.cosmosdb.machine_id
- all APM tolls usually have some unique instance id. AzureMonitor has it too, there is no need to define your own attribute.db.cosmosdb.service_endpoints_contacted
- should it be an event attribute? Do you expect any special treatment for this property?db.cosmosdb.regions_contacted
- what data you expect to be there?db.operation
(Create/Read/Upsert/Replace/Patch/Delete/Query
) seems to be inconsistent with OTel conventions, in case of Mongo, it's command, something like findAndModify
. i.e. having DeleteStreamAsync
instead of Delete
there would make more sencedb.cosmosdb.request_diagnostics
- do we want to stamp it as an attribute? If it's an event, let's describe it as an event, not attribute.1.db.cosmosdb.machine_id
: we can remove it..if we are able to filter and identify if failures are coming from same machine or different machine.
db.cosmosdb.regions_contacted
: CosmosDb call can failover to mutiple regions. This value can help us to identify if problem is in some particular region. So cover those scenarios where we are expecting/not expecting failovers.
We would need an ability to filter on this attribute in order to this analysis.db.operation
: in CosmosDb we don't command, we just have function name which gets call and Operation type.Been playing with attributes, a few more points:
db.cosmosdb.machine_id
- all APM tolls usually have some unique instance id. AzureMonitor has it too, there is no need to define your own attribute.db.cosmosdb.service_endpoints_contacted
- should it be an event attribute? Do you expect any special treatment for this property?db.cosmosdb.regions_contacted
- what data you expect to be there?db.operation
(Create/Read/Upsert/Replace/Patch/Delete/Query
) seems to be inconsistent with OTel conventions, in case of Mongo, it's command, something likefindAndModify
. i.e. havingDeleteStreamAsync
instead ofDelete
there would make more sencedb.cosmosdb.request_diagnostics
- do we want to stamp it as an attribute? If it's an event, let's describe it as an event, not attribute.
db.cosmosdb.machine_id
- the captured machine Id if available will be the "Azure VM Unique ID" (https://azure.microsoft.com/en-us/blog/accessing-and-using-azure-vm-unique-id/) - main purpose to try to capture this identifier is to simplify getting the containerId and/or nodeId which are required to start any Network Investigation (via aka.ms/netvma or when you try to engage the CloudNet teams). If Azure monitor has a machine identifier for which there is a reaosnably simplye way to get containerId/nodeId we are good - otherwise I think we should keep it here. Because getting customers to identify the containerId/nodeId in netwrok/connectivity incidents has been really painful in the past.
main purpose to try to capture this identifier is to simplify getting the containerId
Azure Monitor supports reading Azure-service specific identifiers (on Functions, AppServices, Azure VMs) and falls back to machine name, which in container environments, I believe, is container id. There is no need to replicate this complicated logic.
main purpose to try to capture this identifier is to simplify getting the containerId
Azure Monitor supports reading Azure-service specific identifiers (on Functions, AppServices, Azure VMs) and falls back to machine name, which in container environments, I believe, is container id. There is no need to replicate this complicated logic.
Agreed as long as there is really an easy way to get containerId/nodeId from what is getting logged. I can just tell form past experience that neither on Azure Functions (automatic logging) nor App Services (manual logging) there has been an easy way for customers to find this info when they logged to AppInsights. I am happy to test this if someone from AzureMonitor can provide the Kust query or similar that should reliably resolve nodeId/containerId from the attributes in the base schema. If that works, we can definitely drop the machine_Id attribute here.
Closing as it is already implemented.
Status: Experimental Target SDK: .Net SDK
Operations To Cover
Database / Container Operations : Cosmos.CreateDatabaseAsync / Cosmos.CreateContainerAsync / Cosmos.DeleteStreamAsync Point Operations : Cosmos.CreateItemAsync / Cosmos.ReadItemAsync / Cosmos.UpsertItemAsync / Cosmos.ReplaceItemAsync / Cosmos.PatchItemAsync / Cosmos.DeleteItemAsync Stream Operations : Cosmos.CreateItemStreamAsync / Cosmos.UpsertItemStreamAsync / Cosmos.ReadItemStreamAsync / Cosmos.ReplaceItemStreamAsync / Cosmos.PatchItemStreamAsync / Cosmos.DeleteItemStreamAsync Batch Operations : TBD Bulk Operation : TBD Query Operations : Cosmos.Typed FeedIterator ReadNextAsync (for each page)
Attributes
AI Default Attributes:
Other default attributes are there e.g device details
Proposal For .Net Cosmos DB SDK
Custom Attributes:
Azure Specific Attributes
Common Database Attributes
Cosmos DB Specific
Request Level Information:
Open Telemetry Standard for any Exception
java.net.ConnectException; ``OSError
Division by zero; Can't convert 'int' object to str implicitly
Exception in thread "main" java.lang.RuntimeException: Test exception\n at com.example.GenerateTrace.methodB(GenerateTrace.java:13)\n at com.example.GenerateTrace.methodA(GenerateTrace.java:9)\n at com.example.GenerateTrace.main(GenerateTrace.java:5)
Following Open Telemetry Conventions (Note: Status of below conventions is Experimental):
1. https://github.com/Azure/azure-sdk/blob/main/docs/tracing/distributed-tracing-conventions.yml