Closed AbhiPrasad closed 1 year ago
From the Slack thread we'll also need the action: the operation for DB spans (SELECT/DELETE etc.), and the http method for HTTP spans (GET/PUT etc.).
We can add all of the db call-level attributes if possible: https://opentelemetry.io/docs/reference/specification/trace/semantic_conventions/database/#call-level-attributes, including db.operation.
db.operation in this case would be the field that would eventually be extracted to be action. Note that we can only provide db.operation if the underlying framework gives us this information; otherwise it will have to be extracted from the query, which has to be done server-side.
I'm still confused about span status - we attach http status codes under span.status already on the SDK. Can we not check for a 504 or a 408 and label timeout appropriately? Why does this require SDK changes?
What is a client timeout here? Presumably if something times out, you never get a response and thus no status code. What do SDKs report in that case? I wouldn't be surprised if a timeout usually translates into some sort of socket error which requires custom handling and might either result in a sentry error (and failed transaction), or is manually handled and the output is undefined.
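To make the two cases in this exchange concrete, here is a minimal sketch of the distinction: a timeout the server reports through a status code (504 or 408) versus a client-side timeout where no response ever arrives. The function name timeout_status and the label 'deadline_exceeded' are assumptions for illustration only, not confirmed SDK behaviour.

```python
import socket
from typing import Optional

# HTTP status codes that signal a timeout reported by a server or gateway.
TIMEOUT_STATUS_CODES = {408, 504}


def timeout_status(status_code: Optional[int] = None,
                   error: Optional[BaseException] = None) -> Optional[str]:
    """Return a timeout label for a finished HTTP span, or None.

    'deadline_exceeded' is an assumed label, used here only for illustration.
    """
    # Case 1: a response arrived and its status code says "timeout".
    if status_code in TIMEOUT_STATUS_CODES:
        return "deadline_exceeded"
    # Case 2: no response at all; the client gave up and raised an error.
    if isinstance(error, (socket.timeout, TimeoutError)):
        return "deadline_exceeded"
    return None
```

The first case needs only the status code the SDK already records. The second case depends on how each HTTP client surfaces the error, which is where custom handling in the SDK would come in.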
Based on a slack thread - here's a snippet from @gggritso about behaviour that he would like to see for timeouts.
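import socket
import urllib.error
import urllib.request

span = {}  # placeholder for the SDK span; not defined in the original snippet

try:
    response = urllib.request.urlopen('http://python.org/', timeout=0.0001)
    html = response.read()
except urllib.error.URLError as e:
    print("URL Error", e.reason)
    # e.reason is an exception object here, so check its type rather than comparing to a string
    if isinstance(e.reason, socket.timeout):
        span['status'] = 'time out'
    raise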
Given we've released the Python SDK with cache hit information, I think we can say we have enough data for v0 of starfish.
action (db operation, http method) is going to be parsed by Relay for the most part (but both Node/Python should be sending them regardless)
span status is still an open concern, but I think we can defer that until v1. cc @alexjillard
the SDK should be sending http.method where possible under span data. In both node/python it was determined that none of the underlying instrumentation could parse and determine db.operation, so this was something relay needed to do.
So Abhi doesn't have to sign into GH while OOO
Add cache hit/miss rate
Any call to cache that doesn't return data (null) is treated as a miss. We define a span that has a cache hit as having the span data field cache.hit as true. Might require SDK to patch cache abstractions of framework (django).
To be done in Python SDK for any frameworks that support this - django is required.
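The hit/miss rule above can be sketched as follows. This is a minimal illustration, not the SDK's implementation: FakeSpan is a hypothetical stand-in for an SDK span, traced_cache_get is a made-up helper name, the op value "cache" is an assumption, and a plain dict plays the role of the framework's cache.

```python
from typing import Any, Dict, Tuple


class FakeSpan:
    """Hypothetical stand-in for an SDK span; it only stores span data."""

    def __init__(self, op: str, description: str) -> None:
        self.op = op
        self.description = description
        self.data: Dict[str, Any] = {}

    def set_data(self, key: str, value: Any) -> None:
        self.data[key] = value


def traced_cache_get(cache: Dict[str, Any], key: str) -> Tuple[Any, FakeSpan]:
    """Look up a key and record cache.hit on a span for the lookup."""
    span = FakeSpan(op="cache", description=key)  # op name is an assumption
    value = cache.get(key)
    # A lookup that returns no data (None) is treated as a miss.
    span.set_data("cache.hit", value is not None)
    return value, span
```

One consequence of this rule is that a key explicitly stored with a null value is also reported as a miss, since the lookup returns no data.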
Add db platform data to span data
Add this information to span data field as db.system, matching OpenTelemetry's well known conventions. For example, db.system of postgresql would indicate that it's a postgres database.
Add cache item size
Add a field to cache spans that defines how big the item being get/set is: cache.item_size. This is an integer and should be in bytes. Blocked by cache hit/miss rate work.
Add information used for action field
The action field on the extracted span metrics is either the operation for DB spans (SELECT/DELETE etc.), or the http method for HTTP spans (GET/PUT etc.).
For operation, we'll be using and setting db.operation on span data. For http method, we'll be using http.method on span data.
If db.operation OR http.method is not set on span data, Relay will have to parse it out from the span description.
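The lookup order described above (explicit span data first, span description as a fallback) might look roughly like this. It is a sketch in Python for illustration only and is not Relay's actual parser: the function name extract_action, the op prefixes "db" and "http", and the assumption that the description starts with the SQL verb or HTTP method are all hypothetical.

```python
import re
from typing import Any, Dict, Optional

# Leading word of a description, e.g. "SELECT" in "SELECT * FROM users".
_FIRST_WORD = re.compile(r"^\s*([A-Za-z]+)")


def extract_action(op: str, data: Dict[str, Any], description: str) -> Optional[str]:
    """Pick the action for a span, preferring explicit span data."""
    if op.startswith("db"):
        explicit = data.get("db.operation")
    elif op.startswith("http"):
        explicit = data.get("http.method")
    else:
        # Only DB and HTTP spans have an action in this sketch.
        return None
    if explicit:
        return str(explicit).upper()
    # Fallback: assume the description begins with the verb or method.
    match = _FIRST_WORD.match(description or "")
    return match.group(1).upper() if match else None
```

Taking the first word is deliberately naive. Real queries can begin with comments or a WITH clause, which is one reason parsing from the description is only a fallback for when the SDK cannot supply the field.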