Consider adding activities for potentially long running network operations

davidfowl commented 8 months ago

Edit by @antonfirsov: The plan is to implement this feature request by introducing a separate Activity for each pooled connection, and link those to the HTTP Request Activity via ActivityLink. See https://github.com/dotnet/runtime/issues/93832#issuecomment-2057794955.

While diving into more distributed traces it would be interesting to break down outbound HTTP request time into a few more pieces to understand exactly where that time is spent. I think it would be interesting to explore adding an Activity for socket connection attempts and DNS resolution.

This activity could be coupled to HTTP traffic, but I think it would make more sense to push this down into the Socket/Dns APIs. That would make it work for all sorts of client APIs that use the underlying socket API (redis, various database drivers etc).

cc @noahfalk @samsp-msft

ghost commented 8 months ago

Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.

Issue Details

While diving into more distributed traces it would be interesting to break down outbound HTTP request time into a few more pieces to understand exactly where that time is spent. I think it would be interesting to explore adding an Activity for socket connection attempts and DNS resolution. This activity could be coupled to HTTP traffic, but I think it would make more sense to push this down into the Socket/Dns APIs. That would make it work for all sorts of client APIs that use the underlying socket API (redis, various database drivers etc).

Author:	davidfowl
Assignees:	-
Labels:	`area-System.Net`
Milestone:	-

MihaZupan commented 8 months ago

I think instrumenting DNS/Sockets with more information would be fine, but I want to point out that it could become misleading / not match user expectations. Related: https://github.com/dotnet/runtime/issues/63159#issuecomment-1515432131

You will rarely see a "perfect" trace like this

◄────────────────────HTTP Request────────────────────►
  ◄──DNS──►
           ◄──Socket──►
                       ◄──TLS──►
                                  ◄──Dependency──►

When a request comes in it may kick off a new connection attempt, but then get picked up by a different connection that became available before the new connection is established. As a result, you could see traces like

◄─────────HTTP Request───────────►
    ◄──────DNS─────►
      ◄──Backend dependency─►
                    ◄────Socket────►
                                    ◄─────TLS────►

Or even

◄─────────HTTP Request───────────►
      ◄──Backend dependency─►
                                       ◄──────DNS─────►
                                                       ◄────Socket────► ...

From the HTTP perspective, you could avoid some confusion by inserting a different activity instead

◄────────────────────────HTTP Request───────────────────────────►
  ◄──Waiting for an available connection──►
                                             ◄──Dependency──►

That highlights the issue that the HTTP request is decoupled from the establishment of new connections.

davidfowl commented 8 months ago

From the HTTP perspective, you could avoid some confusion by inserting a different activity instead

This seems ideal for the HTTP request case. It seems then it's possible that we'd also want to disable the dns and socket activities for the HTTP case? I'd be interested to see if we can get a similar trace for connection establishment in other clients and see if a similar problem exists (presumably it does if there's smart connection pooling going on).

samsp-msft commented 8 months ago

This feels like something that should be done with Events on activities. Putting too many extra activities for this sub-operations will be expensive and noisy. I wonder if the Activity class should have a verbosity setting that would be inherited from its parent, and then code can query the current activity for it, and use that to determine whether to add extra data. That could be extended to be somewhat similar to ILogger, with methods to add event data that will be conditional on the requested verbosity level. The advantage of using Events on the Activity is that they include a timestamp, which makes sense for things like DNS or creating a connection. They could be skipped if the connection already exists.

davidfowl commented 8 months ago

Putting too many extra activities for this sub-operations will be expensive and noisy.

I agree in general but not for expensive networking operations.

The advantage of using Events on the Activity is that they include a timestamp, which makes sense for things like DNS or creating a connection. They could be skipped if the connection already exists.

Lets explore this then, do you have something more concrete we could investigate and visualize in tools?

noahfalk commented 8 months ago

Lets explore this then, do you have something more concrete we could investigate and visualize in tools?

I think Sam is proposing that you visualize this list of span events that is included in the OTLP proto:

https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/trace/v1/trace.proto#L232

I assume when you look at it right now it would be empty, but if you want to experiment with putting some data in there you can use the Activity.AddEvent API.

I wonder if the Activity class should have a verbosity setting that would be inherited from its parent, and then code can query the current activity for it, and use that to determine whether to add extra data

This sounds an awful lot like Activity.IsAllDataRequested property (doc guidance). In general Activity events aren't designed to scale to a high verbosity level but a handful of messages with key timings should work fine.

davidfowl commented 8 months ago

Lets take @MihaZupan's scenario, would be add begin/end events for these operations and visualize those as named events on the overall activity timeline? e.g. HttpClient does the following:

Activity.Current.AddEvent(new("wait for connection"));
await GetConnectionAsync();
Activity.Current.AddEvent(new("resolved connection"));

Is that the idea?

antonfirsov commented 5 months ago

@davidfowl is there a specific customer ask behind this proposal (eg, from Aspire)?

davidfowl commented 5 months ago

We were debugging connection slowness with aspire and more details would have helped speed that up.

samsp-msft commented 5 months ago

Lets take @MihaZupan's scenario, would be add begin/end events for these operations and visualize those as named events on the overall activity timeline? e.g. HttpClient does the following:
Activity.Current.AddEvent(new("wait for connection"));
await GetConnectionAsync();
Activity.Current.AddEvent(new("resolved connection"));
Is that the idea?

I would suggest that for the events you don't want begin/end items, just the ones that have been achieved, so you'd get something like:

DNS address resolved
Socket opened
TLS Negotiated
Request sent

If any of them are skipped it doesn't so much matter. But if there is a DNS item in there you can see how that relates to the overall time for the span.

davidfowl commented 5 months ago

I like the idea of using events and also doing it in HttpClient so that we have a better view of the http connection time (based on https://github.com/dotnet/runtime/issues/93832#issuecomment-1775459378)

antonfirsov commented 3 months ago

Triage: this doesn't look critical compared to other asks, tentatively pushing to future. @samsp-msft @davidfowl if you have concerns with that, let's sync!

davidfowl commented 3 months ago

It's not critical no, but it seems cheap to add. Do you have an idea of the cost?

samsp-msft commented 3 months ago

This should be a relatively low cost way to instrument http client in a way that can really help developers understand what is going on in the http stack in a complex production environment. Without this kind of telemetry, they don't know if it's DNS that is taking time, if a new connection is needed etc. A handful of events that can act as markers/signposts in the trace will go a long way to resolving that issue.

antonfirsov commented 3 months ago

I fail to understand how does switching from sub-activities to events answer the concerns from https://github.com/dotnet/runtime/issues/93832#issuecomment-1775459378. If an HTTP request initiates connection attempt A, but then quickly gets served by a connection B that becomes free in the meanwhile (closing the HTTP span), the potentially long running A connection attempt would have no Activity.Current to log its events to, unless we are fine to define an artificial Connection activity that is allowed to have a lifetime mismatching the HTTP Request span:

◄────────HTTP Request───────►
      ▲ ◄────Backend──────►
   ◄──|────────────────────────────────Connection A ────────────────────────►
      |                         ▲                  ▲                      ▲
      |                         ╚⭍ DNS resolved    ╚⭍ socket connected    ╚⭍ TLS done
      |
      ╚ served by connection B

Is this really what we want? Or am I'm misunderstanding something?

Note: the Waiting for an available connection activity proposed at the end of https://github.com/dotnet/runtime/issues/93832#issuecomment-1775459378 has no knowledge of DNS/socket/TLS. It's about the request sitting in the queue until a connection becomes available.

antonfirsov commented 3 months ago

they don't know if it's DNS that is taking time

We have the dns.lookup.duration metric for this. Instead of headaches with activity lifetimes, wouldn't it be better to add more metrics instead?

davidfowl commented 3 months ago

I fail to understand how does switching from sub-activities to events answer the concerns from https://github.com/dotnet/runtime/issues/93832#issuecomment-1775459378. If an HTTP request initiates connection attempt A, but then quickly gets served by a connection B that becomes free in the meanwhile (closing the HTTP span), the potentially long running A connection attempt would have no Activity.Current to log its events to, unless we are fine to define an artificial Connection activity that is allowed to have a lifetime mismatching the HTTP Request span:

I thought we agreed that doing it at the http connection layer makes that problem go away. We would step back from trying to do it at the dns layer and focus on http only.

We have the dns.lookup.duration metric for this. Instead of headaches with activity lifetimes, wouldn't it be better to add https://github.com/dotnet/runtime/issues/88383 instead?

I don't why it would be better. If I had to pick one though I would pick this one.

samsp-msft commented 3 months ago

For the purpose of this - we shouldn't be tracking connections as activities as they are not specific to a single request - they are potentially long running, and for the most case outside the control of the user. The get requests from the client are what the developer has control over, and so wants to understand what is happening. That's why I am suggesting putting events on that activity that correspond to how that request was processed. If it needs to establish a connection, and we can do the association between that connection and the request, then lets tag the request with the events that did occur with processing the request. If it turns out that dns was slow, or establishing the connection was slow, then those will show as late markers on the activity. It it takes time to get a connection from the pool, that will also show. The beauty of firing events is that they are optional - it doesn't have to try and line up parent/child relationships for them.

We have the dns.lookup.duration metric for this. Instead of headaches with activity lifetimes, wouldn't it be better to add https://github.com/dotnet/runtime/issues/88383 instead?

Metrics tell you the aggregate story, but not what happened for an individual request.

In the above image, the bottom blue bar is the http client activity for making a request to the API service. The yellow bar is the API service handling that request. There is a 129ms gap between them - currently the developer has no idea what that is due to. The goal of this should be to supply events so the beginning of the outgoing trace includes the info as to why.

samsp-msft commented 3 months ago

I talked with @antonfirsov and have a bit more of an idea of the problem. Because of the connection pool the requests and connections are pretty independent, its only when the request is sent on an established connection that the relationship is created, and then its only for the duration of that request. One possibility that I am warming up to is to create separate activities/spans for the connections, and then use links between the activities/spands(rather than a parent child) to establish the relationships. The connection could have events for each of its stages of operation, including DNS, TLS handshake etc. Some of those events could be for specific requests, and then tag the activity for the request as part of the event. Similarly the activity for the request could be given back the id for the activity/span for the connection, which it would add as link. The connection would then not have a parent/child relationship with any of the requests, but would have events that link to the spans. The request would have a link to the connection, and then events on that connection would show when that request was handled.

antonfirsov commented 2 months ago

Looks like this should be feasible with ActivityLinks, however the best way to implement it would be by utilizing #97680:

Without Activity.AddLink(link) we need to know the connection ActivityContext in advance which forces us to create an additional activity request being processed by connection when a connection becomes available. Such an activity would bring overhead with little or no practical value.
@samsp-msft 's https://github.com/dotnet/runtime/issues/93832#issuecomment-2057794955 proposes to link the connection being assigned to the request. This is an unambiguous link between a request and connection, however there would be no links to failing connections since they are never dispatched to requests. This feels like loss of important diagnostic data to me. To overcome this, we may consider also linking the connection an incoming request spans (if any). If a connection fails it would be visible on its originating request telemetry data, however note that the failure of a connection doesn't necessarily imply that the originating request failed. If we decide to add this link as well, we definitely need #97680.

martinjt commented 2 months ago

Can I suggest that if this is implemented with Links that a separate ActivitySource is used.

This will ensure that a user from the outside will be able to opt-in to create those Activity object and not have to do post-processing in the collector, or in a custom sampler.

For events, there is an issue around the size of the individual Activity/Span if it get really noisy (from a backend perspective).

I suppose overall, the ask is, can we make this opt-in vs opt-out, and for links the best way to do this is a separate ActivitySource.

samsp-msft commented 2 months ago

Can I suggest that if this is implemented with Links that a separate ActivitySource is used.

That makes sense. Use something like System.Net.Http.Connections for the connection ActivitySource. If there is no connection Activity, it won't get linked on the request, and vice versa, the connection won't link back for the request.

martinjt commented 2 months ago

That's the idea, but it would be a check on the listeners for the ActivitySource for the connections instead of the Activity null check I would imagine.

If its a link from the http Activity, to the connection though, I'd imagine you don't actually need the .AddLink() support since the Http Activity would be created at a point when the connection activity has been created.

The other question I had was the structure of the connection trace, and specifically, which Activity are you planning on linking to in that tree.

The edge case I'm thinking about is the fact that you could add listeners to the Connections ActivitySource after the application has started, and since those connections are long lived, there may not have been an Activity created. You're also having to store the trace and Span information in memory for every connection, so that may add up in memory?

antonfirsov commented 2 months ago

If its a link from the http Activity, to the connection though, I'd imagine you don't actually need the .AddLink() support since the Http Activity would be created at a point when the connection activity has been created.

The activity is being created and started in DiagnosticHandler before forwarding the request to the innermost platform handler. We have no knowledge of the concept of connections at that point.

https://github.com/dotnet/runtime/blob/9068070922266d6b8f26775ac537419725c6fa2d/src/libraries/System.Net.Http/src/System/Net/Http/DiagnosticsHandler.cs#L112-L117

The other question I had was the structure of the connection trace, and specifically, which Activity are you planning on linking to in that tree.

There would be a root Activity for every connection which can be linked to the requests using that connection when using SocketsHttpHandler connection tracing enabled. DNS+TCP+TLS diagnostic info would be added either by spinning up sub-activities from the connection activity or by emitting ActivityEvents to the connection activity.

it would be a check on the listeners for the ActivitySource for the connections instead of the Activity null check

I have yet to understand and figure out what's best here, but I think the conditions for creating a connection activity would be similar to the conditions for creating the request activity just with other ActivitySource & DiagnosticListener:

https://github.com/dotnet/runtime/blob/9068070922266d6b8f26775ac537419725c6fa2d/src/libraries/System.Net.Http/src/System/Net/Http/DiagnosticsHandler.cs#L59-L70

The edge case I'm thinking about is the fact that you could add listeners to the Connections ActivitySource after the application has started, and since those connections are long lived, there may not have been an Activity created.

This is an existing problem with connection metrics already. I think we have no other choice but to accept it, I don't se a solution with the existing distributed tracing APIs.

You're also having to store the trace and Span information in memory for every connection, so that may add up in memory?

I'm not sure if connection tracing costs would be significant compared to request tracing costs under high load, we need to benchmark to see. It will be an opt-in feature anyways.

davidfowl commented 1 day ago

🥳

dotnet / runtime

Consider adding activities for potentially long running network operations #93832