Open rynowak opened 3 years ago
/cc @davidfowl @noahfalk
@rynowak thanks for reporting for duty.
I imagine this is not only true for Blazor but for Websockets/SignalR in general.
Is there something we can do here to "disconnect" the distributed tracing?
Is it really correct if we somehow disconnect the Blazor Server session from the original request that started it?
Thanks for contacting us.
We're moving this issue to the Next sprint planning
milestone for future evaluation / consideration. We will evaluate the request when we are planning the work for the next milestone. To learn more about what to expect next and how this issue will be handled you can read more about our triage process here.
Is there something we can do here to "disconnect" the distributed tracing?
I'm currently working around this by setting Activity.Current = null
, and it seems to do the trick. I would imagine that a framework level fix would create a new root activity for each operation that comes through the circuit handler. You can use tags/baggage to attach things like a circuit ID so that correlation is possible.
So it is can be controlled ....
Is it really correct if we somehow disconnect the Blazor Server session from the original request that started it?
That's a great question 😁
I think the major downside of the current behavior is that a really really long-lived circuit could include thousands of descendant spans. Think about @danroth27 giving the counter demo for 10 hours as part of a charity livestream. In the worst case this will break downstream tooling, or in the best case just give you data that's hard to work with. I'm really not sure.
I might also be the first person to notice this, so I figured it was worth a discussion 🤷
@rynowak you bring up good points.
I'm not super familiar with the activity id APIs, but In general I believe the fix here is to "detach it" from the original request activity and then to start a new one per received message on the circuit as you suggest, and then tag it with the circuit info
cc @tarekgh
Thanks for raising this @rynowak. I'll try to take a little time in the next day or two to repro this and understand where the Activities are currently being created so I have more context. I may also find alternative suggestions on how the issue can be resolved. I imagine as .NET developers use distributed tracing more frequently it will behoove us to have good guidance and understanding of common patterns that show up.
You can use tags/baggage to attach things like a circuit ID so that correlation is possible
Links are a new concept added in the 5.0 release of DiagnosticSource.dll that may also be useful here: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/overview.md#links-between-spans
Can we get a debugger dump which we can see where the Activity objects get created. That may be faster I guess if possible. I am guessing there is some activities started and never stopped.
It's an issue we need to address in signalr and then we can have a broader discussion about what behavior we want in general.
I didn't have any background on how the Blazor app front-end and back-end communicate, but given @davidfowl's comment + a little debugging it appears to be SignalR protocol over WebSocket, is that accurate?
In repro app there was a short series of paired Activity start/stops but then 8th Activity start is when Kestrel receives the HTTP request that initiates the WebSocket and that one is long lived by design. Additional activity start/stop pairs get recorded re-entrantly as the server sends messages to itself when the button is clicked. In terms of what to do about it...
[Blazor Server] appears to be SignalR protocol over WebSocket, is that accurate?
Yup
On the server when the nested SignalR protocol provides a new trace-id we could clear the HTTP level Activity and start the new one. Ideally we'd also restore the HTTP Activity after responding. There is a bug doing that today but I assume it will get fixed or we could work around it by manually setting Activity.Current. There may also be some value in adding a link between the HTTP level activity and the SignalR one.
Related to https://github.com/dotnet/aspnetcore/issues/18711
cc @BrennanConroy
I'm gonna take a stab at this. I think it's more than the activity though that's related here. I don't think we should be propagating the execution context from the request to the hub execution. It's especially broken for long polling scenarios where there are multiple requests throughout the execution of the hub.
Ryan wrote:
I might also be the first person to notice this, so I figured it was worth a discussion 🤷
I do remember someone else mentioning this in the context of SignalR a while back, I'm not having any luck finding the issue again unfortunately.
I'm not sure how useful it's going to be to have a separate activity ID for every message flowing across the websocket connection. The traces will then just say things like "Within circuit abcdef
, following receipt of event 4728
, we made HTTP requests to the following URLs...". Does the "event 4728
" information help the person reading the trace?
So far it hasn't been a design goal to capture more information about each user activity within a Blazor Server circuit. If this was an area of growing customer demand, we could start pulling more information out of the rendertree and navigation manager so (for example) we could say things like "Within circuit abcdef
, while on URL /counter
, in reaction to a click
event on a button
element with class btn counter
rendered by CounterComponent
, we made HTTP requests to the following URLs...". But I don't personally know if or how this sort of detail can be communicated usefully to an upstream tracing technology. Also we'd have to use something like async locals to correlate actions to their causal events.
Another argument would be that tracing technologies could offer more value by moving beyond thinking in terms of short HTTP round trips and have first-class support for long-running connections (WebSockets in general, not just Blazor Server). I know that would only address the UX concern about how info is presented in the tracing UI, so it's still applicable to think about how Blazor Server can provide more human-readable context information as per the previous paragraph.
So far it hasn't been a design goal to capture more information about each user activity within a Blazor Server circuit
I think its fine to decide that this isn't a priority for you right now if you aren't seeing enough signal that customers care. My guess is that customers will care about this in proportion to how often apps using SignalR are sluggish and how easy it is to correlate actions on one machine to specific code that runs on another tier. For example if there is always a nice 1:1 correspondence between pushing a button in a Blazor front-end and OnButtonClicked() method running on a single back-end + all the expensive work happens inside OnButtonClicked() then the value of instrumenting the SignalR protocol is fairly low. On the other hand if those SignalR communications are bouncing across relay servers and lots of different code on different machines is being invoked in response then there are numerous places where performance issues could creep in. It would be very hard to track it down without timestamps and causal links on the messages being sent and received at each tier.
If we don't want to do the work of sending identifiers on each message I do think there are lower level amounts of work that would still be quite helpful:
But I don't personally know if or how this sort of detail can be communicated usefully to an upstream tracing technology
For a server<->server SignalR communication many of the supporting pieces are already in place and I could walk you through what remains. For client<->server we presumably also need one of the tracing uploaders like OpenTelemetry or ApplicationInsights to support running in the blazor client environment. I'm happy to get the right people in touch but I suspect its more a question or priorities at this point.
"Within circuit abcdef, following receipt of event 4728, we made HTTP requests to the following URLs...". Does the "event 4728" information help the person reading the trace?
It helps if there is instrumentation on the other end of the communication to tie it to. For server<->server usage there would probably be some of that instrumentation in place. For Blazor client<->server it would be there only if you added it (which isn't hard) and a telemetry uploader supported transmitting it (this part is likely a bit more work).
Hope that helps? I'm glad to chat online if that is easier, I know there is a lot here : )
We're having the same problem.
Setup:
Example:
If we are interested in debugging POST /Appointments in Azure Application Insights end-to-end transaction, then we see ALL distributed tracing for ALL Blazor requests to the public APIs, including all internal API calls, SQL commands, Redis requests etc, for each and every Blazor http request since the user visited the web app.
This becomes extremely verbose and makes debugging even for short user sessions with Application Insights end-to-end transaction impractical. I would expect to be able to explicitly start activities in Blazor to logically group related Blazor requests and to opt-out of the Blazor circuit's root activity.
A workaround we are experimenting with right now is setting Activity.Current = null
in several key places, such as upon creating a new circuit and upon resolving specific early scoped dependencies (per circuit).
We've moved this issue to the Backlog milestone. This means that it is not going to be worked on for the coming release. We will reassess the backlog following the current release and consider this item at that time. To learn more about our issue management process and to have better expectation regarding different types of issues you can read our Triage Process.
Workarounds are now listed here: https://github.com/dotnet/runtime/issues/41072#issuecomment-879395715
Are those all documented?
Thanks for contacting us.
We're moving this issue to the .NET 7 Planning
milestone for future evaluation / consideration. We would like to keep this around to collect more feedback, which can help us with prioritizing this work. We will re-evaluate this issue, during our next planning meeting(s).
If we later determine, that the issue has no community involvement, or it's very rare and low-impact issue, we will close it - so that the team can focus on more important and high impact issues.
To learn more about what to expect next and how this issue will be handled you can read more about our triage process here.
I cannot tell if there is some way to hook in Activity/ActivitySource (OpenTelemetry) into SignalR/Blazor server interactions.
There seems to be no way to hook into the pipeline for the websocket to see interactions no matter how noisy they may be. There is a desire to see tracing when a user clicks on a button and what outgoing HTTP, DB interactions, etc. happen just like an a regular HTTP API.
I recognize the problem isn't as easy as a lot is going on on the websocket. I was just hoping this problem was being worked on. This seems like the more relevant issue to follow. Thanks!
What is the recommended interim solution for websockets tracing to avoid this issue? Right now, entire connection and messages communicated are traced under one trace id and I can't find a proper way to break that in to smaller spans.
https://github.com/dotnet/aspnetcore/issues/29846#issuecomment-771775926
You can create individual spans for each websocket message by making your own activity for them.
Excuse my limited knowledge on this.
For suggested solution above, do we have to indicate Activity.Current = Null after initial websocket connection established?
After initial connection is established for websockets, how can we create new activity for each message? Those subsequent messages won't go through middleware pipeline, right?
It stinks I can't add some kind of hook to instrument things. Blazor Server blurs the lines a lot so knowing where the appropriate spot can be tricky but I'll go down this road for now.
If anyone has any guidance to share on instrumenting Blazor Server, I'd be interested!
I just want to add that the behavior is different when using Azure SignalR Service. Every invocation of a request on the server will result in a new trace. It makes it a bit more confusing when the tracing will differ on the developer's machine (not using AzSignalR) vs. in production (using AzSignalR).
We've discussed this within the team today and have concluded that the best course of action for now will be to document how this can be achieved. For that:
We've moved this issue to the Backlog milestone. This means that it is not going to be worked on for the coming release. We will reassess the backlog following the current release and consider this item at that time. To learn more about our issue management process and to have better expectation regarding different types of issues you can read our Triage Process.
Is there any progress on this? Are there any docs how to use tracing with a blazor server app? Regarding e.g.
A workaround we are experimenting with right now is setting Activity.Current = null in several key places, such as upon creating a new circuit and upon resolving specific early scoped dependencies (per circuit).
How to get these workarounds right? Where to set Activity.Current
?
Wouldn't it be pragmatic not to create a new Activity for every WebSocket message but rather for every navigation using NavigationManager
? I guess this is somewhat opinionated so it might not be a framework-wide solution, but my gut feeling is that it would be better than having a single Activity per Circuit lifetime:
I think it'd be great to have some docs regarding Blazor Server and tracing, maybe with examples how to
Activity.Current = null
in the right place achieves)NavigationManager.LocationChanged
+1 Would really appreciate some guidance here in the meantime. It's a real customer problem today.
Have some long running Blazor Server dashboard pages, and my Application Insights is a mess. One 'Operation Id' is overloaded with so much it's hard to find what I'm looking for in there. As the API see's it, these really are separate operations but the Blazor app insists on stuffing the same traceid header onto every one.
If the default behaviour is not as I expect, I'd like to know how I can conditionally control this on the Blazor Server side. For example something I could do with the injected IHttpClientFactory or HttpClient itself to disconnect it from the ambient 'activity'.
Thanks for contacting us.
We're moving this issue to the .NET 9 Planning
milestone for future evaluation / consideration. We would like to keep this around to collect more feedback, which can help us with prioritizing this work. We will re-evaluate this issue, during our next planning meeting(s).
If we later determine, that the issue has no community involvement, or it's very rare and low-impact issue, we will close it - so that the team can focus on more important and high impact issues.
To learn more about what to expect next and how this issue will be handled you can read more about our triage process here.
+1 on guidance for working around this. Where to put Activity.Current = null
exactly?
SignalR is planning to add activities for SignalR invocations: #51557. We should understand the impact of this work on Blazor Server scenarios and what additional activities we need to add for Blazor Server specifically.
The situation here has improved in .NET 9. All Blazor server interactions will no longer be rooted under a single request span and instead, will render as individual spans in distributed tracing tools.
Blazor has no built-in activities, and we'll need to explore which of the many suggestions on this thread to implement. In .NET 9, you can create an activity to represent the logical operation being performed (for e.g. Buy Item) and that will show up something like this:
e.g.:
An HTTP request nested underneath the "IncrementCount" span.
That's awesome progress. I'm still a little disappointed to see the issue title change 😢. Some of my best work IMO.
I subscribed to this issue just to see the title pop up occasionally in my notifications and make me smile lol 🤣
Describe the bug
The long-lived circuits of Blazor server make distributed tracing not work as expected.
Since each circuit is effectively a long-lived request ... a lot of activity (pun intended) ends up getting traced under the same activity. Basically all outgoing requests/traces from inside of a Blazor server circuit (browser tab) will have the same parent span, which causes them to all get grouped together as a single "root cause" in tracing UIs.
To Reproduce
Create a Blazor server application. Use a button to fire off HTTP requests to another ASP.NET Core endpoint. Hook all of this up to your favorite distributed tracing system and enjoy.
All of the operations you start from a Blazor circuit will end inside the same single logical operation from a tracing POV.