dotnet / aspnetcore

ASP.NET Core is a cross-platform .NET framework for building modern cloud-based web applications on Windows, Mac, or Linux.
https://asp.net
MIT License

Blazor Server keep-alive & reconnection issues #48724

Open garrettlondon1 opened 1 year ago

garrettlondon1 commented 1 year ago

Is there an existing issue for this?

Is your feature request related to a problem? Please describe the problem.

SignalR connection has a Keep-Alive (default 15 seconds) which is the criteria for the reconnection modal appearing.

This means that if the client disconnects 1 second after the last keep-alive and the user then clicks a button, they wait up to 14 seconds for the reconnection modal.

During that time, no DOM updates can be communicated over SignalR and the user can continue clicking buttons.

The worst part of this is:

This is a very poor user experience for users who expect feedback from buttons/actions that cannot be communicated from the server to the client via the broken socket connection.

The developer has no way of intercepting any client side event within this potential 14 seconds of UI interaction

The reconnection modal is fine, you can customize the appearance pretty nicely. It's just about the developer meeting the reconnection process when it happens, not 15 seconds after it happens

Repro steps

Expected result: The disconnected UI is displayed almost immediately, prior to or in response to any UI interaction.

Actual result: The disconnected UI isn't displayed, and clicking on the UI elements does nothing for about 15 seconds.

Describe the solution you'd like

Have not thought about the implications of this, but depending on client side interactions, we could reset the keep-alive and ping on every @onclick handler (or any other condition) and throttle it every 2-3 seconds to prevent continuous pinging of the server if a user spam clicks?
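A minimal sketch of the throttling part of this idea, in plain JavaScript. `createThrottledPing`, `ping`, and `now` are hypothetical names used only for illustration; they are not Blazor or SignalR APIs, and how the ping itself would reach the server is left open.

```javascript
// Hypothetical helper: wraps a ping function so that it runs at most once per
// `minIntervalMs`, no matter how often the user clicks.
// `ping` and `now` are illustrative parameters, not Blazor or SignalR APIs.
function createThrottledPing(ping, minIntervalMs, now = Date.now) {
  let lastPingAt = -Infinity; // no ping has been sent yet

  return function onUserInteraction() {
    const t = now();
    if (t - lastPingAt < minIntervalMs) {
      return false; // too soon after the previous ping, so skip this one
    }
    lastPingAt = t;
    ping();
    return true;
  };
}
```

Wiring the returned function to every click would send at most one ping per `minIntervalMs` (for example 2000 to 3000 ms), even if the user spam clicks.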

It may seem silly to guard against the 0.1% or fewer cases where the user gets disconnected and experiences a hanging UI, but it is such a significant issue (if the reconnect modal doesn't properly appear) that I feel it's warranted.

Additional context

Yes, state being lost and the ability to resume circuits are important, but I think this is a bigger fundamental problem. One experience of this will pretty much turn you off the framework for any enterprise web app, and send you the SPA way.

ghost commented 1 year ago

We've moved this issue to the Backlog milestone. This means that it is not going to be worked on for the coming release. We will reassess the backlog following the current release and consider this item at that time. To learn more about our issue management process and to have better expectation regarding different types of issues you can read our Triage Process.

MackinnonBuck commented 1 year ago

Thanks for reaching out! We've moved this issue to the backlog for now to gather additional feedback from the community about whether this is a change that should be made.

surayya-MS commented 1 year ago

Related https://github.com/dotnet/aspnetcore/issues/48675

halter73 commented 1 year ago

Have not thought about the implications of this, but depending on client side interactions, we could reset the keep-alive and ping on every @onclick handler (or any other condition) and throttle it every 2-3 seconds to prevent continuous pinging of the server if a user spam clicks?

I'm pretty sure the SignalR Core clients already reset the keep-alive timeout every time they receive any message from the server, not just a keep-alive. I also think the server only bothers sending a keep alive if there are no other messages sent within the KeepAliveInterval. Correct me if I'm wrong @BrennanConroy.

What happens if you lower the KeepAliveInterval and ClientTimeoutInterval?

garrettlondon1 commented 1 year ago

@halter73 I don't think it's about the keep-alive on the server. It's really the client-side keep-alive that matters: how often the client pings the server. In Blazor Server you can specify the HubOptions in Startup for the server, and modify the blazor.server.js startup to change it on the client, as seen in #48675. The client needs to know if it's disconnected in order to show the Reconnection modal defined in _Host.cshtml, which lives on the client.

I lowered the client-side keep-alive to 1 second, which detects disconnects very quickly. The problem is that every client then pings my server every second, even when the user is doing nothing, although it's working nicely. The Client Timeout interval is when the "Disconnected" modal appears, so I think it's fine to leave it at any value, since the main problem I'm trying to solve here is to show "Reconnecting" immediately when the client disconnects, not when the next keep-alive (15 seconds by default) is executed.

garrettlondon1 commented 1 year ago

Related #32113

garrettlondon1 commented 1 year ago

https://learn.microsoft.com/en-us/aspnet/core/blazor/host-and-deploy/server?view=aspnetcore-7.0

The KeepAliveInterval isn't directly related to the reconnection UI appearing. The Keep-Alive interval doesn't necessarily need to be changed. If the reconnection UI appearance issue is due to timeouts, the ClientTimeoutInterval and HandshakeTimeout can be increased and the Keep-Alive interval can remain the same. The important consideration is that if you change the Keep-Alive interval, make sure that the client timeout value is at least double the value of the Keep-Alive interval and that the Keep-Alive interval on the client matches the server setting.

Can the docs provide more clarity about how the "Keep-Alive isn't directly related to the reconnection UI appearing"?
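For what it's worth, the one concrete rule in that quoted paragraph (the timeout should be at least double the keep-alive interval) can be written down as a small check. This is only a sketch: `checkIntervals` is a hypothetical helper, not part of Blazor or SignalR, and it doesn't answer the question of how the keep-alive relates to the reconnection UI.

```javascript
// Hypothetical helper: checks the rule quoted from the docs, that the timeout
// is at least double the keep-alive interval. Both values are in milliseconds.
function checkIntervals(keepAliveMs, timeoutMs) {
  if (!(keepAliveMs > 0) || !(timeoutMs > 0)) {
    return { ok: false, reason: "intervals must be positive" };
  }
  if (timeoutMs < 2 * keepAliveMs) {
    return {
      ok: false,
      reason: "timeout must be at least double the keep-alive interval",
    };
  }
  return { ok: true, reason: "" };
}
```

With a 15 second keep-alive, a 30 second timeout passes the check and a 20 second timeout does not. Lowering the keep-alive to 1 second, as described above, passes with any timeout of 2 seconds or more.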

alexyakunin commented 1 year ago

I don't think this happens exactly as you describe - as far as I understand,

The idea of sending a keep-alive to detect the disconnection faster seems meaningless, because TCP requires an ACK for any transmission, not only for a keep-alive packet. And it's easy to conclude that such a transmission happens when you click a button, to deliver the event notification to the server.

alexyakunin commented 1 year ago

What might be the real cause is SignalR reconnection - I am not sure about all the details, but it might be masking the fact that websocket connection is dead & it tries to reconnect while behaving like nothing happened.

garrettlondon1 commented 1 year ago

I've been going through src/Components/Web.JS/src/Platform/Circuits/DefaultReconnectionHandler.ts to try and get a better understanding.. but no luck.

The Keep-Alive interval (keepAliveIntervalInMilliseconds or KeepAliveInterval) isn't directly related to the reconnection UI appearing. The Keep-Alive interval doesn't necessarily need to be changed. If the reconnection UI appearance issue is due to timeouts, the server timeout can be increased and the Keep-Alive interval can remain the same. The important consideration is that if you change the Keep-Alive interval, make sure that the timeout value is at least double the value of the Keep-Alive interval and that the Keep-Alive interval on the server matches the client setting.

@danroth27 @MackinnonBuck @jongalloway Terms like "isn't directly related" or "doesn't necessarily need to be changed".. fairly confusing for technical documentation.

Imagine you're building a CRUD app and the websocket drops when your user clicks "Create sensitive record", nothing happens, no reconnection modal.. Your user will continue to click that button dozens of times and once the connection re-establishes, now you have flooded your API, and the user is even more frightened.

From a framework perspective server-side blazor needs to be able to offer the application developer the choice to immediately notify the user of the reconnecting process, or leave the default settings. Note: only have experience with Blazor Server deployed on App Service using Azure SignalR

@alexyakunin

but it might be masking the fact that websocket connection is dead & it tries to reconnect while behaving like nothing happened.

If what you are saying here is true: the websocket connection has severed, and it's trying in the background without triggering the Reconnecting modal, why delay the Reconnecting modal?

danroth27 commented 1 year ago

Terms like "isn't directly related" or "doesn't necessarily need to be changed".. fairly confusing for technical documentation.

@garrettlondon1 Agreed! To help get this fixed in the docs, could you please report an issue for the specific doc page by clicking the link at the bottom of the doc page?


garrettlondon1 commented 1 year ago

@danroth27 Much appreciated, I will take care of that right now.

garrettlondon1 commented 1 year ago

Hey @mkArtakMSFT curious, why does this fall under the umbrella of Technical Debt? Even with .NET8, my understanding is that nothing has changed with the InteractiveServer components (Blazor Server), they rely on the same SignalR connection.

This issue is geared towards solving one of the biggest and still current issues of Blazor Server itself: poor UI experience with socket disconnects.

glen-84 commented 11 months ago

If there's no connectivity, it seems dangerous to record all events and replay them on reconnection.

I agree that the disconnected UI should display immediately when an action fails.

garrettlondon1 commented 11 months ago

If there's no connectivity, it seems dangerous to record all events and replay them on reconnection.

I agree that the disconnected UI should display immediately when an action fails.

Exactly @glen-84, very dangerous... For some applications, sure! Example: @captainsafia and @halter73 's AMAZING .NET8 demo with the websocket going down in the chat app "going out for lunch". But "stateful reconnection" is not a one-size fits all solution!

For an enterprise application which has important DB calls.. If a button state DOM change can't even be communicated over SignalR, now the user has no idea that if they click 20 times, 20 db calls will be queued when they reconnect.. And now you're forcing InteractiveServer component developers to implement some kind of rate limiting/serious error handling when they have no public API besides UI interactions?

Interactive Server components are not background websocket connections, where you can just use JS to handle in-component UI state upon SignalR disconnects. The whole thing IS SIGNALR!

ghost commented 11 months ago

Thanks for contacting us.

We're moving this issue to the .NET 9 Planning milestone for future evaluation / consideration. We would like to keep this around to collect more feedback, which can help us with prioritizing this work. We will re-evaluate this issue, during our next planning meeting(s). If we later determine, that the issue has no community involvement, or it's very rare and low-impact issue, we will close it - so that the team can focus on more important and high impact issues. To learn more about what to expect next and how this issue will be handled you can read more about our triage process here.

halter73 commented 10 months ago

We want to look into having the Blazor client validate that each non-keep-alive message it sends to the server gets a quick ack. The default could be as low as 2 seconds but should be configurable. This could probably be implemented using Hub method return values.
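A rough sketch of what such an ack timeout could look like on the client, assuming each send yields a promise that settles when the server responds. `withAckTimeout` and `sendPromise` are hypothetical names for illustration only; this is not how Blazor implements it.

```javascript
// Rejects if `sendPromise` has not settled within `ackTimeoutMs`.
// `sendPromise` stands in for whatever promise the client gets back from a
// hub invocation; this is an illustrative wrapper, not the Blazor implementation.
function withAckTimeout(sendPromise, ackTimeoutMs = 2000) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`no ack within ${ackTimeoutMs} ms`)),
      ackTimeoutMs
    );
  });
  // Whichever settles first wins; the timer is cleared either way.
  return Promise.race([sendPromise, timeout]).finally(() => clearTimeout(timer));
}
```

A caller could treat the rejection as the signal to show the reconnecting UI right away instead of waiting for the keep-alive timeout.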

garrettlondon1 commented 10 months ago

We want to look into having the Blazor client validate that each non-keep-alive message it sends to the server gets a quick ack. The default could be as low as 2 seconds but should be configurable. This could probably be implemented using Hub method return values.

The UX behavior around this is also important.

For example, on a Blazor Web App which is majority SSR, and one small InteractiveServer component.

In a Blazor Web App that is mostly InteractiveServer based (whole pages, etc).. the traditional reconnection modal is better because if there is no interactivity, the UX is severely degraded as explained above.

ProTip commented 9 months ago

@garrettlondon1 I have worked with the implementation of a lot of websocket clients (including Rails Action Cable, AnyCable, etc.) and created a few myself. I took a look at the code and I see a few things going on here.

First, most disconnect events should result in WS disconnect events being raised, and the retry strategy should start immediately. This would include network changes, graceful server restarts, connection losses, etc.

Unfortunately the reconnect strategy is very naive. Instead of some variation of exponential backoff with full jitter, we see that there is a first attempt after 3s, followed immediately by falling back to 20s, which is the default retry interval in the options: https://github.com/dotnet/aspnetcore/blob/v8.0.1/src/Components/Web.JS/src/Platform/Circuits/DefaultReconnectionHandler.ts#L69-L72

In my experience, which probably mirrors yours, it's best to do an effectively immediate retry (or maybe after 50ms), followed by some quick but increasing retries, before falling back to an exponential backoff. So waiting 3s initially and then falling back to 20s is not a great UX.
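To make "exponential backoff with full jitter" concrete, here is a minimal sketch of the delay calculation. The function name, the 50ms base, and the 20s cap are assumptions picked for the example; nothing here is taken from DefaultReconnectionHandler.ts.

```javascript
// Illustrative retry schedule: exponential backoff with full jitter.
// All parameter names and default values are assumptions chosen for the example.
function retryDelayMs(
  attempt,
  { baseMs = 50, capMs = 20000, random = Math.random } = {}
) {
  // The ceiling doubles with each attempt until it reaches the cap.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  // Full jitter: pick uniformly in [0, ceiling) so clients don't retry in lockstep.
  return Math.floor(random() * ceiling);
}
```

With these defaults the first few attempts happen within tens to hundreds of milliseconds, and the ceiling reaches the 20s cap by the ninth attempt.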

I would like to see this strategy pluggable, but it seems there are some options for configuring this like so in App.razor:

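<script src="_framework/blazor.web.js" autostart="false"></script>
<script>
    // Start Blazor manually so that reconnection options can be supplied.
    Blazor.start({
        circuit: {
            reconnectionOptions: {
                maxRetries: 100,                // maximum number of reconnection attempts
                retryIntervalMilliseconds: 500  // interval between retry attempts
            }
        }
    });
</script>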

This will prevent Blazor from "auto" booting, then start it manually with reconnection options. It also appears an entire reconnection handler can be swapped in.

This doesn't address the situations where keepalive could help. Such as a communication disruption that doesn't result in a socket disconnect. However you may be able to detect this sooner via a global component that interacts with the server, and then by calling something like window.Blazor.reconnect() . I haven't investigated this thoroughly though.

Edit: I see this was pretty much mentioned in #32113 so not sure what value it adds. But I will say using "Offline" throttling has behavior atypical to a network disconnect. It will not actually disconnect the websocket connection, it just seems to eventually trigger the keep alive timeout.

vgallegob commented 9 months ago

This issue is geared towards solving one of the biggest and still current issues of Blazor Server itself: poor UI experience with socket disconnects.

I agree, the reconnects are very annoying. It would be nice to have more control over how the circuit gets shut down.

For example, I have a fixed maximum number of circuits (say 500 sessions). I would like to be able to leave the circuit alive on the server till reconnected. But I would not want SignalR to buffer messages, I think this would be very dangerous.

In my case a user cannot log in on 2 devices at the same time. This means I could basically kill circuits coming from an old device when a user logs in from a new device. And I could keep the new device's circuits alive indefinitely, or at least for some hours.

This is just some brainstorming on ways we could improve reconnects.

Johnhersh commented 9 months ago

Doesn't SignalR have a fallback mechanism to use just regular http requests if web sockets are not available?

garrettlondon1 commented 7 months ago

This doesn't address the situations where keepalive could help. Such as a communication disruption that doesn't result in a socket disconnect. However you may be able to detect this sooner via a global component that interacts with the server, and then by calling something like window.Blazor.reconnect() . I haven't investigated this thoroughly though.

@ProTip you hit the nail on the head here..

A communication disruption, which does not result in a socket disconnect, in InteractiveServer mode, continuously builds up UI events to send to the server at once upon reconnecting.

I don't believe this is preventable by a Blazor Server developer. If the user clicks the button 10 times, because nothing happened, now you have 10 calls to your DB

garrettlondon1 commented 7 months ago

@mkArtakMSFT I believe @danroth27 added the "Bug" label for a reason here, I don't agree that it is an enhancement:

Repro steps

1. Create a Blazor Server app and run it.
2. Browse to the app with the browser dev tools open.
3. On the networking tab in the browser dev tools, set the network throttling to offline.
4. Try to use the app.

This is a pretty basic bug I believe. When you disconnect, the reconnection modal does not appear, and the developer cannot improve the experience.

garrettlondon1 commented 6 months ago

Related #55127