Undesirable reconnection behaviour

Steve-Mcl commented 1 month ago

Current Behavior

While working on https://github.com/FlowFuse/node-red-dashboard/pull/1343 I noted we manually attempt to recover dashboard connection using fixed times (no jitter) and if that is unsuccessful it performs a full refresh.

In many settings (industrial/professional/signage/etc) this is undesirable. consider the following:

50 displays showing dashboards
Server maintenance is required - server is taken offline for 10 minutes
All 50 screens attempt a refresh and end up showing an ERR_NAME_NOT_RESOLVED
- "This site can’t be reached. xxxxxxxxx.com’s server IP address could not be found. Try: Checking the connection, Checking the proxy, firewall and DNS configuration. ERR_NAME_NOT_RESOLVED"
- These now require manual intervention

Expected Behavior

To use the WS built in capabilities for auto reconnect (e.g. making use of randomizationFactor for the exponential backoff jitter)

Steps To Reproduce

disconnect or power off the server and wait until the dashboard gives up reconnection attempts.

Environment

Dashboard version: 1.17.x (current)
Node-RED version: all
Node.js version: all
npm version: all
Platform/OS: all
Browser: all

Have you provided an initial effort estimate for this issue?

I have provided an initial effort estimate

colinl commented 1 month ago

we manually attempt to recover dashboard connection using fixed times

Can you point to that code please?

Steve-Mcl commented 1 month ago

we manually attempt to recover dashboard connection using fixed times

Can you point to that code please?

https://github.com/FlowFuse/node-red-dashboard/blob/c1b9c207ee39e1415aaca1e57cc84d1ee92164d4/ui/src/main.mjs#L192-L193

colinl commented 1 month ago

There is a subtlety to that code that is not entirely obvious. It actually uses a retry count to determine when to retry. This is important, particularly in the case of Android devices. If the dashboard is left in the background in Android then eventually the page is suspended. The result is that the connection is dropped. It is important that when the page is brought to the foreground that it retries quickly, even if it has been hours since the connection was dropped. Using a retry count, rather than a timer, achieves that.

Steve-Mcl commented 1 month ago

Yes, I understand that. My beef is with the full page refresh. I believe there is a halfway house (or combined logic) between the manual retry logic and the built in retry logic that socketio provides without doing a full page refresh. Yes I understand the parse error issue and the recovery attempt via refresh (and we may keep that) but blindly refreshing the page without the server even being alive can result in the situation I've described.

What I was thinking of including was a call for HEAD to the server. If the server is alive, permit the refresh. Otherwise keep trying.

Additionally, utilise the page lifecycle events to do a quicker recovery: https://developer.chrome.com/docs/web-platform/page-lifecycle-api#new_features_added_in_chrome_68

Mostly, I want to avoid hitting "ERR_NAME_NOT_RESOLVED" at which point every dashboard requires manual intervention.

colinl commented 1 month ago

Understood. I don't know whether it is a further complication if a proxy is involved. For example I use Cloudflare Zero Trust on some systems. I am not sure what happens on those if the server is not available.

I want to avoid hitting "ERR_NAME_NOT_RESOLVED" at which point every dashboard requires manual intervention.

This is not necessarily true as at least some browsers automatically recover when I stop the server and then restart it, though I am not sure that is the same error. It recovers with Edge on a PC and Chrome on an old Android tablet. Even so, it agree it would be much better to avoid the situation in the first place.

Is whatever mechanism is used in D1 relevant here?

FlowFuse / node-red-dashboard