Closed knolleary closed 9 months ago
Witnessed a very similar issue today while running verifications between a device and staging:
[AGENT] 09/01/2024 15:21:14 [info] Editor tunnel closed code=1005 reason=
[AGENT] 09/01/2024 15:21:14 [info] Connecting editor tunnel to wss://staging/api/v1/devices/abcdefg/editor/comms/ffde_EjxwLzQyM9_2Pa5ZSc1ujoUIeOK7ysqVw
C:\Users\Stephen\repos\github\flowfuse\dev-env\packages\device-agent\node_modules\ws\lib\websocket.js:442
throw new Error('WebSocket is not open: readyState 0 (CONNECTING)');
^
Error: WebSocket is not open: readyState 0 (CONNECTING)
at WebSocket.send (C:\Users\Stephen\repos\github\flowfuse\dev-env\packages\device-agent\node_modules\ws\lib\websocket.js:442:13)
at WebSocket.<anonymous> (C:\Users\Stephen\repos\github\flowfuse\dev-env\packages\device-agent\lib\editor\tunnel.js:105:46)
at WebSocket.emit (node:events:527:28)
at Receiver.receiverOnMessage (C:\Users\Stephen\repos\github\flowfuse\dev-env\packages\device-agent\node_modules\ws\lib\websocket.js:1184:20)
at Receiver.emit (node:events:527:28)
at Receiver.dataMessage (C:\Users\Stephen\repos\github\flowfuse\dev-env\packages\device-agent\node_modules\ws\lib\receiver.js:541:14)
at Receiver.getData (C:\Users\Stephen\repos\github\flowfuse\dev-env\packages\device-agent\node_modules\ws\lib\receiver.js:459:17)
at Receiver.startLoop (C:\Users\Stephen\repos\github\flowfuse\dev-env\packages\device-agent\node_modules\ws\lib\receiver.js:158:22)
at Receiver._write (C:\Users\Stephen\repos\github\flowfuse\dev-env\packages\device-agent\node_modules\ws\lib\receiver.js:84:10)
at writeOrBuffer (node:internal/streams/writable:389:12)
While not identical, it backs up the notion that we need additional error handling in the agent's EditorTunnel.socket callbacks.
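As an illustration of the kind of guard meant here, a minimal sketch follows. It assumes only what the stack trace shows: `send()` throws when the socket is not yet open (readyState 0, CONNECTING). The helper name `safeSend` and its `onDrop` callback are hypothetical and are not taken from the device agent or from the `ws` library.

```javascript
// Hypothetical helper, not part of the device agent or the ws library.
// 1 is the numeric value of the OPEN ready state in the WebSocket API.
const WS_OPEN = 1

/**
 * Send data only when the socket is open, and never throw.
 * Returns true if the data was handed to the socket, false if it was dropped.
 * onDrop (assumed callback) receives the reason so the caller can log it.
 */
function safeSend (socket, data, onDrop = () => {}) {
    if (!socket || socket.readyState !== WS_OPEN) {
        const state = socket ? socket.readyState : 'n/a'
        onDrop(new Error(`socket not open (readyState ${state})`))
        return false
    }
    try {
        socket.send(data)
        return true
    } catch (err) {
        // Guard against any other synchronous error raised by send()
        onDrop(err)
        return false
    }
}
```

Inside a message callback, a call such as `safeSend(socket, payload, err => console.warn(err.message))` would log and drop the message instead of crashing the process while the tunnel is reconnecting.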
I have confirmed that when a device has an open tunnel and the core application is updated/restarted, this can occur.
I can freely recreate this by killing the FF core (no SIGINT, and thus no cleanup and no WS close event).
Containment PR incoming
Verified against staging.
3x devices tunnel connected, staging restarted
NOTE: the tunnel is still closed (due to core shutting down) but the devices survived/no longer crashed with readyState error.
@Steve-Mcl I opened an issue yesterday after testing this locally, as it looks like the device agent is meant to retry with a backoff but only tries once (with only a 0.5 second delay). https://github.com/FlowFuse/device-agent/issues/222
@hardillb Yeah, I saw this (which is why I wrote that comment as a nod to your finding). However, I think that particular retry logic is for recovering from minimal network dropouts, Ben.
We have not written any code to recover from a core restart (as new tokens would need to be generated and passed back to the agent).
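For reference, the retry-with-backoff behaviour discussed above could look roughly like the sketch below. It is not the device agent's actual retry code: `backoffDelay`, `retryWithBackoff`, the 500 ms base (chosen to match the 0.5 second delay mentioned above), the 30 second cap, and the attempt count are all assumptions for illustration.

```javascript
// Hypothetical helpers; names and defaults are assumptions, not the agent's code.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Delay before the next attempt: baseMs for attempt 0, doubling each time, capped at maxMs.
function backoffDelay (attempt, baseMs = 500, maxMs = 30000) {
    return Math.min(maxMs, baseMs * 2 ** attempt)
}

// Call connect() until it resolves or maxAttempts is reached, sleeping between failures.
async function retryWithBackoff (connect, options = {}) {
    const { maxAttempts = 5, baseMs = 500, maxMs = 30000, sleep = defaultSleep } = options
    let lastError
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
            return await connect(attempt)
        } catch (err) {
            lastError = err
            // No need to wait after the final failed attempt
            if (attempt < maxAttempts - 1) {
                await sleep(backoffDelay(attempt, baseMs, maxMs))
            }
        }
    }
    throw lastError
}
```

As noted in the reply above, a loop like this would only cover brief network dropouts. Recovering from a core restart would additionally need a fresh token to be generated and passed back to the agent before reconnecting.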
Current Behavior
Reported by a user. Happened once, then reconnected without further issue.
Expected Behavior
No response
Steps To Reproduce
Currently unknown
Environment