MatrixAI / Polykey-CLI

Polykey CLI - Open Source Decentralized Secret Sharing System for Zero Trust Workflows
https://polykey.com
GNU General Public License v3.0

Node crashes when internet drops out #115

Open tegefaulkes opened 9 months ago

tegefaulkes commented 9 months ago

Describe the bug

I've found that the node crashes whenever the network cuts out. This is caused by the js-quic QUICSocket throwing an error on send when the network is down. Since we treat all socket errors as critical failures, the error is allowed to bubble up and crash the node.

Ideally we would handle this as a temporary failure and not crash.

[nix-shell:~/workspace/Polykey-CLI]$  npm run polykey -- agent start -np tmp/sedfg

> polykey-cli@0.1.3 polykey
> ts-node src/polykey.ts agent start -np tmp/sedfg

✔ Enter new password … ***
✔ Confirm new password … ***
(node:347627) [DEP0112] DeprecationWarning: Socket.prototype._handle is deprecated
(Use `node --trace-deprecation ...` to show where the warning was created)
pid             347627
nodeId          vovracdc7ctp21pam7enjo8agsm2jrfptjs5f6l83tm8nmq997cm0
clientHost      ::1
clientPort      46181
agentHost       ::
agentPort       53431
recoveryCode    REDACTED
ErrorQUICClientInternal: Failed to send data on the QUICSocket

To Reproduce

  1. Start a Polykey node.
  2. Disconnect from the network by disabling wifi and disconnecting any Ethernet cables.
  3. The node crashes.

Expected behavior

I think everything should keep working as normal, but connections and streams should time out if the network is down for longer than their timeout. Send failures caused by the network should therefore be treated as packet drops.
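As a rough illustration of the expected behavior, here is a minimal sketch of treating network-related send failures as dropped packets. It is not js-quic's implementation: `TRANSIENT_SEND_CODES`, `isTransientSendError`, `SendFn` and `sendOrDrop` are hypothetical names, and the set of errno codes considered transient is an assumption that would need checking against what the socket actually reports.

```typescript
// Hypothetical set of errno codes treated as temporary network failures.
// Which codes belong here is an assumption, not taken from js-quic.
const TRANSIENT_SEND_CODES: ReadonlySet<string> = new Set([
  'ENETUNREACH',
  'EHOSTUNREACH',
  'ENETDOWN',
  'EHOSTDOWN',
]);

// Returns true when the error carries one of the transient errno codes.
function isTransientSendError(err: unknown): boolean {
  if (typeof err !== 'object' || err === null) return false;
  const code = (err as { code?: unknown }).code;
  return typeof code === 'string' && TRANSIENT_SEND_CODES.has(code);
}

// Hypothetical shape of a low-level send function.
type SendFn = (data: Uint8Array) => Promise<void>;

// Resolves to true if the data was sent and false if it was dropped because
// of a transient network error. Any other error is rethrown as critical.
async function sendOrDrop(send: SendFn, data: Uint8Array): Promise<boolean> {
  try {
    await send(data);
    return true;
  } catch (err) {
    if (isTransientSendError(err)) return false; // behave like a lost packet
    throw err;
  }
}
```

With something like this in place, a dropped send would be recovered by QUIC's normal loss handling, or the connection would eventually hit its idle timeout, rather than the error reaching the top level.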

Platform (please complete the following information)

tegefaulkes commented 5 months ago

I've done some work addressing this; I just need to re-check whether it's still an issue.

tegefaulkes commented 5 months ago

I just ran a manual test where I turned off wifi and unplugged the wired connection. I observed that:

  1. The agent did not crash.
  2. The active connections timed out after a while.
  3. The seed connections were re-made after re-connecting the wired connection.

I did this twice without the node crashing. I'll do another quick test with switching networks.

tegefaulkes commented 5 months ago

When switching from wifi to my mobile tethered hotspot, the node crashed with the following.

❯ npm run polykey -- agent start -np ./tmp/test1

> polykey-cli@0.4.1 polykey
> ts-node src/polykey.ts agent start -np ./tmp/test1

✔ Please enter the password … ***
(node:3907665) [DEP0112] DeprecationWarning: Socket.prototype._handle is deprecated
(Use `node --trace-deprecation ...` to show where the warning was created)
pid             3907665
nodeId          voic41031s00o9rkq9tqekb58on1hfv9hvd31btegppml0pj1seg0
clientHost      ::1
clientPort      33769
agentHost       ::
agentPort       33496
(node:3907665) MaxListenersExceededWarning: Possible EventTarget memory leak detected. 11 abort listeners added to [AbortSignal]. Use events.setMaxListeners() to increase limit
Error: send ENETUNREACH ::ffff:3.145.86.40:1314

I'm not exactly sure what's happening here. I'm guessing we're either using a bad IP address or, with the wifi turned off, the network device starts returning ENETUNREACH when trying to connect.

Also, switching between networks behaves much like a dropout; it's far from seamless. To handle seamless network switching, we'd likely have to implement support for quiche's paths.

CMCDragonkai commented 5 months ago

What do you mean by paths support?

tegefaulkes commented 5 months ago

https://docs.rs/quiche/latest/quiche/struct.Connection.html#method.paths_iter

Quiche can make a connection to a peer with a single address, but in scenarios where we have multiple interfaces or are roaming between networks, quiche can switch between them seamlessly as separate paths.

Basically, it just allows the same connection to work over multiple network paths to the peer.
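To make the idea concrete, here is a purely conceptual sketch of one connection holding several paths and failing over between them. It is not quiche's or js-quic's API: `NetworkPath`, `MultipathConnection` and their methods are hypothetical names, and real path handling also involves path validation and connection IDs, which are left out here.

```typescript
// Conceptual model only; hypothetical types, not quiche's or js-quic's API.
interface NetworkPath {
  local: string; // local address, e.g. on a wifi or tethered interface
  remote: string; // peer address
  usable: boolean; // whether the path is currently believed to work
}

class MultipathConnection {
  private paths: NetworkPath[] = [];
  private active = -1; // index of the active path, -1 when there is none

  // Register a path; the first usable one becomes the active path.
  addPath(path: NetworkPath): void {
    this.paths.push(path);
    if (this.active === -1 && path.usable) this.active = this.paths.length - 1;
  }

  // Mark every path on a local address as unusable and, if the active path
  // was affected, fail over to the first path that is still usable.
  markDown(local: string): void {
    for (const p of this.paths) {
      if (p.local === local) p.usable = false;
    }
    if (this.active !== -1 && this.paths[this.active]?.usable !== true) {
      this.active = this.paths.findIndex((p) => p.usable);
    }
  }

  // The path currently used to reach the peer, if any.
  activePath(): NetworkPath | undefined {
    return this.active === -1 ? undefined : this.paths[this.active];
  }
}
```

In this model, losing wifi would only mark one path as down; the connection object itself stays alive as long as another path to the peer remains usable.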

amydevs commented 5 months ago

Another thing to note is that js-mdns does not currently handle the addition or removal of network interfaces, though that is not the cause of the above issue.