dxos-deprecated / mesh

GNU Affero General Public License v3.0

Network stack diagnostic tool (aka ping) #23

Open dboreham opened 3 years ago

dboreham commented 3 years ago

Requirement

Make a tool that either proves the DXOS network is functioning, or provides diagnostic data that allows any failure to be isolated to a specific layer or network component. Similar to the ping and traceroute tools for IP networks. Note that the capability may be provided as embedded functionality inside an application, and not as a standalone tool, if that provides the best user experience.

Use cases

  1. User runs a DXOS SDK browser-hosted application (e.g. teamwork or arena) from a Kube. They interact with a collaborator who loaded their application from a different Kube. Application state updates from the collaborator seem to stall (and vice versa). Diagnose the cause.
  2. User runs the wire CLI tool to spawn a bot on a Kube. No confirmation of successful bot spawning is received. Diagnose the cause.

Network failure diagnosis

Diagnosis requires a white-box analysis of the end-to-end path for each type of network interaction supported by the system:

Party invitation redemption

In order to successfully redeem an invitation, the local node needs to establish a hypercore peer connection with a greeter node. Greeter nodes are associated with an invitation swarm key. Greeters can only be found if the local node has a functioning signal service connection. Once a set of one or more potential greeters has been found, a hypercore transport connection must be successfully established.

Party state replication

In order to replicate party state, the local node needs to have one or more open and functioning hypercore peer connections. Peers are associated with the party swarm key. Peers can only be found if the local node has a functioning signal service connection. Once a set of one or more potential peers has been found, a hypercore transport connection must be successfully established.

Bot spawn

In order to successfully spawn a bot, the local node needs to establish a hypercore peer connection with the bot factory node. The bot factory has a unique swarm key. The bot factory can only be found if the local node has a functioning signal service connection. Once the bot factory node has been found, a hypercore transport connection must be successfully established.
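
The three interactions above share the same path: signal service, peer discovery by swarm key, then transport connection. As a rough sketch of how a diagnostic could report the first stage that did not complete, assuming hypothetical stage and interaction names (none of these identifiers come from the DXOS codebase):

```typescript
// Hypothetical model of the shared connection path; names are illustrative only.
type Stage = 'signal-connected' | 'peers-discovered' | 'transport-established';

// Stages in the order they must complete.
const STAGES: Stage[] = ['signal-connected', 'peers-discovered', 'transport-established'];

type Interaction = 'invitation-redemption' | 'party-replication' | 'bot-spawn';

// Which kind of swarm key each interaction looks up via the signal service.
const SWARM_KEY_KIND: Record<Interaction, string> = {
  'invitation-redemption': 'invitation swarm key',
  'party-replication': 'party swarm key',
  'bot-spawn': 'bot factory swarm key'
};

interface PathReport {
  interaction: Interaction;
  swarmKeyKind: string;
  failedStage: Stage | null; // null means every stage completed
}

// Walks the stages in order and reports the first one that did not complete.
function diagnosePath(interaction: Interaction, completed: ReadonlySet<Stage>): PathReport {
  const failedStage = STAGES.find((stage) => !completed.has(stage)) ?? null;
  return { interaction, swarmKeyKind: SWARM_KEY_KIND[interaction], failedStage };
}

// Usage: the signal service is reachable, but the bot factory was never found.
const report = diagnosePath('bot-spawn', new Set<Stage>(['signal-connected']));
console.log(report.failedStage); // peers-discovered
```

Treating the path as an ordered list keeps the report focused on the lowest failing layer, which is the layer a user would need to fix first.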

Failure modes

The network interactions above share a number of common failure modes:

  1. No TCP/IP network.
  2. No open connection to signal service.
  3. Signal service host name DNS failure.
  4. Signal service connection rejected.
  5. Signal service connection failed with a server error (e.g. HTTP 500).
  6. Signal service connected but not responding.
  7. Signal service connected and responding but no available peers for swarm key.
  8. Signal service provides one or more peers for swarm key but no peer connections succeed.
  9. Peer connection fails due to WebRTC stack not functioning.
  10. Peer connection fails due to no response from peer.
  11. Peer connection fails due to ICE failure (e.g. TURN required but no TURN service available).
  12. Peer connection established but subsequently closed.
  13. Peer connection established but no data propagated.
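
One possible way to turn this list into something a tool can evaluate is to check observations from the lowest layer upward and return the first matching failure mode. The sketch below assumes a hypothetical `Observation` snapshot; the field names are invented for illustration, and how each value would actually be collected is left open.

```typescript
// Hypothetical observation snapshot; the field names are illustrative, not a DXOS API.
interface Observation {
  networkUp: boolean;
  signalDnsResolved: boolean;
  signalConnection: 'open' | 'rejected' | 'server-error' | 'closed';
  signalResponding: boolean;
  peersOffered: number;   // peers returned by the signal service for the swarm key
  peersConnected: number; // peer connections currently open
  peersDropped: number;   // peer connections that opened and later closed
  dataReceived: boolean;
}

// Returns the number of the first matching failure mode from the list above,
// or null when no failure is detected. Modes 9-11 are sub-causes of mode 8 and
// would need WebRTC-level detail that this sketch does not model.
function classifyFailure(o: Observation): number | null {
  if (!o.networkUp) return 1;
  if (!o.signalDnsResolved) return 3;
  if (o.signalConnection === 'rejected') return 4;
  if (o.signalConnection === 'server-error') return 5;
  if (o.signalConnection !== 'open') return 2;
  if (!o.signalResponding) return 6;
  if (o.peersOffered === 0) return 7;
  if (o.peersConnected === 0) return o.peersDropped > 0 ? 12 : 8;
  if (!o.dataReceived) return 13;
  return null;
}

// Usage: start from a healthy snapshot and vary one field.
const healthy: Observation = {
  networkUp: true,
  signalDnsResolved: true,
  signalConnection: 'open',
  signalResponding: true,
  peersOffered: 2,
  peersConnected: 1,
  peersDropped: 0,
  dataReceived: true
};
console.log(classifyFailure({ ...healthy, peersOffered: 0 })); // 7
```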

Diagnostic tool functionality

  1. Display signal service connection status (open/responding, not open/failed, last time succeeded etc).
  2. Display peer connection status (open, connecting, failed, etc).
  3. Heartbeat functionality on signal service connections.
  4. Heartbeat functionality on p2p WebRTC connections.
  5. Display last time data received from a peer.
  6. Display heartbeat status for each p2p connection.
  7. Display heartbeat status for signal service.
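
The heartbeat and "last time data received" items could share one small bookkeeping structure. The following is a minimal sketch, assuming a hypothetical `HeartbeatTracker` class, connection ids chosen by the caller, and a fixed timeout; it only tracks timestamps and does not send heartbeats itself.

```typescript
type HeartbeatStatus = 'never-seen' | 'alive' | 'stale';

// Hypothetical helper: records when each connection was last heard from.
class HeartbeatTracker {
  private readonly lastSeen = new Map<string, number>();
  private readonly timeoutMs: number;
  private readonly now: () => number;

  // The clock is injectable so the tracker can be exercised without real waiting.
  constructor(timeoutMs: number, now: () => number = Date.now) {
    this.timeoutMs = timeoutMs;
    this.now = now;
  }

  // Call whenever a heartbeat reply (or any data) arrives on a connection.
  record(connectionId: string): void {
    this.lastSeen.set(connectionId, this.now());
  }

  // Timestamp of the last heartbeat or data, or undefined if none was seen.
  lastSeenAt(connectionId: string): number | undefined {
    return this.lastSeen.get(connectionId);
  }

  // 'alive' if something was received within the timeout, otherwise 'stale'.
  status(connectionId: string): HeartbeatStatus {
    const seen = this.lastSeen.get(connectionId);
    if (seen === undefined) return 'never-seen';
    return this.now() - seen <= this.timeoutMs ? 'alive' : 'stale';
  }
}

// Usage with a fake clock (milliseconds).
let clock = 0;
const tracker = new HeartbeatTracker(5000, () => clock);
tracker.record('signal');
clock = 3000;
console.log(tracker.status('signal')); // alive
clock = 9000;
console.log(tracker.status('signal')); // stale
console.log(tracker.status('peer-1')); // never-seen
```

The same tracker instance could serve both the signal service connection and each p2p connection, keyed by connection id.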

alexwykoff commented 3 years ago

https://github.com/dxos/mesh/issues/28 will likely land in a couple days. Let's take advantage of that as we build out this tool.