The current design uses Prometheus to obtain end-to-end latency and bandwidth metrics, as well as compute availability on remote compute nodes. This proposal seeks to decentralize the system so that it keeps working when Prometheus is unreachable. The trivial fallback is for clients to perform the network measurements themselves (which already happens implicitly, since libp2p nodes ping each peer). However, this can waste significant bandwidth if multiple clients within the same node are measuring peers in the same remote node.
As a middle ground, the LCA in each compute node can perform network measurements against the other LCAs and query them for compute resource availability. Clients can then query their local LCA to obtain the relevant metrics for a specific target LCA.
This takes the responsibility of performing network measurements out of the hands of individual clients, saving bandwidth.
The LCA should have a P2P endpoint that returns the available compute information.
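As a rough illustration of what such an endpoint might return, the sketch below defines a possible response payload and its wire encoding. Everything here is an assumption made for illustration: the `ComputeInfo` type, its field names, and the choice of JSON are hypothetical and not part of the current design.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ComputeInfo:
    # All field names are hypothetical; the proposal does not fix a schema.
    lca_id: str
    cpu_cores_free: float
    memory_bytes_free: int
    gpu_count_free: int
    timestamp_unix: float  # when the LCA sampled these values


def encode_compute_info(info: ComputeInfo) -> bytes:
    """Serialize the payload the endpoint would send back to a peer."""
    return json.dumps(asdict(info), sort_keys=True).encode("utf-8")


def decode_compute_info(payload: bytes) -> ComputeInfo:
    """Parse a payload received from a remote LCA."""
    return ComputeInfo(**json.loads(payload.decode("utf-8")))
```

Including a timestamp lets the receiving LCA judge how stale a cached entry is before handing it to a client.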
To limit traffic as the system scales, the LCA can retain metrics only for the n nearest LCAs (n is configurable).
This set will change as the topology changes due to mobility, network conditions, and compute availability. For example, if all n nearest LCAs represent nodes with no available compute, n can be dynamically increased. A more thorough plan or heuristic is likely needed here.
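One possible starting heuristic is sketched below, under stated assumptions: "nearest" is taken to mean lowest measured latency, and the tracked set is widened one peer at a time until at least one tracked LCA has free compute or a cap is reached. The names `PeerMetrics`, `select_tracked_peers`, and `max_n` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class PeerMetrics:
    lca_id: str
    latency_ms: float       # assumption: "nearest" means lowest latency
    has_free_compute: bool


def select_tracked_peers(
    peers: Sequence[PeerMetrics], n: int, max_n: int
) -> List[PeerMetrics]:
    """Return the n lowest-latency peers, growing n while none has free compute.

    Growth stops once a tracked peer has free compute, or when max_n
    (or the number of known peers) is reached.
    """
    ranked = sorted(peers, key=lambda p: p.latency_ms)
    limit = min(max_n, len(ranked))
    k = min(n, limit)
    while k < limit and not any(p.has_free_compute for p in ranked[:k]):
        k += 1
    return ranked[:k]
```

The cap `max_n` keeps the measurement traffic bounded even when compute is scarce everywhere; a real heuristic would likely also account for bandwidth and for how quickly the topology is changing.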
If the LCA within a local node is down, clients can fall back to performing measurements themselves.
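The client-side fallback could look roughly like the following sketch. The two callables stand in for whatever transport the client actually uses, and they, along with the function name, are hypothetical. Treating a `None` answer (the LCA has no data for the target) as a reason to measure directly is an extra assumption beyond the case where the LCA is down.

```python
from typing import Callable, Optional, Tuple


def get_latency_ms(
    target: str,
    query_local_lca: Callable[[str], Optional[float]],
    measure_directly: Callable[[str], float],
) -> Tuple[float, str]:
    """Prefer the local LCA's metrics; measure directly only if that fails.

    Returns the latency and which path produced it ("lca" or "direct").
    """
    try:
        result = query_local_lca(target)
        if result is not None:
            return result, "lca"
    except (ConnectionError, TimeoutError):
        # Local LCA is down or unresponsive: fall through to direct measurement.
        pass
    return measure_directly(target), "direct"
```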
We could potentially support both Prometheus and LCA measurements to further save bandwidth, i.e.:
The LCA queries Prometheus to obtain node-to-node latency and bandwidth info (when clients within the node query the LCA, the LCA just returns the info it obtained from Prometheus).
The LCA only starts its own measurements if it cannot contact Prometheus, which in most cases would also mean that clients within the same node cannot reach Prometheus either.
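The combined mode described above can be sketched as a single refresh cycle on the LCA. This is a minimal illustration, not a definitive implementation: `fetch_from_prometheus` and `measure_peer` are hypothetical callables, and signalling unreachability via `ConnectionError` or `TimeoutError` is an assumption.

```python
from typing import Callable, Dict, Iterable, Tuple


def refresh_metrics(
    peers: Iterable[str],
    fetch_from_prometheus: Callable[[Iterable[str]], Dict[str, float]],
    measure_peer: Callable[[str], float],
) -> Tuple[Dict[str, float], str]:
    """Run one refresh cycle and report which source supplied the metrics.

    The LCA measures peers itself only when Prometheus cannot be reached.
    """
    peer_list = list(peers)
    try:
        return dict(fetch_from_prometheus(peer_list)), "prometheus"
    except (ConnectionError, TimeoutError):
        # Prometheus unreachable: start the LCA's own measurements.
        return {p: measure_peer(p) for p in peer_list}, "self"
```

Returning the source alongside the metrics lets the LCA tell clients whether a value came from Prometheus or from its own probing, which may matter if the two are measured differently.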