comrade-coop / apocryph

A decentralized compute marketplace for running pods securely and confidentially
https://apocryph.network/
GNU General Public License v3.0

Spike: Implement KV store on top of go-libp2p-raft #33

Closed branimirangelov closed 1 month ago

branimirangelov commented 1 month ago

Within the current PoC, implement an "autonomous-like" application based on the go-libp2p-raft library that maintains a KV store. The scope is to deploy it as an end-user app and test it across two separate dev clusters.

Important constraints:

  1. The application should be able to run externally (independently).
  2. The application should be packaged and installed with the tooling.
  3. The application should not require any privileged access.
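
A minimal sketch of what such an application could look like, assuming the whole-state replication API shown in the go-libp2p-raft README (`NewConsensus`, `NewLibp2pTransport`, `NewActor`, `CommitState`). The `kvState`, `newKVNode` and `set` names, the in-memory stores, and the single-node bootstrap are illustrative assumptions, not the actual PoC wiring:

```go
package main

import (
	"fmt"
	"time"

	hraft "github.com/hashicorp/raft"
	"github.com/libp2p/go-libp2p"
	libp2praft "github.com/libp2p/go-libp2p-raft"
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
)

// kvState is the replicated state: go-libp2p-raft's state-based consensus
// serializes the whole object and replicates it on every commit.
type kvState struct {
	Store map[string]string
}

// newKVNode wires a libp2p host into hashicorp/raft via the go-libp2p-raft
// transport and FSM, and bootstraps a cluster made of the given peer IDs.
func newKVNode(h host.Host, peers []peer.ID) (*libp2praft.Consensus, error) {
	consensus := libp2praft.NewConsensus(&kvState{Store: map[string]string{}})

	transport, err := libp2praft.NewLibp2pTransport(h, 2*time.Second)
	if err != nil {
		return nil, err
	}

	config := hraft.DefaultConfig()
	config.LocalID = hraft.ServerID(h.ID().String())

	// In-memory stores keep the sketch short; the real app would use
	// persistent log/stable/snapshot stores on a pod volume.
	logStore := hraft.NewInmemStore()
	snapshots := hraft.NewInmemSnapshotStore()

	r, err := hraft.NewRaft(config, consensus.FSM(), logStore, logStore, snapshots, transport)
	if err != nil {
		return nil, err
	}

	// Bootstrap with all known peers as voters (at least 3 for a useful quorum).
	servers := make([]hraft.Server, 0, len(peers))
	for _, pid := range peers {
		servers = append(servers, hraft.Server{
			Suffrage: hraft.Voter,
			ID:       hraft.ServerID(pid.String()),
			Address:  hraft.ServerAddress(pid.String()),
		})
	}
	r.BootstrapCluster(hraft.Configuration{Servers: servers})

	consensus.SetActor(libp2praft.NewActor(r))
	return consensus, nil
}

// set commits a new state containing the updated key; only the current
// leader's actor will accept the commit.
func set(c *libp2praft.Consensus, key, value string) error {
	next := &kvState{Store: map[string]string{}}
	if cur, err := c.GetCurrentState(); err == nil {
		for k, v := range cur.(*kvState).Store {
			next.Store[k] = v
		}
	}
	next.Store[key] = value
	_, err := c.CommitState(next)
	return err
}

func main() {
	h, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	// The real app would learn the other peers' IDs/addresses from the
	// tooling; here we bootstrap with ourselves only, for illustration.
	consensus, err := newKVNode(h, []peer.ID{h.ID()})
	if err != nil {
		panic(err)
	}
	time.Sleep(3 * time.Second) // crude: wait for leader election in this sketch
	if err := set(consensus, "greeting", "hello"); err != nil {
		fmt.Println("commit failed (not leader yet?):", err)
	}
	if state, err := consensus.GetCurrentState(); err == nil {
		fmt.Println(state.(*kvState).Store)
	}
}
```

go-libp2p-raft also offers an operation-log mode (closer to how hraftd applies individual commands), which would avoid re-serializing the whole map on every write; evaluating which mode fits better could be part of this spike.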

References:

  1. GitHub of go-libp2p-raft: https://github.com/libp2p/go-libp2p-raft
  2. GitHub of sample KV store on top of HashiCorp raft: https://github.com/otoolep/hraftd
  3. Video from the author of the sample KV store: https://www.youtube.com/watch?v=8XbxQ1Epi5w
  4. GitHub of HashiCorp raft itself: https://github.com/hashicorp/raft

Note: This spike is part of the Autoscaler autonomous application effort.

revoltez commented 1 month ago

Minor note: it needs to be tested on at least 3 dev clusters, since 2 nodes don't form a quorum, which makes Raft halt. From HashiCorp:

Lastly, there is the issue of updating the peer set when new servers are joining or existing servers are leaving. As long as a quorum of nodes is available, this is not an issue as Raft provides mechanisms to dynamically update the peer set. If a quorum of nodes is unavailable, then this becomes a very challenging issue. For example, suppose there are only 2 peers, A and B. The quorum size is also 2, meaning both nodes must agree to commit a log entry. If either A or B fails, it is now impossible to reach quorum. This means the cluster is unable to add, or remove a node, or commit any additional log entries. This results in unavailability. At this point, manual intervention would be required to remove either A or B, and to restart the remaining node in bootstrap mode.
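
A tiny illustration of the quorum arithmetic behind this (not project code, just the standard Raft formula quorum = floor(N/2) + 1):

```go
package main

import "fmt"

func main() {
	// For each cluster size, print the quorum and how many failures it tolerates.
	// With N = 2 the quorum is 2, so losing either node halts the cluster;
	// with N = 3 the quorum is still 2, so one node can fail safely.
	for n := 1; n <= 5; n++ {
		quorum := n/2 + 1
		fmt.Printf("cluster size %d: quorum %d, tolerates %d failure(s)\n", n, quorum, n-quorum)
	}
}
```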

branimirangelov commented 1 month ago

In the current PoC, the Kubernetes clusters use public IP-based ingress. Given this setup, the libp2p transport is less practical than a simpler HTTP (gRPC) transport. For other reasons as well (e.g. remote attestation at the application layer), the Base Protocol should include a basic (rudimentary) naming system that allows the HTTP-based transport to discover public IPs; full DNS-based discovery will be the responsibility of the autoscaler itself. The use of libp2p will become beneficial later, once we introduce Apocryph nodes (Kubernetes clusters) that do not have public IPs for ingress.
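
For reference, a hedged sketch of what the simpler transport could look like, using hashicorp/raft's built-in TCP transport in place of the libp2p one (a gRPC transport would slot into the same position in `hraft.NewRaft`); the bind and advertise addresses are purely illustrative:

```go
package rafttransport

import (
	"net"
	"os"
	"time"

	hraft "github.com/hashicorp/raft"
)

// buildTCPTransport replaces the libp2p transport with hashicorp/raft's plain
// TCP transport, which can run over the clusters' public IP-based ingress.
func buildTCPTransport() (*hraft.NetworkTransport, error) {
	advertise := &net.TCPAddr{IP: net.ParseIP("203.0.113.10"), Port: 12000}
	return hraft.NewTCPTransport(
		"0.0.0.0:12000", // bind address inside the pod
		advertise,       // address other clusters reach through the ingress
		3,               // connection pool size
		10*time.Second,  // I/O timeout
		os.Stderr,       // log output
	)
}
```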

The basic naming system will rely on IPFS (the underlying DHT) to provide basic Apocryph Node discovery, which will enable the autoscaler to deploy itself on various nodes and generate a list of its instances. If a hardware provider (or another entity) decides to launch an Autoscaler instance and wants to join it to the Autoscaler cluster, that instance will need to know the name (IP) of at least one existing Autoscaler instance in order to initiate the negotiation process for joining the cluster.
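
The naming system itself is not specified here; the following is only a sketch of how the IPFS/libp2p Kademlia DHT could be used to advertise and discover Autoscaler instances under a shared rendezvous key. The `apocryph/autoscaler/v0` namespace and the overall flow are hypothetical assumptions for illustration:

```go
package main

import (
	"context"
	"fmt"

	"github.com/libp2p/go-libp2p"
	dht "github.com/libp2p/go-libp2p-kad-dht"
	drouting "github.com/libp2p/go-libp2p/p2p/discovery/routing"
	dutil "github.com/libp2p/go-libp2p/p2p/discovery/util"
)

const rendezvous = "apocryph/autoscaler/v0" // hypothetical namespace

func main() {
	ctx := context.Background()

	h, err := libp2p.New()
	if err != nil {
		panic(err)
	}

	// Join the DHT; a real node would also connect to bootstrap peers
	// (e.g. dht.DefaultBootstrapPeers) before advertising.
	kadDHT, err := dht.New(ctx, h)
	if err != nil {
		panic(err)
	}
	if err := kadDHT.Bootstrap(ctx); err != nil {
		panic(err)
	}

	// Announce ourselves and search for other Autoscaler instances.
	rd := drouting.NewRoutingDiscovery(kadDHT)
	dutil.Advertise(ctx, rd, rendezvous)

	peers, err := rd.FindPeers(ctx, rendezvous)
	if err != nil {
		panic(err)
	}
	for p := range peers {
		if p.ID == h.ID() || len(p.Addrs) == 0 {
			continue
		}
		fmt.Println("found candidate Autoscaler instance:", p.ID, p.Addrs)
	}
}
```

A newly launched instance could use such a lookup to obtain the name (IP) of at least one existing instance and then start the join negotiation over the regular transport.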