High availability - GitHub issues

mostafa commented 1 year ago

The goal is to make GatewayD highly available (HA) by running a cluster of machines that connect to each other and serve clients. Plan and create tickets for each of the following features, then start implementing them.

[ ] Distributed state management (using gossip protocols)
[ ] High-availability
[ ] Fail-over (fault detection)
[ ] Clustering
[ ] Service mesh
[ ] Control plane?

Resources

sinadarbouy commented 3 weeks ago

For this issue, I think we can solve it by using github.com/hashicorp/raft (as mentioned in the issue description) to handle the state and coordination between nodes.

Here’s how I see it working:

Expose a Raft Port: We’ll need to open up an extra port for Raft. Then, during startup, all the nodes can connect and form a Raft cluster.
Single Raft Cluster for All Config Groups: Instead of having a separate Raft cluster for each configuration group, we can just have one for all of them. It should simplify things and reduce overhead.
Handling Stateful Parameters: We can store stateful parameters as key-value pairs, similar to how we handle them in the Redis plugin (configurationGroup-Configurationblock-Key). Raft will help ensure all nodes stay in sync with these values.
Fetching State Variables from Files: For things like connection counts, we can store them in a file and fetch them when needed. Since this usually happens in the OnOpen phase and during connection setup, performance shouldn’t be an issue.
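To make the key-value idea above concrete, here is a minimal sketch of how a state key and a replicated command could be built and serialized before being handed to Raft. Everything named here (`StateKey`, `Command`, the `set` operation, and the example group and block names) is hypothetical and only for illustration; the only thing taken from the discussion is the configurationGroup-Configurationblock-Key layout.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// StateKey builds a key in the configurationGroup-Configurationblock-Key
// layout. The function name is an assumption made for this sketch.
func StateKey(group, block, key string) string {
	return fmt.Sprintf("%s-%s-%s", group, block, key)
}

// Command is a hypothetical payload for one replicated log entry.
type Command struct {
	Op    string `json:"op"` // e.g. "set" or "delete" (assumed operations)
	Key   string `json:"key"`
	Value string `json:"value,omitempty"`
}

// EncodeCommand serializes a command into bytes that could be appended
// to a replicated log.
func EncodeCommand(c Command) ([]byte, error) {
	return json.Marshal(c)
}

// DecodeCommand is the inverse of EncodeCommand.
func DecodeCommand(b []byte) (Command, error) {
	var c Command
	err := json.Unmarshal(b, &c)
	return c, err
}

func main() {
	// "default", "pools" and "connections" are made-up example names.
	key := StateKey("default", "pools", "connections")
	data, err := EncodeCommand(Command{Op: "set", Key: key, Value: "10"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))
}
```

JSON is used here only because it is easy to inspect; any stable encoding would do.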

With this approach, if we have three instances of GatewayD running, they can all receive requests, but they’ll rely on Raft to fetch the stateful variables through a voting process, ensuring everything stays consistent before creating a connection between the client and the DB.
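As a rough illustration of why three instances is a sensible minimum: Raft needs a majority of voting nodes to agree, so the majority size and the number of tolerated failures follow directly from the cluster size. The helper names below are assumptions for this sketch, not part of any library.

```go
package main

import "fmt"

// Quorum returns the majority size for a cluster of n voting nodes.
func Quorum(n int) int {
	return n/2 + 1
}

// FaultTolerance returns how many nodes can fail while a majority
// of the original n voters is still reachable.
func FaultTolerance(n int) int {
	return n - Quorum(n)
}

// Describe formats the numbers for one cluster size.
func Describe(n int) string {
	return fmt.Sprintf("nodes=%d quorum=%d tolerated_failures=%d",
		n, Quorum(n), FaultTolerance(n))
}

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Println(Describe(n))
	}
}
```

With three instances the majority is two, so the cluster keeps working with one instance down; going from three to four nodes raises the majority to three without tolerating any additional failure.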

If this approach sounds good, I can start working on it.

mostafa commented 3 weeks ago

After some investigation, and given that the Gossip protocol libraries are old and unmaintained, I think the go-to approach is to use Raft, considering that Kafka also adopted a Raft-based protocol (KRaft) to move away from ZooKeeper. I think we should stick with simplicity and ease of use, as you also mentioned, rather than creating a Raft cluster per tenant. We can also consider storing the state variables in SQLite or ObjectBox.

Let's create another ticket and link it to this one.

sinadarbouy commented 3 weeks ago

I checked again, and it turns out we don’t need to store our state in a file. HashiCorp Raft can already persist the Raft logs in BoltDB (through the raft-boltdb store), which covers persistence and recovery. We can just use sync.Map to keep our state in memory since we’re only working with simple key-value data. Since we don’t have complex data, skipping a database like ObjectBox should be fine, as long as we rely on Raft for consistency and recovery.
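A minimal sketch of what such an in-memory state could look like, assuming committed log entries arrive as JSON commands. The hashicorp/raft FSM interface is built around Apply, Snapshot and Restore; the type below mirrors that shape in simplified form using only the standard library, and is not wired into the library. The `Store` type, the command format and the byte-slice snapshot are all assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// command is a hypothetical replicated log entry.
type command struct {
	Op    string `json:"op"` // "set" or "delete"
	Key   string `json:"key"`
	Value string `json:"value,omitempty"`
}

// Store keeps simple key-value state in memory using sync.Map.
type Store struct {
	data sync.Map // string -> string
}

// Apply decodes one committed entry and applies it to the state.
func (s *Store) Apply(entry []byte) error {
	var c command
	if err := json.Unmarshal(entry, &c); err != nil {
		return err
	}
	switch c.Op {
	case "set":
		s.data.Store(c.Key, c.Value)
	case "delete":
		s.data.Delete(c.Key)
	default:
		return fmt.Errorf("unknown op %q", c.Op)
	}
	return nil
}

// Get reads a value from the local in-memory state.
func (s *Store) Get(key string) (string, bool) {
	v, ok := s.data.Load(key)
	if !ok {
		return "", false
	}
	return v.(string), true
}

// Snapshot serializes the whole state so it can be persisted or shipped.
func (s *Store) Snapshot() ([]byte, error) {
	out := map[string]string{}
	s.data.Range(func(k, v interface{}) bool {
		out[k.(string)] = v.(string)
		return true
	})
	return json.Marshal(out)
}

// Restore replaces the current state with the contents of a snapshot.
func (s *Store) Restore(snapshot []byte) error {
	in := map[string]string{}
	if err := json.Unmarshal(snapshot, &in); err != nil {
		return err
	}
	s.data.Range(func(k, _ interface{}) bool {
		s.data.Delete(k)
		return true
	})
	for k, v := range in {
		s.data.Store(k, v)
	}
	return nil
}

func main() {
	var s Store
	entry := []byte(`{"op":"set","key":"default-pools-connections","value":"10"}`)
	if err := s.Apply(entry); err != nil {
		panic(err)
	}
	snap, err := s.Snapshot()
	if err != nil {
		panic(err)
	}
	var replica Store // stands in for a node recovering from a snapshot
	if err := replica.Restore(snap); err != nil {
		panic(err)
	}
	fmt.Println(replica.Get("default-pools-connections"))
}
```

Because the state is rebuilt from the log and snapshots, nothing in the map itself needs to be written to a separate database, which is the point made above.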

gatewayd-io / gatewayd

High availability #169
