NHAS / wag

Simple Wireguard 2FA
BSD 3-Clause "New" or "Revised" License
499 stars 27 forks

High Availability #24

Closed: lachlan2k closed this 4 months ago

lachlan2k commented 1 year ago

This issue is to discuss the ins-and-outs of making a highly-available Wag.

In general, it would be nice to have the ability to have Wag running on 2 different servers, for a highly available configuration, so if one server fails (or needs to be shut down), then operation can continue.

Since Wag relies on in-memory maps, it would be quite difficult to support an active-active configuration, so instead, allowing for an active-passive (failover) configuration would be nice.

During a failover condition, users would have to re-auth, but I think that's fine.

The first problem I see is that the SQLite database cannot be easily shared, so perhaps one of the first steps could be allowing for other databases (Postgres, MySQL..?)

I don't think it should be Wag's responsibility to direct traffic to the different instances. Instead, the administrator should use features of their networking equipment to perform failover, or use something like keepalived with VRRP.
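As a sketch of the keepalived approach, a minimal VRRP config might look like the following. All values (interface name, router ID, priorities, the VIP) are placeholders, not anything Wag ships:

```conf
# /etc/keepalived/keepalived.conf on the active node (placeholder values)
vrrp_instance WAG_VIP {
    state MASTER            # use BACKUP on the passive node
    interface eth0
    virtual_router_id 51
    priority 150            # lower on the passive node
    advert_int 1
    virtual_ipaddress {
        192.0.2.10/24       # the VIP that WireGuard clients connect to
    }
}
```

WireGuard clients then point at the VIP, and keepalived moves it to the passive node when the active one stops advertising.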

NHAS commented 1 year ago

Yes, I agree this feature is probably quite a good idea. To this end we could enable support for postgres, and postgres pubsub in order to share state updates (such as new devices and whatnot).

My main concern is that, if we were running this in active-passive, data could be encrypted twice with the same key+nonce, or we could hit other cryptographic issues. I'm not entirely certain whether using the same key on two different servers would get sad.

NHAS commented 1 year ago

I've decided that instead of using postgres as our single source of truth we'll instead use the raft consensus algorithm (specifically https://github.com/rqlite/rqlite, if it exposes some raft components so we can send our own state updates).

This will simplify setup and not require people to run databases themselves.

As our target audience is <1000 users, I think sqlite3 + raft will probably suit the performance requirements.

NHAS commented 1 year ago

On second thought, it seems that raft is pretty ill-suited for this task, considering that all reads and writes are transparently forwarded to the elected leader. While we would get database synchronization out of using raft, we'd still require either 3 or 5 nodes to get an effective leadership election, and performance is quite affected.

I know I'm going to regret saying this, but I think instead I'll look at writing something using an executor pattern, where each action in wag is done in one place and can then be "replicated" to other cluster members. This requires significant rejigging of internal structures, and initial sync becomes an issue.

paulb-smartit commented 1 year ago

I wonder if a simple lsyncd of the files, with an inotify on config.json changes that triggers a wag reload, would do the job just as well.

NHAS commented 1 year ago

Ah, not quite the HA I'm thinking of. While that would do a full state sync periodically and at cluster start, I want it to keep active users and firewall state in sync.

I will admit this is a bit of a flight of fancy. I just think it would be fun.

You can achieve naive HA by just running two wag instances with the same config and DB. It just becomes painful to manage, and users get logged out on failover.

NHAS commented 7 months ago

Slowly but surely we get closer to having the highly available wag of our dreams.

The current version that lives on unstable has migrated to using etcd as the underlying datastore away from sqlite3.

This should soon give us a path to use etcd to synchronise changes to the wireguard/ebpf firewall state, which will allow full HA.

Things left to do:

NHAS commented 7 months ago

More things to add to the pile, mostly UI, so it's causing me to grind to a halt:

Stretch: