TwiN / gatus

⛑ Automated developer-oriented status page
https://gatus.io
Apache License 2.0

High availability mode #176

Open TwiN opened 3 years ago

TwiN commented 3 years ago

This feature would allow more than one replica of Gatus with the exact same configuration to coexist by leveraging leader election through the new postgres storage.type.

Programmatically, this is how I envision it to work:

  1. First instance of Gatus, henceforth G1, starts.
  2. G1 tries to acquire the lock by querying the new instance table in the Postgres database.
  3. Because the row specifying whether an instance has claimed the role of leader does not exist yet, G1 creates a row with the column label set to default, the role set to LEADER and the last_heartbeat set to CURRENT_TIMESTAMP.
  4. G1 is now the leader, therefore it begins monitoring the services configured.
  5. Every minute, G1 updates the timestamp in the Postgres database.
  6. Second instance of Gatus, henceforth G2, starts.
  7. G2 tries to acquire the writer lock by querying the instance table in the Postgres database for the label default and the role LEADER.
  8. G2 fails to acquire the lock, because another instance has already acquired it and the last_heartbeat timestamp is within the past 5 minutes. These 5 minutes shall be defined as the time until reelection.
  9. G2 tries to acquire the writer lock every 2 minutes.
  10. Now, let's assume that G1 runs into an issue and crashes.
  11. G1 restarts, tries to acquire the lock, but as documented by step 8, it fails.
  12. 5 minutes go by and the time for reelection has come, after which either G1 or G2 will grab the lock.

During this entire time, both G1 and G2 can read from the database and therefore handle HTTP requests. The only restriction is that, for a given label, no more than one leader can write at any given time.
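
For illustration, here is a minimal sketch (in Go, against the proposed instance table) of what the lock acquisition and heartbeat from the steps above could look like. The table and column names (instance, label, role, last_heartbeat) come from the steps above; the SQL itself, the function names, and the unique constraint on label are assumptions rather than a final design:

```go
package main

import (
	"database/sql"
	"time"

	_ "github.com/lib/pq" // Postgres driver; any database/sql-compatible driver works
)

// tryAcquireLeadership attempts to become the leader for the given label.
// It claims the row if it does not exist yet, or takes it over when the
// current leader's last_heartbeat is older than the reelection timeout.
// Assumes a unique constraint on instance.label.
func tryAcquireLeadership(db *sql.DB, label string, reelectionTimeout time.Duration) (bool, error) {
	result, err := db.Exec(`
		INSERT INTO instance (label, role, last_heartbeat)
		VALUES ($1, 'LEADER', NOW())
		ON CONFLICT (label) DO UPDATE
		SET last_heartbeat = NOW()
		WHERE instance.last_heartbeat < NOW() - ($2 * INTERVAL '1 second')`,
		label, int(reelectionTimeout.Seconds()),
	)
	if err != nil {
		return false, err
	}
	rowsAffected, err := result.RowsAffected()
	if err != nil {
		return false, err
	}
	// One affected row means we either created the leader row or took it over
	// after the previous leader's heartbeat expired; zero means another
	// instance still holds the lock.
	return rowsAffected == 1, nil
}

// heartbeat refreshes last_heartbeat so the other instances do not trigger a
// reelection. A real implementation would also record an instance identifier
// so that only the current leader can refresh the heartbeat.
func heartbeat(db *sql.DB, label string) error {
	_, err := db.Exec(`UPDATE instance SET last_heartbeat = NOW() WHERE label = $1`, label)
	return err
}
```

Per steps 5 and 9 above, the leader would call heartbeat every minute while non-leaders would retry tryAcquireLeadership every 2 minutes.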

```yaml
distributed:
  mode: HA
  label: default
```

The parameter distributed.label is optional, and will default to the value default.

Why do we need a label?

This will be needed for #64 -- basically, let's say you wanted to deploy Gatus in 3 isolated environments which all have access to the Postgres database; let's call them alpha, bravo and charlie. Of course, each environment has its own set of services to monitor.

You'd use the label to differentiate these environments and allow one leader per environment to push its data to the database, all while allowing each separate environment to be highly available.
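
As an illustration, each environment's configuration could then pin its own label (the environment names are just the hypothetical examples above, and these keys are part of the proposal, not an existing option):

```yaml
# Configuration deployed in the "alpha" environment;
# bravo and charlie would each use their own label.
distributed:
  mode: HA
  label: alpha
```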

Requirements:

guillomep commented 3 years ago

Could you make HA available without the use of a database?

If we know in advance the endpoints (IPs) of all Gatus instances, we could simply list them in the configuration and they could elect a leader by talking to each other. One well-known algorithm to do that is Raft: https://raft.github.io/
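
Purely as an illustration of this suggestion (not something Gatus supports today), a minimal single-node sketch with the hashicorp/raft library could look like the following; a real deployment would replace the in-memory stores/transport with a TCP transport and the peer list from the configuration file, and the FSM is a no-op because only the leader flag matters for deciding who monitors:

```go
package main

import (
	"io"
	"log"
	"time"

	"github.com/hashicorp/raft"
)

// noopFSM satisfies raft.FSM; nothing needs to be replicated through the log
// for plain leader election, so the state machine does nothing.
type noopFSM struct{}

func (noopFSM) Apply(*raft.Log) interface{}         { return nil }
func (noopFSM) Snapshot() (raft.FSMSnapshot, error) { return nil, nil }
func (noopFSM) Restore(io.ReadCloser) error         { return nil }

func main() {
	config := raft.DefaultConfig()
	config.LocalID = raft.ServerID("gatus-1") // would come from the Gatus configuration

	// In-memory stores and transport keep the sketch self-contained.
	store := raft.NewInmemStore()
	snapshots := raft.NewInmemSnapshotStore()
	addr, transport := raft.NewInmemTransport("")

	r, err := raft.NewRaft(config, noopFSM{}, store, store, snapshots, transport)
	if err != nil {
		log.Fatal(err)
	}
	// Bootstrap a single-node cluster; with several instances, the peer list
	// from the configuration would be used instead.
	r.BootstrapCluster(raft.Configuration{
		Servers: []raft.Server{{ID: config.LocalID, Address: addr}},
	})

	for range time.Tick(10 * time.Second) {
		if r.State() == raft.Leader {
			log.Println("leader: run the monitoring loop")
		} else {
			log.Println("follower: serve the dashboard but do not monitor")
		}
	}
}
```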

BrianInAz commented 2 years ago

> Could you make HA available without the use of a database?
>
> If we know in advance the endpoints (IPs) of all Gatus instances, we could simply list them in the configuration and they could elect a leader by talking to each other. One well-known algorithm to do that is Raft: https://raft.github.io/

I think an easier/quicker path to HA might be to model it after Prometheus and leverage Alertmanager to de-dupe alerts.

I've only taken a brief look so far, but I think the existing custom notification will work with Alertmanager as long as the notification limiter is commented out.
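
For what it's worth, a rough sketch of what that could look like, with Gatus's custom alerting provider posting to Alertmanager's v2 API; the exact placeholder names and whether the limiter really needs to be bypassed would have to be verified against the current code:

```yaml
alerting:
  custom:
    url: "http://alertmanager:9093/api/v2/alerts"
    method: "POST"
    headers:
      Content-Type: application/json
    # Placeholders such as [ENDPOINT_NAME] and [ALERT_DESCRIPTION] are
    # substituted by Gatus before the request is sent; names may differ
    # between versions.
    body: |
      [
        {
          "labels": {
            "alertname": "gatus",
            "endpoint": "[ENDPOINT_NAME]"
          },
          "annotations": {
            "description": "[ALERT_DESCRIPTION]"
          }
        }
      ]
```

Alertmanager would then take care of deduplicating, grouping and routing the alerts, which is essentially how Prometheus handles HA pairs.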

beatkind commented 10 months ago

Hi there, I think this issue has lost a bit of traction. Is there any update on this topic beyond what is described in this issue?