bosun-monitor / bosun

Time Series Alerting Framework
http://bosun.org
MIT License

First cluster implementation #2441

Closed svagner closed 4 years ago

svagner commented 4 years ago

Description

This PR is a solution for issue https://github.com/bosun-monitor/bosun/issues/2443

It is the same as https://github.com/bosun-monitor/bosun/pull/2345, with conflicts resolved and some new API endpoints added.

The following new API endpoints were added:

POST /api/cluster/recover_cluster - manually force a new configuration in order to recover from a loss of quorum where the current configuration cannot be restored, such as when several servers die at the same time.

This works by reading all the current state for this server, creating a snapshot with the supplied configuration, and then truncating the Raft log. This is the only safe way to force a given configuration without actually altering the log to insert any new entries, which could cause conflicts with other servers with a different state.

WARNING! This operation implicitly commits all entries in the Raft log, so in general, this is an extremely unsafe operation. If you've lost your other servers and are performing a manual recovery, then you've also lost the commit information, so this is likely the best you can do. Be aware, however, that calling this can cause Raft log entries that were in the process of being replicated, but not yet committed, to become committed.

Example:

$ curl -s 127.0.0.1:8071/api/cluster/status | jq .
{
  "State": "Candidate",
  "Nodes": [
    {
      "Address": "127.0.0.1:10002",
      "State": "Follower"
    },
    {
      "Address": "127.0.0.1:10014",
      "State": "Follower"
    }
  ],
  "Stats": {
...
  }
}
$ curl -XPOST 127.0.0.1:8071/api/cluster/recover_cluster -d '{"members": [{"address": "127.0.0.1:10002"}]}'
{
  "State": "Leader",
  "Nodes": [
    {
      "Address": "127.0.0.1:10002",
      "State": "Leader"
    }
  ],
  "Stats": {
...
  }
}
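
The request and response shapes in the recover_cluster example above can be sketched in Go as follows. This is only an illustration derived from the curl session: the type and function names (RecoverRequest, ClusterStatus, buildRecoverRequest, parseStatus) are hypothetical and are not taken from the PR's code, and the "Stats" object is left unmodeled because it is elided in the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Member mirrors one entry of the "members" array in the curl example.
type Member struct {
	Address string `json:"address"`
}

// RecoverRequest is a hypothetical name for the recover_cluster request body.
type RecoverRequest struct {
	Members []Member `json:"members"`
}

// Node and ClusterStatus mirror the fields visible in the status output.
// "Stats" is elided in the example, so it is intentionally not modeled;
// encoding/json ignores unknown fields when decoding.
type Node struct {
	Address string
	State   string
}

type ClusterStatus struct {
	State string
	Nodes []Node
}

// buildRecoverRequest encodes the surviving member addresses as JSON.
func buildRecoverRequest(addrs []string) ([]byte, error) {
	req := RecoverRequest{Members: make([]Member, 0, len(addrs))}
	for _, a := range addrs {
		req.Members = append(req.Members, Member{Address: a})
	}
	return json.Marshal(req)
}

// parseStatus decodes a status-shaped JSON document.
func parseStatus(data []byte) (ClusterStatus, error) {
	var s ClusterStatus
	err := json.Unmarshal(data, &s)
	return s, err
}

func main() {
	body, err := buildRecoverRequest([]string{"127.0.0.1:10002"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // {"members":[{"address":"127.0.0.1:10002"}]}

	// Sample input copied from the response in the example (Stats omitted).
	s, err := parseStatus([]byte(`{"State":"Leader","Nodes":[{"Address":"127.0.0.1:10002","State":"Leader"}]}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(s.State, len(s.Nodes)) // Leader 1
}
```

Given the warning above, a caller would presumably want to list only the servers that are known to have survived, since the supplied configuration replaces the previous one.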

POST /api/cluster/change_master - move leadership to another node in the cluster.

Example:

$ curl -s 127.0.0.1:8071/api/cluster/status | jq .
{
  "State": "Leader",
  "Nodes": [
    {
      "Address": "127.0.0.1:10002",
      "State": "Leader"
    },
    {
      "Address": "127.0.0.1:10014",
      "State": "Follower"
    }
  ],
  "Stats": {
    ...
  }
}
$ curl -XPOST 127.0.0.1:8072/api/cluster/change_master -d '{"id": "127.0.0.1:10014", "address": "127.0.0.1:10014"}'
{"status":"error","error":"cannot transfer leadership to itself"}
$ curl -XPOST 127.0.0.1:8072/api/cluster/change_master -d '{"id": "127.0.0.1:10002", "address": "127.0.0.1:10002"}'
{"status":"error","error":"node is not the leader"}
$ curl -XPOST 127.0.0.1:8071/api/cluster/change_master -d '{"id": "127.0.0.1:10014", "address": "127.0.0.1:10014"}'
{"status":"ok"}
$ curl 127.0.0.1:8071/api/cluster/status | jq .
{
  "State": "Follower",
  "Nodes": [
    {
      "Address": "127.0.0.1:10002",
      "State": "Follower"
    },
    {
      "Address": "127.0.0.1:10014",
      "State": "Leader"
    }
  ],
  "Stats": {
...
  }
}
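
The change_master flow shown above (read the status, choose a follower, send its id and address, check the result) might be sketched in Go like this. The names ChangeMasterRequest, APIResult, pickTransferTarget, buildChangeMasterRequest and checkResult are hypothetical, and setting "id" equal to "address" is an assumption based only on the curl examples, where the two values are identical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Node and ClusterStatus mirror the fields visible in the status output.
type Node struct {
	Address string
	State   string
}

type ClusterStatus struct {
	State string
	Nodes []Node
}

// ChangeMasterRequest is a hypothetical name for the change_master body.
type ChangeMasterRequest struct {
	ID      string `json:"id"`
	Address string `json:"address"`
}

// APIResult mirrors {"status":"ok"} and {"status":"error","error":"..."}.
type APIResult struct {
	Status string `json:"status"`
	Error  string `json:"error"`
}

// pickTransferTarget returns the first follower, but only when the queried
// node reports itself as leader. This is a client-side convenience check;
// the server performs its own validation, as the error replies show.
func pickTransferTarget(s ClusterStatus) (Node, error) {
	if s.State != "Leader" {
		return Node{}, errors.New("queried node is not the leader")
	}
	for _, n := range s.Nodes {
		if n.State == "Follower" {
			return n, nil
		}
	}
	return Node{}, errors.New("no follower available")
}

// buildChangeMasterRequest encodes the target, assuming id == address.
func buildChangeMasterRequest(target Node) ([]byte, error) {
	return json.Marshal(ChangeMasterRequest{ID: target.Address, Address: target.Address})
}

// checkResult turns an error-status reply into a Go error.
func checkResult(data []byte) error {
	var r APIResult
	if err := json.Unmarshal(data, &r); err != nil {
		return err
	}
	if r.Status != "ok" {
		return fmt.Errorf("change_master failed: %s", r.Error)
	}
	return nil
}

func main() {
	status := ClusterStatus{State: "Leader", Nodes: []Node{
		{Address: "127.0.0.1:10002", State: "Leader"},
		{Address: "127.0.0.1:10014", State: "Follower"},
	}}
	target, err := pickTransferTarget(status)
	if err != nil {
		panic(err)
	}
	body, err := buildChangeMasterRequest(target)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // {"id":"127.0.0.1:10014","address":"127.0.0.1:10014"}
	fmt.Println(checkResult([]byte(`{"status":"error","error":"node is not the leader"}`)))
	// change_master failed: node is not the leader
}
```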

Type of change

From the following, please check the options that are relevant.

How has this been tested?

Checklist:

muffix commented 4 years ago

Hi @svagner, thanks a lot for the contribution. 🙌 Is it worth creating an issue first to discuss the approach and keep this PR for the technical discussion? Can I also ask you to please use the template that's generated when you open a PR and fill in the details? This makes it easier for us to review bigger changes and contributions like this one. That'd be much appreciated.

svagner commented 4 years ago

I'll create an issue then. We already have a couple of them, but there has been no movement on clustering. There is also some discussion in the previous PR, but it wouldn't hurt to have a separate issue for the discussion.

svagner commented 4 years ago

The new implementation is in https://github.com/bosun-monitor/bosun/pull/2472