Closed: SamuelWei closed this issue 7 months ago
Conversation in Matrix
@defnull If a server cannot be reached from the LB, it should immediately be removed from the rotation for create or join calls, but all other API calls can still be tried for a certain time. They will simply fail, but it is better to return an error to the client so that it can try again than to lie. After a certain time, you would then give up and treat all meetings on this server as "zombies", in other words, allow new meetings with the same ID but on other servers. I would leave these zombies in the database until either a considerable amount of time has passed or the broken server is accessible again and you can send the end-meeting calls.
@SamuelWei The question then is how to deal with a server that is really gone for good. Then nobody can start this room for time X, because it is still marked as running, right?
@defnull A thought: the flow is "online" -> "unresponsive" -> "offline". You can configure how long a server stays in the middle state (i.e. blocking API calls, but also not allowing new meetings with the same ID). I think one minute is enough; I would actually give up after one minute. Scalelite does this with a counter: I think the server is only really taken offline after 3×60 seconds. In the meantime, however, the server counts as online and fully operational, which I think is wrong. It should no longer receive new meetings.
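The three-state flow described above can be sketched as a small state machine. This is a minimal, hedged illustration, not PILOS or Scalelite code; the class and attribute names (`ServerState`, `record_check`, `OFFLINE_AFTER`) are made up for this sketch, and it shows the simple variant where a single successful check restores the server (the weakness the counter scheme later in this thread addresses):

```python
from enum import Enum

class State(Enum):
    ONLINE = 'online'
    UNRESPONSIVE = 'unresponsive'
    OFFLINE = 'offline'

# Consecutive failed checks before giving up, e.g. 3 checks one minute apart
OFFLINE_AFTER = 3

class ServerState:
    def __init__(self):
        self.state = State.ONLINE
        self.failed_checks = 0

    def record_check(self, reachable: bool):
        if reachable:
            # Simple model: the first successful check fully restores the server
            self.failed_checks = 0
            self.state = State.ONLINE
        else:
            self.failed_checks += 1
            # Unresponsive first; offline only after repeated failures
            self.state = (State.OFFLINE
                          if self.failed_checks >= OFFLINE_AFTER
                          else State.UNRESPONSIVE)
```

While `UNRESPONSIVE`, the LB would stop creating new meetings on the server but still try other API calls; only `OFFLINE` frees the meeting IDs for other servers.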
The x-scale is marked "time in minutes", yet the description reads "every api-call…".
The former makes sort of sense: if a server is down for "x" minutes, it is probably down/unreachable, and if it is active for "y" minutes, it is sane again. The latter makes less sense: "x" API calls can be made in a couple of milliseconds, so even a minor network outage is likely to mark the server as completely failed.
@EmmyGraugans That's true. We currently call the server every minute; however, this could be adjusted. You would then also have to adjust the upper and lower limits.
Conversation in Matrix
@defnull suggested splitting the health score into an error counter and a recover counter:
I put a bit more thought into it and came up with a slightly more robust solution that also gracefully handles flapping servers, which would otherwise never reach the 'offline' state and linger in the 'unhealthy' state forever. The main idea is that failures stick more than successes. I'll try to explain:
Each server has two counters (named `error_count` and `recover_count`) and three visible states (`green/online`, `yellow/unhealthy` and `red/offline`). The visible state can be derived from the counters; they do not need to be stored separately. We also need two global threshold values (named `healthy_threshold` and `unhealthy_threshold`) that control how fast servers are supposed to fail or recover.
On a failed health check, `recover_count` is immediately set to zero. If the server is not already in the `red/offline` state, we increase `error_count` too. If `error_count` is still below `unhealthy_threshold` after that, the new state is `yellow/unhealthy` and no new meetings are created on this server. Existing meetings are still served. Join calls may fail, but that is still better than creating the same meeting again on a different server and ending up with a split meeting. As soon as `error_count` reaches `unhealthy_threshold`, the state switches to `red/offline` and all existing meetings are marked as zombies.
On a successful health check for a server that is not already in the `green/online` state, we increase `recover_count`, but we do not reset `error_count`; past errors stick around for now. Since we can reach the server, we try to end all zombie meetings, if there are any, but ignore any errors while doing so. If we successfully ended a meeting or got a NotFound error back, we can remove that zombie meeting from the database. Only when `recover_count` reaches `healthy_threshold` do we reset `error_count` to zero and change the server state back to `green/online`.
A server is `green/online` if `recover_count == healthy_threshold`, `red/offline` if `error_count == unhealthy_threshold`, and `yellow/unhealthy` otherwise.
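The state derivation above is a pure function of the two counters. As a quick hedged sketch (the function name and defaults are mine, matching the pseudo-code further down):

```python
def health(recover_count: int, error_count: int,
           healthy_threshold: int = 3, unhealthy_threshold: int = 3) -> str:
    # Green only after a full run of consecutive successful checks
    if recover_count == healthy_threshold:
        return 'green'
    # Red once errors have hit the threshold; errors stick until full recovery
    if error_count == unhealthy_threshold:
        return 'red'
    return 'yellow'
```

Note that a server with `recover_count = 1` and `error_count = 3` is still red: a recovering server stays offline until it has passed a full run of checks and `error_count` is reset.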
This algorithm has some nice properties:

- Flapping servers quickly drop to `yellow/unhealthy` and quickly reach `red/offline`, even if some requests still succeed. No server will stay in the `yellow/unhealthy` state forever.
- A server stays in the `red/offline` state until it has passed multiple health checks in a row, and then immediately switches to `green/online`. Unstable servers stay in the `red/offline` state, as they should.
- If a server recovers from the `red/offline` state, the meetings are preserved and do not need to be re-created.

Pseudo-code:
```python
import random

HEALTHY_THRESHOLD = UNHEALTHY_THRESHOLD = 3

class Server:
    enabled = True
    recover_counter = 0
    error_counter = 0
    meetings = []

    @property
    def health(self):
        if self.recover_counter == HEALTHY_THRESHOLD:
            return 'green'
        elif self.error_counter == UNHEALTHY_THRESHOLD:
            return 'red'
        else:
            return 'yellow'

    @property
    def is_healthy(self):
        return self.health == 'green'

    @property
    def is_available_for_new_meetings(self):
        return self.enabled and self.health == 'green'

    @property
    def allow_user_join_existing_meetings(self):
        return self.health != 'red'

    def poll(self):
        # TODO: Do some actual health checks here
        healthy = random.choice([True, False])

        if healthy and self.recover_counter < HEALTHY_THRESHOLD:
            # Try to end zombie meetings if there are any
            for meeting in self.meetings:
                if meeting.zombie:
                    meeting.end(ignore_missing=True)
            # Count successes, do not forget errors yet
            self.recover_counter += 1
            # Only forget errors after a full recovery
            if self.recover_counter == HEALTHY_THRESHOLD:
                self.error_counter = 0
        elif not healthy:
            # Reset successes
            self.recover_counter = 0
            # If not already red/offline, count errors
            if self.error_counter < UNHEALTHY_THRESHOLD:
                self.error_counter += 1
            # Mark all meetings as zombies as soon as the server switches to red
            if self.error_counter == UNHEALTHY_THRESHOLD:
                for meeting in self.meetings:
                    meeting.zombie = True
```
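To sanity-check the flapping property, here is a self-contained, deterministic sketch of just the counter logic, with meeting handling stripped out. The class and method names (`ServerCounters`, `record`) are mine, not from PILOS; the transitions follow the description above:

```python
HEALTHY_THRESHOLD = UNHEALTHY_THRESHOLD = 3

class ServerCounters:
    def __init__(self):
        self.recover_counter = 0
        self.error_counter = 0

    @property
    def health(self):
        if self.recover_counter == HEALTHY_THRESHOLD:
            return 'green'
        if self.error_counter == UNHEALTHY_THRESHOLD:
            return 'red'
        return 'yellow'

    def record(self, healthy: bool):
        if healthy and self.recover_counter < HEALTHY_THRESHOLD:
            # Count successes, but do not forget past errors yet
            self.recover_counter += 1
            if self.recover_counter == HEALTHY_THRESHOLD:
                self.error_counter = 0
        elif not healthy:
            # Failures stick: any failure wipes out recovery progress
            self.recover_counter = 0
            if self.error_counter < UNHEALTHY_THRESHOLD:
                self.error_counter += 1

s = ServerCounters()
for _ in range(3):          # three failures take the server to red
    s.record(False)
assert s.health == 'red'
for _ in range(5):          # alternating success/failure: stays red
    s.record(True)
    s.record(False)
assert s.health == 'red'
for _ in range(3):          # only three successes in a row restore green
    s.record(True)
assert s.health == 'green'
```

The alternating loop is exactly the flapping scenario raised below: each failure resets `recover_counter`, so the server can only turn green again after an uninterrupted run of successful checks.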
@defnull I think there is something missing during recovery.
Imagine a server that is currently offline (`recover_counter = 0; error_counter = 3`) and has a successful API call -> `recover_counter = 1; error_counter = 3`.
The next request fails, but this will not do anything, as the server is not healthy and `error_counter` is not below the threshold. If it continues like success, fail, success, ... the server will ultimately become online again.
However, I think the server should stay offline, right?
Yes, I fixed my description and code example.
Describe the bug In case of temporary network problems, PILOS cannot reach the BigBlueButton server hosting the meeting for a room. It therefore changes the status of the server to offline and also marks the meeting as ended in the database.
When the network problem is fixed, the server is brought back online, but the meeting is still marked as ended. This allows the room moderators to start a new meeting for the room if there is a persistent problem with the server; however, it then becomes impossible to re-enter the BBB meeting that is still in progress.
Expected behavior PILOS should be more robust in dealing with network problems. If a new meeting has not been created for the room, the meeting should be restored to a running state after the connection problem has been resolved.
During a connectivity problem and within a maximum amount of time, moderators should be able to stop waiting for an automatic recovery and instead create a new meeting, knowing that this may result in a permanently disconnected BBB meeting.