BBB-Meeting disconnected from PILOS during network issues

SamuelWei commented 1 year ago

Describe the bug In case of temporary network problems, PILOS cannot reach the BigBlueButton server hosting the meeting for a room. Therefore, it changes the status of the server to offline. It also terminates the meeting in the database.

When the network problem is fixed, the server is brought back online, but the meeting is still marked as ended. This allows the room moderators to start a new meeting for the room if there is a persistent problem with the server. However, it will never be possible to re-enter the BBB meeting that is still in progress.

Expected behavior PILOS should be more robust in dealing with network problems. If a new meeting has not been created for the room, the meeting should be restored to a running state after the connection problem has been resolved.

During a connectivity problem and within a maximum amount of time, moderators should be able to stop waiting for an automatic recovery and instead create a new meeting, knowing that this may result in a permanently disconnected BBB meeting.

SamuelWei commented 10 months ago

Conversation in matrix

@defnull If a server cannot be reached from the LB, it should immediately be removed from the rotation for create or join calls, but all other API calls can still be tried for a certain time. They simply fail, but it is better to return an error to the client so that it can try again than to lie. After a certain time, you would then give up and consider all meetings of this server as 'zombies'. In other words, allow new meetings with the same ID but on other servers. I would leave these zombies in the database until either a considerable amount of time has passed or the broken server is accessible again and you can send the end-meeting calls.

@SamuelWei The question is then, how do I deal with it if a server is really really gone. Then nobody can start this room for time X, because it is still marked as running, right?

@defnull Consideration. The flow is "online" -> "unresponsive" -> "offline". You can configure how long you stay in the middle state (i.e. blocking API calls, but also not allowing new meetings with the same ID). I think one minute is enough. I would actually give up after one minute. Scalelite does this with a counter. I think the server is only really taken offline after 3x60 seconds. In the meantime, however, the server counts as online and fully operational, which I think is wrong. It should no longer receive new meetings.

SamuelWei commented 10 months ago

UML Timing Diagram for API Call -> Server Health

Newly added servers are checked for their status, if API call was successfull, add with highest health score and online health
As soon as the first API call fails, there health score is set to -1, every other api failure is lowing it to the lowest limit (in this example -3). As soon as the health score is decreased the server is considered 'unhealthy'. If the score reaches its lower limit, it is consided 'offline'
On the first successfull api call the score is set to 1, every other successfull api call is increasing the score up to its upper limit (in this example 3). As soon as the upper limit is reached, the server is consided 'online'.

ServerState

Server Health

Online: New meeting as created on this server, it is also used to collect data for attandace, statistics, etc.
Unhealthy: No new meetings are started on this server, however we still try to collect data for attandace, statistics, etc. Meetings on this server are still shown as running
Offline: The server is marked as offline in the database

Server Status

Enabled: Use the Server Health to decide what to do
Cordon: The server should not be used for sheduling (no new meetings), as soon as all meetings are finished, change to disabled, still try to collect data for attandace, statistics, etc.
Disabled: No api calls should be made to the server, however still allow join meetings, do not collect data for attandace, statistics, etc.

Additional actions

Panic: Send endMeeting request to all running meetings and change status to disabled

Meeting behavior on connection errors

During server 'unhealthy': Show message connection to server is lost, trying to re-connect
If the server is back online, re-connect was not canceled and meeting is still running on the server, change meeting to running again
Server is offline: All meetings are marked as ended in the database, do not re-connect

Additional actions

Cancel Re-Connect (moderators only): Show button to cancel re-connect and start new meeting (warning: this may result in a disconnected meeting). Set the meeting end time of the old meeting to the current time, set a flag to force-end meeting. Try to connect the server for x-minutes to try to end old meeting

EmmyGraugans commented 10 months ago

The x-scale is marked „time in minutes“, yet the description reads „every api-call…“

The former makes sort of sense, if a server is down for „x“ Minutes, it is probably down/unreachable, if it‘s active for „y“ Minutes, it‘s sane again. The latter does not make so much sense, „x“ api-calls can be made in a couple milliseconds, so even a minor network-outage is likely to mark the server as completely failed.

SamuelWei commented 10 months ago

@EmmyGraugans That's true, we currently call the server every minute, however this could be adjusted, you would then also have to adjust the upper and lower limits

SamuelWei commented 9 months ago

Conversation in matrix

@defnull Suggested to split health score into an error and recover counter

defnull commented 9 months ago

I dumped a bit more thought into it and came up with a slightly more robust solution that also gracefully handles flapping servers that would otherwise never reach the 'offline' state and linger in 'unhealthy' state forever. The main idea is that failures stick more than successes. I'll try to explain:

Each server has two counters (named error_count and recover_count) and three visible states (green/online, yellow/unhealthy and red/offline). The visible state can be derived from the counters, they do not not need to be stored separately. We also need two global threshold values (named healthy_threshold and unhealthy_threshold) that control how fast servers are supposed to fail or recover.

On a failed health check the recover_count is immediately set to zero. If the server is not already in red/offline state, we increase error_count too. If error_count is still below unhealthy_threshold after that, the new state is yellow/unhealthy and no new meetings are created on this server. Existing meetings are still served. Join calls may fail, but that is still better than creating the same meeting again on a different server and ending up with a split meeting. As soon as error_count reaches unhealthy_threshold, the new state switches to red/offline and all existing meetings are marked as zombies.

On a successful health check for a server that is not already in green/online state, we increase recover_count, but we do not reset error_count. Past errors will stick around for now. Since we can reach the server, we try to end all zombie meetings if there are any, but ignore any errors while doing so. If we successfully ended a meeting or got a NotFound error back, we can remove that zombie meeting from the database. Only if recover_count reaches healthy_threshold we reset error_count to zero and change the server state to green/online again.

A server is green/online if recover_count == healthy_threshold, red/offline if error_count == unhealthy_threshold and yellow/unhealthy otherwise.

This algorithm has some nice properties:

A failing server will immediately switch to yellow/unhealthy and quickly reach red/offline, even if some requests still succeed. No server will stay in yellow/unhealthy state forever.
A recovering server will stay in red/offline state until it passed multiple health checks in a row and then immediately switch to green/online. Unstable servers stay in red/offline state as they should.
If the server recovers before it reaches the red/offline state, the meetings are preserved and do not need to be re-created.

Pseudo-code:


HEALTHY_THRESHOLD = UNHEALTHY_THRESHOLD = 3

class Server:
  enabled = True
  recover_counter = 0
  error_counter = 0
  meetings = []

  @property
  def health(self):
    if self.recover_counter == HEALTHY_THRESHOLD:
      return 'green'
    elif self.error_counter == UNHEALTHY_THRESHOLD:
      return 'red'
    else:
      return 'yellow'

  @property
  def is_healthy(self):
    return self.health == 'green'

  @property
  def is_avaiable_for_new_meetings(self):
    return self.enabled and self.health == 'green'

  @property
  def allow_user_join_existing_meetings(self):
    return self.health != 'red'

  def poll(self):
    # TODO: Do some actual health checks here
    healthy = random.choice([True, False])

    if healthy and self.recover_counter < HEALTHY_THRESHOLD:

      # Try to end zombie meetings if there are any
      for meeting in self.meetings:
        if meeting.zombie:
          meeting.end(ignore_missing=True)

      # Count successes, do not forget errors yet
      self.recover_counter += 1

      # Only forget errors after a full recovery
      if self.recover_counter == HEALTHY_THRESHOLD:
        self.error_counter = 0

    elif not healthy:

      # Reset successes
      self.recover_counter = 0

      # If not already red/offline, count errors
      if self.error_counter < UNHEALTHY_THRESHOLD:
        self.error_counter += 1

        # Mark all meetings as zombies as soon as the server switches to red
        if self.error_counter == UNHEALTHY_THRESHOLD:
          for meeting in self.meetings:
            meeting.zombie = True

SamuelWei commented 7 months ago

@defnull I think there is something missing during recovery.

Imagine a server that is currently offline (recover_counter = 0; error_counter = 3) has a successfull api call -> recover_counter = 1; error_counter = 3.

Next request fails, this will not do anything, as the server is not health but error_counter is not below the threshold. If it continues like: success, fail, success,... The server will ulimatly become online again.

However I think the server should stay offline, right?

defnull commented 7 months ago

Yes, I fixed my description and code example.

THM-Health / PILOS