THM-Health / PILOS

PILOS is an easy-to-use open source front-end for BigBlueButton servers with a built-in load balancer. Docker images: https://hub.docker.com/r/pilos/pilos
https://thm-health.github.io/PILOS/
GNU Lesser General Public License v2.1

BBB-Meeting disconnected from PILOS during network issues #652

Closed: SamuelWei closed this issue 7 months ago

SamuelWei commented 1 year ago

Describe the bug During temporary network problems, PILOS cannot reach the BigBlueButton server hosting a room's meeting. It therefore changes the server's status to offline and also marks the meeting as ended in the database.

When the network problem is fixed, the server is brought back online, but the meeting stays marked as ended. This allows the room moderators to start a new meeting for the room if the server has a persistent problem, but it is never possible to re-enter the BBB meeting that is actually still in progress.

Expected behavior PILOS should be more robust in dealing with network problems. If a new meeting has not been created for the room, the meeting should be restored to a running state after the connection problem has been resolved.

During a connectivity problem and within a maximum amount of time, moderators should be able to stop waiting for an automatic recovery and instead create a new meeting, knowing that this may result in a permanently disconnected BBB meeting.

SamuelWei commented 10 months ago

Conversation in matrix

@defnull If a server cannot be reached from the LB, it should immediately be removed from the rotation for create or join calls, but all other API calls can still be tried for a certain time. They simply fail, but it is better to return an error to the client so that it can try again than to lie. After a certain time, you would then give up and consider all meetings of this server as 'zombies'. In other words, allow new meetings with the same ID but on other servers. I would leave these zombies in the database until either a considerable amount of time has passed or the broken server is accessible again and you can send the end-meeting calls.

@SamuelWei The question then is how to deal with a server that is really gone for good. Nobody can start this room for time X, because its meeting is still marked as running, right?

@defnull Something to consider: the flow is 'online' -> 'unresponsive' -> 'offline'. You can configure how long a server stays in the middle state (i.e. API calls are blocked, but new meetings with the same ID are also not allowed). I think one minute is enough; I would give up after that. Scalelite does this with a counter: I think the server is only really taken offline after 3 x 60 seconds. In the meantime, however, the server counts as online and fully operational, which I think is wrong; it should no longer receive new meetings.
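As a rough, purely illustrative sketch of these two ideas (hypothetical names, not actual PILOS code): unreachable servers leave the create/join rotation immediately, and only after a configurable dwell time in the middle state are their meetings treated as zombies.

import time

UNRESPONSIVE_LIMIT = 60  # seconds in the middle state before giving up (assumed value, "one minute is enough")

def pick_server_for_new_meeting(servers):
    # Unreachable servers are removed from the rotation right away;
    # existing meetings on them are still served (join calls may fail).
    reachable = [s for s in servers if s.reachable]
    return min(reachable, key=lambda s: s.load) if reachable else None

def state(server, now=None):
    # online -> unresponsive -> offline, with a configurable dwell time
    # in the middle state (Scalelite uses a counter, roughly 3 x 60 s).
    now = now or time.time()
    if server.reachable:
        return "online"
    if now - server.last_seen < UNRESPONSIVE_LIMIT:
        return "unresponsive"   # block new meetings, keep trying other API calls
    return "offline"            # meetings on this server become zombies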

SamuelWei commented 10 months ago

[UML timing diagrams: API call -> server health, with lanes for ServerState, Server Health, Server Status and Additional actions; a second diagram shows Meeting behavior on connection errors and Additional actions]

EmmyGraugans commented 10 months ago

The x-scale is marked "time in minutes", yet the description reads "every api-call…".

The former makes sort of sense: if a server is down for "x" minutes, it is probably down/unreachable, and if it is active for "y" minutes, it is sane again. The latter does not make as much sense: "x" API calls can be made in a couple of milliseconds, so even a minor network outage is likely to mark the server as completely failed.

SamuelWei commented 10 months ago

@EmmyGraugans That's true. We currently poll the server every minute; however, this interval could be adjusted. You would then also have to adjust the upper and lower limits.
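To put rough numbers on it (assumed values, not PILOS defaults): with a one-minute poll interval and a limit of three checks, a dead server would only be taken fully offline after about three minutes, and would only count as online again after about three minutes of successful checks.

POLL_INTERVAL_SECONDS = 60   # assumed: one health check per minute
FAIL_LIMIT = 3               # assumed: failed checks until the server counts as offline
RECOVER_LIMIT = 3            # assumed: successful checks until it counts as online again

time_to_offline = POLL_INTERVAL_SECONDS * FAIL_LIMIT      # 180 s
time_to_online = POLL_INTERVAL_SECONDS * RECOVER_LIMIT    # 180 s

# Shortening the poll interval without raising the limits would let even a
# brief outage burn through all three checks within seconds.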

SamuelWei commented 9 months ago

Conversation in matrix

@defnull suggested splitting the health score into separate error and recover counters.

defnull commented 9 months ago

I put a bit more thought into it and came up with a slightly more robust solution that also gracefully handles flapping servers, which would otherwise never reach the 'offline' state and would linger in the 'unhealthy' state forever. The main idea is that failures stick more than successes. I'll try to explain:

Each server has two counters (named error_count and recover_count) and three visible states (green/online, yellow/unhealthy and red/offline). The visible state can be derived from the counters, so they do not need to be stored separately. We also need two global threshold values (named healthy_threshold and unhealthy_threshold) that control how quickly servers fail or recover.

On a failed health check the recover_count is immediately set to zero. If the server is not already in red/offline state, we increase error_count too. If error_count is still below unhealthy_threshold after that, the new state is yellow/unhealthy and no new meetings are created on this server. Existing meetings are still served. Join calls may fail, but that is still better than creating the same meeting again on a different server and ending up with a split meeting. As soon as error_count reaches unhealthy_threshold, the new state switches to red/offline and all existing meetings are marked as zombies.

On a successful health check for a server that is not already in green/online state, we increase recover_count, but we do not reset error_count; past errors stick around for now. Since we can reach the server, we try to end all zombie meetings if there are any, but ignore any errors while doing so. If we successfully ended a meeting or got a NotFound error back, we can remove that zombie meeting from the database. Only when recover_count reaches healthy_threshold do we reset error_count to zero and change the server state back to green/online.

A server is green/online if recover_count == healthy_threshold, red/offline if error_count == unhealthy_threshold and yellow/unhealthy otherwise.

This algorithm has some nice properties: failures stick more than successes, flapping servers eventually reach red/offline instead of lingering in yellow/unhealthy forever, and the visible state never has to be stored separately because it can always be derived from the two counters.

Pseudo-code:


import random

HEALTHY_THRESHOLD = UNHEALTHY_THRESHOLD = 3

class Server:
  enabled = True
  recover_counter = 0
  error_counter = 0
  meetings = []

  @property
  def health(self):
    if self.recover_counter == HEALTHY_THRESHOLD:
      return 'green'
    elif self.error_counter == UNHEALTHY_THRESHOLD:
      return 'red'
    else:
      return 'yellow'

  @property
  def is_healthy(self):
    return self.health == 'green'

  @property
  def is_available_for_new_meetings(self):
    return self.enabled and self.health == 'green'

  @property
  def allow_user_join_existing_meetings(self):
    return self.health != 'red'

  def poll(self):
    # TODO: Do some actual health checks here
    healthy = random.choice([True, False])

    if healthy and self.recover_counter < HEALTHY_THRESHOLD:

      # Try to end zombie meetings if there are any
      for meeting in self.meetings:
        if meeting.zombie:
          meeting.end(ignore_missing=True)

      # Count successes, do not forget errors yet
      self.recover_counter += 1

      # Only forget errors after a full recovery
      if self.recover_counter == HEALTHY_THRESHOLD:
        self.error_counter = 0

    elif not healthy:

      # Reset successes
      self.recover_counter = 0

      # If not already red/offline, count errors
      if self.error_counter < UNHEALTHY_THRESHOLD:
        self.error_counter += 1

        # Mark all meetings as zombies as soon as the server switches to red
        if self.error_counter == UNHEALTHY_THRESHOLD:
          for meeting in self.meetings:
            meeting.zombie = True

SamuelWei commented 7 months ago

@defnull I think there is something missing during recovery.

Imagine a server that is currently offline (recover_counter = 0; error_counter = 3) and has a successful API call -> recover_counter = 1; error_counter = 3.

The next request fails; this does not do anything, as the server is not healthy but error_counter is not below the threshold. If it continues like success, fail, success, ... the server will ultimately become online again.

However, I think the server should stay offline, right?

defnull commented 7 months ago

Yes, I fixed my description and code example.
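For illustration, a minimal simulation of the flapping scenario described above, using the same counter rules as the pseudo-code (fixed check results instead of random.choice): the offline server alternating between success and failure never reaches the healthy threshold and stays red.

HEALTHY_THRESHOLD = UNHEALTHY_THRESHOLD = 3

def apply_check(recover, error, healthy):
    # Same counter rules as in the pseudo-code above.
    if healthy and recover < HEALTHY_THRESHOLD:
        recover += 1
        if recover == HEALTHY_THRESHOLD:
            error = 0          # only forget errors after a full recovery
    elif not healthy:
        recover = 0            # a single failure wipes out partial recovery
        if error < UNHEALTHY_THRESHOLD:
            error += 1
    return recover, error

recover, error = 0, 3          # server starts red/offline
for healthy in [True, False, True, False, True, False]:
    recover, error = apply_check(recover, error, healthy)
    print(recover, error)      # recover never reaches 3, error stays 3 -> stays red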