blindsidenetworks / scalelite

Scalable load balancer for BigBlueButton.
GNU Affero General Public License v3.0
472 stars 248 forks source link

Scalelite creates meeting on a offline server #202

Closed bruno-berchielli closed 3 years ago

bruno-berchielli commented 4 years ago

Hello,

Thanks for this brilliant application!

Our front-end sends a create everytime before a join request. But we are facing the current scenario:

  1. Users create & join a meeting;
  2. All users left the meeting (it effectively ended);
  3. Server is disable (due to schedule uptime/downtime) and Scalelite polls and identify that this specific server is in fact, offline. There are still others servers available (online).
  4. 30mins later new users create & try to join a meeting (same ID as step 1) but Scalelite sends the request to the offline server. Users are unable to join.

Shouldn't Scalelite create the new meeting (with same previous ID) on another server that is avaiable?

I don't understand why it would send a join/create request to a server that is known to be offline.

Thank you

bruno-berchielli commented 4 years ago

Let me rephrase the question: as soon as a server is detected to be offline, can Scalelite automatically "Panic" that server?

or

Make Scalelite pool meetings from the BBB servers and once it identifies that a certain meeting is no longer happening, no longer attached that meeting ID to a specific server? (so it can be created again in the server with lowest load)

paul1278 commented 4 years ago

Which version of scalelite are you using?

as soon as a server is detected to be offline, can Scalelite automatic "Panic" that server?

With the changes of v1.0.2 this should be implemented here as soon as a server is unhealthy

bruno-berchielli commented 4 years ago

Which version of scalelite are you using?

as soon as a server is detected to be offline, can Scalelite automatic "Panic" that server?

With the changes of v1.0.2 this should be implemented here as soon as a server is unhealthy

Yeah, I'm using v1.0.2 will try again and will report right back to you

bruno-berchielli commented 4 years ago
ubuntu@loadbalancer:~$ sudo docker exec -it scalelite-api bin/rake poll:all 
D, [2020-05-09T21:12:29.205584 #532] DEBUG -- : Polling servers
D, [2020-05-09T21:12:29.205703 #532] DEBUG -- : Polling meetings
D, [2020-05-09T21:12:29.222834 #532] DEBUG -- : Polling Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
D, [2020-05-09T21:12:29.224178 #532] DEBUG -- : Polling Meeting id=comercial-blox on Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
W, [2020-05-09T21:12:39.224548 #532]  WARN -- : Failed to get server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1 status: execution expired
I, [2020-05-09T21:12:39.224774 #532]  INFO -- : Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1 offline load: unavailable
D, [2020-05-09T21:12:39.224829 #532] DEBUG -- : Polling Server id=1a977411-2251-462c-9c2e-7fd0684c383c
W, [2020-05-09T21:12:39.225658 #532]  WARN -- : Failed to check meeting id=comercial-blox status: execution expired
D, [2020-05-09T21:12:39.226221 #532] DEBUG -- : Polling Meeting id=dev-blox on Server id=1a977411-2251-462c-9c2e-7fd0684c383c
I, [2020-05-09T21:12:39.249045 #532]  INFO -- : Server id=1a977411-2251-462c-9c2e-7fd0684c383c online load: 0.0
D, [2020-05-09T21:12:39.249791 #532] DEBUG -- : Polling Server id=cada0889-c605-4a67-b173-c1d3be939f7d
I, [2020-05-09T21:12:39.250786 #532]  INFO -- : Meeting id=dev-blox on Server id=1a977411-2251-462c-9c2e-7fd0684c383c has ended
D, [2020-05-09T21:12:39.251696 #532] DEBUG -- : Polling Meeting id=papo-de-boteco-blox on Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
W, [2020-05-09T21:12:49.251219 #532]  WARN -- : Failed to get server id=cada0889-c605-4a67-b173-c1d3be939f7d status: execution expired
I, [2020-05-09T21:12:49.251438 #532]  INFO -- : Server id=cada0889-c605-4a67-b173-c1d3be939f7d offline load: unavailable
W, [2020-05-09T21:12:49.252676 #532]  WARN -- : Failed to check meeting id=papo-de-boteco-blox status: execution expired
ubuntu@loadbalancer:~$ sudo docker exec -it scalelite-api bin/rake poll:all 
D, [2020-05-09T21:13:15.061603 #543] DEBUG -- : Polling meetings
D, [2020-05-09T21:13:15.061731 #543] DEBUG -- : Polling servers
D, [2020-05-09T21:13:15.079340 #543] DEBUG -- : Polling Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
D, [2020-05-09T21:13:15.079519 #543] DEBUG -- : Polling Meeting id=comercial-blox on Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
W, [2020-05-09T21:13:25.081618 #543]  WARN -- : Failed to get server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1 status: execution expired
I, [2020-05-09T21:13:25.081847 #543]  INFO -- : Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1 offline load: unavailable
D, [2020-05-09T21:13:25.081894 #543] DEBUG -- : Polling Server id=1a977411-2251-462c-9c2e-7fd0684c383c
W, [2020-05-09T21:13:25.082774 #543]  WARN -- : Failed to check meeting id=comercial-blox status: execution expired
D, [2020-05-09T21:13:25.083343 #543] DEBUG -- : Polling Meeting id=papo-de-boteco-blox on Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
I, [2020-05-09T21:13:25.099241 #543]  INFO -- : Server id=1a977411-2251-462c-9c2e-7fd0684c383c online load: 0.0
D, [2020-05-09T21:13:25.099304 #543] DEBUG -- : Polling Server id=cada0889-c605-4a67-b173-c1d3be939f7d
W, [2020-05-09T21:13:35.088356 #543]  WARN -- : Failed to check meeting id=papo-de-boteco-blox status: execution expired
W, [2020-05-09T21:13:35.100330 #543]  WARN -- : Failed to get server id=cada0889-c605-4a67-b173-c1d3be939f7d status: execution expired
I, [2020-05-09T21:13:35.100527 #543]  INFO -- : Server id=cada0889-c605-4a67-b173-c1d3be939f7d offline load: unavailable
ubuntu@loadbalancer:~$ sudo docker exec -it scalelite-api bin/rake poll:all 
D, [2020-05-09T21:13:52.374683 #553] DEBUG -- : Polling meetings
D, [2020-05-09T21:13:52.374810 #553] DEBUG -- : Polling servers
D, [2020-05-09T21:13:52.392417 #553] DEBUG -- : Polling Meeting id=comercial-blox on Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
D, [2020-05-09T21:13:52.392893 #553] DEBUG -- : Polling Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
W, [2020-05-09T21:14:02.395032 #553]  WARN -- : Failed to get server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1 status: execution expired
I, [2020-05-09T21:14:02.395268 #553]  INFO -- : Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1 offline load: unavailable
D, [2020-05-09T21:14:02.395338 #553] DEBUG -- : Polling Server id=1a977411-2251-462c-9c2e-7fd0684c383c
W, [2020-05-09T21:14:02.395986 #553]  WARN -- : Failed to check meeting id=comercial-blox status: execution expired
D, [2020-05-09T21:14:02.396777 #553] DEBUG -- : Polling Meeting id=papo-de-boteco-blox on Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
I, [2020-05-09T21:14:02.413953 #553]  INFO -- : Server id=1a977411-2251-462c-9c2e-7fd0684c383c online load: 0.0
D, [2020-05-09T21:14:02.414037 #553] DEBUG -- : Polling Server id=cada0889-c605-4a67-b173-c1d3be939f7d
W, [2020-05-09T21:14:12.397879 #553]  WARN -- : Failed to check meeting id=papo-de-boteco-blox status: execution expired
W, [2020-05-09T21:14:12.415210 #553]  WARN -- : Failed to get server id=cada0889-c605-4a67-b173-c1d3be939f7d status: execution expired
I, [2020-05-09T21:14:12.415430 #553]  INFO -- : Server id=cada0889-c605-4a67-b173-c1d3be939f7d offline load: unavailable

@paul1278 Can confirm that this is a bug.

By using bin/rake servers I can confirm that Scalelite updated the server to Offline (automatic polling). But when I use the front end, it still sends the create/join to the offline server. I executed this command more than 15 minutes after shuting down the server, enough time for several polling.

When executing bin/rake poll:all I can see several warnings such as: WARN -- : Failed to check meeting id=comercial-blox status: execution expired

For interpretation of the log: Those servers are OFFLINE e4308e31-da25-46d5-bb2a-93f37a5f49e1 and cada0889-c605-4a67-b173-c1d3be939f7d

This one is ONLINE 1a977411-2251-462c-9c2e-7fd0684c383c

For some reason, the servers did not "Panic" at all.

EDIT: some more logs with new errors

ubuntu@loadbalancer:~$ sudo docker exec -it scalelite-api bin/rake poll:all
D, [2020-05-09T21:29:23.101052 #564] DEBUG -- : Polling meetings
D, [2020-05-09T21:29:23.101179 #564] DEBUG -- : Polling servers
D, [2020-05-09T21:29:23.118155 #564] DEBUG -- : Polling Meeting id=comercial-blox on Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
D, [2020-05-09T21:29:23.118585 #564] DEBUG -- : Polling Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
W, [2020-05-09T21:29:23.158610 #564]  WARN -- : Failed to get server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1 status: internalError: Response did not include returncode
I, [2020-05-09T21:29:23.158870 #564]  INFO -- : Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1 offline load: unavailable
D, [2020-05-09T21:29:23.158916 #564] DEBUG -- : Polling Server id=1a977411-2251-462c-9c2e-7fd0684c383c
W, [2020-05-09T21:29:23.160298 #564]  WARN -- : Unexpected BigBlueButton error polling Meeting id=comercial-blox on Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1: internalError: Response did not include returncode
D, [2020-05-09T21:29:23.161087 #564] DEBUG -- : Polling Meeting id=papo-de-boteco-blox on Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
I, [2020-05-09T21:29:23.176610 #564]  INFO -- : Server id=1a977411-2251-462c-9c2e-7fd0684c383c online load: 0.0
D, [2020-05-09T21:29:23.176687 #564] DEBUG -- : Polling Server id=cada0889-c605-4a67-b173-c1d3be939f7d
W, [2020-05-09T21:29:23.180137 #564]  WARN -- : Unexpected BigBlueButton error polling Meeting id=papo-de-boteco-blox on Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1: internalError: Response did not include returncode
W, [2020-05-09T21:29:33.179122 #564]  WARN -- : Failed to get server id=cada0889-c605-4a67-b173-c1d3be939f7d status: execution expired
I, [2020-05-09T21:29:33.179328 #564]  INFO -- : Server id=cada0889-c605-4a67-b173-c1d3be939f7d offline load: unavailable
farhatahmad commented 4 years ago

Hi @bruno-berchielli,

Can you double check and confirm that you're running the latest version? I can see the errors indicating the server is offline, which should be followed by a log saying that the server is unhealthy and then panicking that server. I don't see any of those logs which could mean that you're running an earlier version

ichdasich commented 4 years ago

I observed the same behavior on systems explicitly disabled with server:disable, and also showing disabled in server:status (no logs atm; Will reproduce during the next maintenance window). I am running 1.0.2;

This kind of makes my maintenance plan (take out of cluster, upgrade, put back in, next) flaky as well. ;-)

krim404 commented 4 years ago

can confirm, i have the same issue. EDIT: i paniced the server manually before, so it seems to be a bug in the software.

bruno-berchielli commented 4 years ago

@farhatahmad yes I'm sure I'm running the current eversion

The unhealthy followed by panicking never happned.

Thanks

farhatahmad commented 4 years ago

Gonna tag this as a bug. Do you guys have concrete steps to reproduce?

ichdasich commented 4 years ago

Not really. It is like 'start a meeting, be delegated to server X, stop meeting, do server:disable[id] for that server, start meeting in GL again'

ichdasich commented 4 years ago

Could this be related to #233?

farhatahmad commented 4 years ago

Yeah I believe you're correct @ichdasich. I can confirm that the poller does re-enable already disabled servers if they go offline and then come back online

farhatahmad commented 4 years ago

May potentially be fixed by #241.

Will wait until after the release for you guys to let me know if it happens again

vicman commented 4 years ago

I disabled and delete the server and continue asign it the meeteing to this server. depide that dissapear to server status

ichdasich commented 4 years ago

So... I still see this with 1.0.3. I start two meetings two nodes in the cluster, one per node. I disable one server. I stop the meeting on that server. I start the meeting again, still ends up on the same node.

farhatahmad commented 4 years ago

@ichdasich What you're describing in actually expected behaviour. There are some front-ends that send a create and join request for everytime a user tries to join a meeting. In the case where there is a meeting on a server, and that server gets disabled, we still want to allow users to join the meetings on that server. Now the only way to do that is to check if the meeting already exists (which it does) and send the create and join call to that server. If we don't do this, users joining a meeting on a disabled server would result in a new meeting being created on a different, enabled server.

There does seem to be an issue with meetings being created on offline servers but I'm still trying to track down the exact cause

ichdasich commented 4 years ago

@farhatahmad The case is slightly different from what you describe. I start the meeting, disable the server, stop the meeting (meeting count in status command goes to 0!) and then start the meeting anew, and it ends up on the disabled server.

I would expect that, after the meeting was fully stopped, it would end up on a new server when started again. Or is there some ghost-time a stopped meeting hangs around?

farhatahmad commented 4 years ago

There is a ghost time as Scalelite doesn't know the meeting has ended until the poller runs atleast once. Try the same test, but waiting ~60 seconds before creating the room again

bruno-berchielli commented 4 years ago

@farhatahmad the problem that I faced when I created this issue is exactly what @ichdasich described just now.

I know about the polling interval, but Scalelite simply didn't clear the meeting even after several hours that the meeting has ended and running poll:all command several times. I noticed that this occured when se server was marked as offline.

BTW, as a enhancement suggestion, I believe that Scalelite should treat an "offline" server differently than an "disable" server. Disable server = server is working but I don't want NEW meetings created there. Offline server = server is down and new/ongoing meetings should NOT be redirected there in any circunstance.

Thanks

ichdasich commented 4 years ago

Will give it a try during the next maintenance window.

farhatahmad commented 4 years ago

I know about the polling interval, but Scalelite simply didn't clear the meeting even after several hours that the meeting has ended and running poll:all command several times.

I think I might have an idea. The meetings that are being created on the offline server, were they explicitly ended (using the End Meeting button in BBB) before the server went offline? Or were they still running when the server went down?

BTW, as a enhancement suggestion, I believe that Scalelite should treat an "offline" server differently than an "disable" server. Disable server = server is working but I don't want NEW meetings created there. Offline server = server is down and new/ongoing meetings should NOT be redirected there in any circunstance.

That's actually the expected behaviour

ichdasich commented 4 years ago

Heho, I end the meeting with 'end meeting' as moderator, and it also disappears from the cluster overview, as mentioned above. But, again, I will double verify this again during the next maint window.

bruno-berchielli commented 4 years ago

Hello,

Would like to report that I´m not experiencing anymore the bug described in this issue. I was running 1.0.2 and now I´ve updated to 1.0.5 and everything is resulting in the expected behaviour. If the latest version also fixed this issue to you guys, I think we can close this.

Thank you