Closed: bruno-berchielli closed this issue 3 years ago.
Let me rephrase the question: as soon as a server is detected to be offline, can Scalelite automatically "Panic" that server?
or
Make Scalelite poll meetings from the BBB servers and, once it identifies that a certain meeting is no longer running, no longer attach that meeting ID to a specific server? (So it can be created again on the server with the lowest load.)
Which version of Scalelite are you using?
as soon as a server is detected to be offline, can Scalelite automatically "Panic" that server?
With the changes in v1.0.2 this should be implemented here: as soon as a server is unhealthy, it gets panicked.
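To spell out what that flow is supposed to look like, here is a rough sketch. The class and method names (PolledServer, poll!, panic!) are made up for illustration and are not Scalelite's actual code:

```ruby
# Illustrative sketch only: names like reachable?, poll! and panic! are
# invented here and are not Scalelite's actual API.
class PolledServer
  UNHEALTHY_THRESHOLD = 3

  attr_reader :id, :panicked

  def initialize(id, reachable: false)
    @id = id
    @reachable = reachable
    @failed_checks = 0
    @panicked = false
  end

  # Stand-in for the BigBlueButton API call the poller makes with a timeout.
  def reachable?
    @reachable
  end

  def poll!
    if reachable?
      @failed_checks = 0
    else
      @failed_checks += 1
      panic! if @failed_checks >= UNHEALTHY_THRESHOLD
    end
  end

  # "Panic" means taking the server out of rotation and clearing its meetings
  # so they can be created again on a healthy server.
  def panic!
    @panicked = true
  end
end

server = PolledServer.new("e4308e31-da25-46d5-bb2a-93f37a5f49e1")
3.times { server.poll! }
puts server.panicked # => true
```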
Yeah, I'm using v1.0.2. I will try again and report right back to you.
ubuntu@loadbalancer:~$ sudo docker exec -it scalelite-api bin/rake poll:all
D, [2020-05-09T21:12:29.205584 #532] DEBUG -- : Polling servers
D, [2020-05-09T21:12:29.205703 #532] DEBUG -- : Polling meetings
D, [2020-05-09T21:12:29.222834 #532] DEBUG -- : Polling Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
D, [2020-05-09T21:12:29.224178 #532] DEBUG -- : Polling Meeting id=comercial-blox on Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
W, [2020-05-09T21:12:39.224548 #532] WARN -- : Failed to get server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1 status: execution expired
I, [2020-05-09T21:12:39.224774 #532] INFO -- : Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1 offline load: unavailable
D, [2020-05-09T21:12:39.224829 #532] DEBUG -- : Polling Server id=1a977411-2251-462c-9c2e-7fd0684c383c
W, [2020-05-09T21:12:39.225658 #532] WARN -- : Failed to check meeting id=comercial-blox status: execution expired
D, [2020-05-09T21:12:39.226221 #532] DEBUG -- : Polling Meeting id=dev-blox on Server id=1a977411-2251-462c-9c2e-7fd0684c383c
I, [2020-05-09T21:12:39.249045 #532] INFO -- : Server id=1a977411-2251-462c-9c2e-7fd0684c383c online load: 0.0
D, [2020-05-09T21:12:39.249791 #532] DEBUG -- : Polling Server id=cada0889-c605-4a67-b173-c1d3be939f7d
I, [2020-05-09T21:12:39.250786 #532] INFO -- : Meeting id=dev-blox on Server id=1a977411-2251-462c-9c2e-7fd0684c383c has ended
D, [2020-05-09T21:12:39.251696 #532] DEBUG -- : Polling Meeting id=papo-de-boteco-blox on Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
W, [2020-05-09T21:12:49.251219 #532] WARN -- : Failed to get server id=cada0889-c605-4a67-b173-c1d3be939f7d status: execution expired
I, [2020-05-09T21:12:49.251438 #532] INFO -- : Server id=cada0889-c605-4a67-b173-c1d3be939f7d offline load: unavailable
W, [2020-05-09T21:12:49.252676 #532] WARN -- : Failed to check meeting id=papo-de-boteco-blox status: execution expired
ubuntu@loadbalancer:~$ sudo docker exec -it scalelite-api bin/rake poll:all
D, [2020-05-09T21:13:15.061603 #543] DEBUG -- : Polling meetings
D, [2020-05-09T21:13:15.061731 #543] DEBUG -- : Polling servers
D, [2020-05-09T21:13:15.079340 #543] DEBUG -- : Polling Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
D, [2020-05-09T21:13:15.079519 #543] DEBUG -- : Polling Meeting id=comercial-blox on Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
W, [2020-05-09T21:13:25.081618 #543] WARN -- : Failed to get server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1 status: execution expired
I, [2020-05-09T21:13:25.081847 #543] INFO -- : Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1 offline load: unavailable
D, [2020-05-09T21:13:25.081894 #543] DEBUG -- : Polling Server id=1a977411-2251-462c-9c2e-7fd0684c383c
W, [2020-05-09T21:13:25.082774 #543] WARN -- : Failed to check meeting id=comercial-blox status: execution expired
D, [2020-05-09T21:13:25.083343 #543] DEBUG -- : Polling Meeting id=papo-de-boteco-blox on Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
I, [2020-05-09T21:13:25.099241 #543] INFO -- : Server id=1a977411-2251-462c-9c2e-7fd0684c383c online load: 0.0
D, [2020-05-09T21:13:25.099304 #543] DEBUG -- : Polling Server id=cada0889-c605-4a67-b173-c1d3be939f7d
W, [2020-05-09T21:13:35.088356 #543] WARN -- : Failed to check meeting id=papo-de-boteco-blox status: execution expired
W, [2020-05-09T21:13:35.100330 #543] WARN -- : Failed to get server id=cada0889-c605-4a67-b173-c1d3be939f7d status: execution expired
I, [2020-05-09T21:13:35.100527 #543] INFO -- : Server id=cada0889-c605-4a67-b173-c1d3be939f7d offline load: unavailable
ubuntu@loadbalancer:~$ sudo docker exec -it scalelite-api bin/rake poll:all
D, [2020-05-09T21:13:52.374683 #553] DEBUG -- : Polling meetings
D, [2020-05-09T21:13:52.374810 #553] DEBUG -- : Polling servers
D, [2020-05-09T21:13:52.392417 #553] DEBUG -- : Polling Meeting id=comercial-blox on Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
D, [2020-05-09T21:13:52.392893 #553] DEBUG -- : Polling Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
W, [2020-05-09T21:14:02.395032 #553] WARN -- : Failed to get server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1 status: execution expired
I, [2020-05-09T21:14:02.395268 #553] INFO -- : Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1 offline load: unavailable
D, [2020-05-09T21:14:02.395338 #553] DEBUG -- : Polling Server id=1a977411-2251-462c-9c2e-7fd0684c383c
W, [2020-05-09T21:14:02.395986 #553] WARN -- : Failed to check meeting id=comercial-blox status: execution expired
D, [2020-05-09T21:14:02.396777 #553] DEBUG -- : Polling Meeting id=papo-de-boteco-blox on Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
I, [2020-05-09T21:14:02.413953 #553] INFO -- : Server id=1a977411-2251-462c-9c2e-7fd0684c383c online load: 0.0
D, [2020-05-09T21:14:02.414037 #553] DEBUG -- : Polling Server id=cada0889-c605-4a67-b173-c1d3be939f7d
W, [2020-05-09T21:14:12.397879 #553] WARN -- : Failed to check meeting id=papo-de-boteco-blox status: execution expired
W, [2020-05-09T21:14:12.415210 #553] WARN -- : Failed to get server id=cada0889-c605-4a67-b173-c1d3be939f7d status: execution expired
I, [2020-05-09T21:14:12.415430 #553] INFO -- : Server id=cada0889-c605-4a67-b173-c1d3be939f7d offline load: unavailable
@paul1278 Can confirm that this is a bug.
By using bin/rake servers I can confirm that Scalelite updated the server to Offline (automatic polling). But when I use the front end, it still sends the create/join to the offline server. I executed this command more than 15 minutes after shutting down the server, enough time for several polling cycles.
When executing bin/rake poll:all I can see several warnings such as:
WARN -- : Failed to check meeting id=comercial-blox status: execution expired
For interpretation of the log:
These servers are OFFLINE:
e4308e31-da25-46d5-bb2a-93f37a5f49e1
cada0889-c605-4a67-b173-c1d3be939f7d
This one is ONLINE:
1a977411-2251-462c-9c2e-7fd0684c383c
For some reason, the servers did not "Panic" at all.
EDIT: some more logs with new errors
ubuntu@loadbalancer:~$ sudo docker exec -it scalelite-api bin/rake poll:all
D, [2020-05-09T21:29:23.101052 #564] DEBUG -- : Polling meetings
D, [2020-05-09T21:29:23.101179 #564] DEBUG -- : Polling servers
D, [2020-05-09T21:29:23.118155 #564] DEBUG -- : Polling Meeting id=comercial-blox on Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
D, [2020-05-09T21:29:23.118585 #564] DEBUG -- : Polling Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
W, [2020-05-09T21:29:23.158610 #564] WARN -- : Failed to get server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1 status: internalError: Response did not include returncode
I, [2020-05-09T21:29:23.158870 #564] INFO -- : Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1 offline load: unavailable
D, [2020-05-09T21:29:23.158916 #564] DEBUG -- : Polling Server id=1a977411-2251-462c-9c2e-7fd0684c383c
W, [2020-05-09T21:29:23.160298 #564] WARN -- : Unexpected BigBlueButton error polling Meeting id=comercial-blox on Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1: internalError: Response did not include returncode
D, [2020-05-09T21:29:23.161087 #564] DEBUG -- : Polling Meeting id=papo-de-boteco-blox on Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1
I, [2020-05-09T21:29:23.176610 #564] INFO -- : Server id=1a977411-2251-462c-9c2e-7fd0684c383c online load: 0.0
D, [2020-05-09T21:29:23.176687 #564] DEBUG -- : Polling Server id=cada0889-c605-4a67-b173-c1d3be939f7d
W, [2020-05-09T21:29:23.180137 #564] WARN -- : Unexpected BigBlueButton error polling Meeting id=papo-de-boteco-blox on Server id=e4308e31-da25-46d5-bb2a-93f37a5f49e1: internalError: Response did not include returncode
W, [2020-05-09T21:29:33.179122 #564] WARN -- : Failed to get server id=cada0889-c605-4a67-b173-c1d3be939f7d status: execution expired
I, [2020-05-09T21:29:33.179328 #564] INFO -- : Server id=cada0889-c605-4a67-b173-c1d3be939f7d offline load: unavailable
Hi @bruno-berchielli,
Can you double-check and confirm that you're running the latest version? I can see the errors indicating that the server is offline, which should be followed by a log saying that the server is unhealthy and then by the server being panicked. I don't see any of those logs, which could mean that you're running an earlier version.
I observed the same behavior on servers explicitly disabled with server:disable, which were also showing as disabled in server:status (no logs at the moment; I will reproduce this during the next maintenance window). I am running 1.0.2.
This kind of makes my maintenance plan (take a server out of the cluster, upgrade it, put it back in, move on to the next one) flaky as well. ;-)
Can confirm, I have the same issue. EDIT: I panicked the server manually before, so it seems to be a bug in the software.
@farhatahmad Yes, I'm sure I'm running the current version.
The "unhealthy" log followed by panicking never happened.
Thanks
Gonna tag this as a bug. Do you guys have concrete steps to reproduce?
Not really. It is something like: start a meeting, get delegated to server X, stop the meeting, run server:disable[id] for that server, start the meeting in GL again.
Could this be related to #233?
Yeah I believe you're correct @ichdasich. I can confirm that the poller does re-enable already disabled servers if they go offline and then come back online
May potentially be fixed by #241.
Will wait until after the release for you guys to let me know if it happens again
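For anyone following along, the general shape of a fix for the re-enabling problem is roughly the following. This is only a sketch of the idea with a made-up Server struct, not the actual change in #241:

```ruby
# Sketch only: keep the operator's "enabled" flag separate from the polled
# "online" state, so a health check can never re-enable a disabled server.
Server = Struct.new(:id, :enabled, :online) do
  # What a servers:disable-style task would set.
  def disable!
    self.enabled = false
  end

  # What the poller would set; note it never touches "enabled".
  def record_health(reachable)
    self.online = reachable
  end

  def available_for_new_meetings?
    enabled && online
  end
end

s = Server.new("1a977411-2251-462c-9c2e-7fd0684c383c", true, true)
s.disable!
s.record_health(false) # server goes offline...
s.record_health(true)  # ...and comes back online
puts s.available_for_new_meetings? # => false, the disable survives polling
```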
I disabled and deleted the server, but Scalelite kept assigning the meeting to that server, even though it had disappeared from the server status.
So... I still see this with 1.0.3. I start two meetings on two nodes in the cluster, one per node. I disable one server. I stop the meeting on that server. I start the meeting again, and it still ends up on the same node.
@ichdasich What you're describing is actually expected behaviour. There are some front-ends that send a create and a join request every time a user tries to join a meeting. In the case where there is a meeting on a server and that server gets disabled, we still want to allow users to join the meetings on that server. The only way to do that is to check whether the meeting already exists (which it does) and send the create and join calls to that server. If we didn't do this, users joining a meeting on a disabled server would end up in a new meeting created on a different, enabled server.
There does seem to be an issue with meetings being created on offline servers, but I'm still trying to track down the exact cause.
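To make the routing rule above concrete, the decision is roughly this. It's a simplified sketch with plain in-memory stand-ins, not Scalelite's Redis-backed models:

```ruby
# Simplified sketch of the create routing described above.
Server  = Struct.new(:id, :enabled, :online, :load)
Meeting = Struct.new(:id, :server)

def server_for_create(meeting_id, meetings, servers)
  existing = meetings.find { |m| m.id == meeting_id }
  # If the meeting is already known, keep sending create/join to its server,
  # even if that server has since been disabled, so users can still join it.
  return existing.server if existing

  # Otherwise place the new meeting on the least-loaded enabled, online server.
  servers.select { |s| s.enabled && s.online }.min_by(&:load)
end

servers  = [Server.new("s1", true, true, 0.0), Server.new("s2", false, true, 0.0)]
meetings = []
puts server_for_create("demo", meetings, servers).id # => "s1"
```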
@farhatahmad The case is slightly different from what you describe. I start the meeting, disable the server, stop the meeting (meeting count in status command goes to 0!) and then start the meeting anew, and it ends up on the disabled server.
I would expect that, after the meeting was fully stopped, it would end up on a new server when started again. Or is there some ghost time during which a stopped meeting hangs around?
There is a ghost time, as Scalelite doesn't know the meeting has ended until the poller runs at least once. Try the same test, but wait ~60 seconds before creating the room again.
@farhatahmad the problem that I faced when I created this issue is exactly what @ichdasich described just now.
I know about the polling interval, but Scalelite simply didn't clear the meeting even several hours after the meeting had ended and after running the poll:all command several times. I noticed that this occurred when the server was marked as offline.
BTW, as an enhancement suggestion, I believe that Scalelite should treat an "offline" server differently from a "disabled" server. Disabled server = the server is working, but I don't want NEW meetings created there. Offline server = the server is down, and new/ongoing meetings should NOT be redirected there under any circumstance.
Thanks
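In code terms, the distinction I'm suggesting would look something like this (the state names are just examples, not anything Scalelite has today):

```ruby
# Sketch of the suggested three-way state: :ok, :disabled, :offline.
Server = Struct.new(:id, :state, :load)

# Only healthy, enabled servers should receive NEW meetings.
def candidates_for_new_meetings(servers)
  servers.select { |s| s.state == :ok }
end

# A disabled server may still serve meetings that already live on it;
# an offline server should receive no traffic at all.
def can_route_existing_meeting?(server)
  server.state != :offline
end

servers = [
  Server.new("a", :ok, 1.0),
  Server.new("b", :disabled, 0.0),
  Server.new("c", :offline, 0.0),
]
puts candidates_for_new_meetings(servers).map(&:id).inspect # => ["a"]
puts can_route_existing_meeting?(servers[1]) # => true
puts can_route_existing_meeting?(servers[2]) # => false
```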
Will give it a try during the next maintenance window.
I know about the polling interval, but Scalelite simply didn't clear the meeting even several hours after the meeting had ended and after running the poll:all command several times.
I think I might have an idea. The meetings that are being created on the offline server, were they explicitly ended (using the End Meeting button in BBB) before the server went offline? Or were they still running when the server went down?
BTW, as an enhancement suggestion, I believe that Scalelite should treat an "offline" server differently from a "disabled" server. Disabled server = the server is working, but I don't want NEW meetings created there. Offline server = the server is down, and new/ongoing meetings should NOT be redirected there under any circumstance.
That's actually the expected behaviour
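To picture why a meeting record can linger when its server is offline, here is a toy sketch. fetch_meeting_info is only a stand-in for the BigBlueButton API call the poller makes, not the real method:

```ruby
require "timeout"

# Toy stand-in for the getMeetingInfo call the poller makes against BBB.
def fetch_meeting_info(server, meeting_id)
  raise Timeout::Error if server[:offline] # simulates "execution expired"
  { running: server[:meetings].include?(meeting_id) }
end

# The poller only deletes the meeting record when it can confirm the meeting
# has ended; if the server is unreachable it has to leave the record alone,
# and the next create call is routed back to that server.
def poll_meeting(server, meeting_id, registry)
  info = fetch_meeting_info(server, meeting_id)
  registry.delete(meeting_id) unless info[:running]
rescue Timeout::Error
  # can't tell whether the meeting ended, so keep the record
end

registry = { "comercial-blox" => "e4308e31-da25-46d5-bb2a-93f37a5f49e1" }
offline_server = { offline: true, meetings: [] }
poll_meeting(offline_server, "comercial-blox", registry)
puts registry.key?("comercial-blox") # => true, the meeting is never cleared
```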
Heho, I end the meeting with 'end meeting' as moderator, and it also disappears from the cluster overview, as mentioned above. But, again, I will double-check this during the next maintenance window.
Hello,
I would like to report that I'm no longer experiencing the bug described in this issue. I was running 1.0.2 and have now updated to 1.0.5, and everything results in the expected behaviour. If the latest version also fixed this issue for you guys, I think we can close this.
Thank you
Hello,
Thanks for this brilliant application!
Our front-end sends a create request every time before a join request. But we are facing the following scenario:
Shouldn't Scalelite create the new meeting (with the same previous ID) on another server that is available?
I don't understand why it would send a join/create request to a server that is known to be offline.
Thank you