Closed RayCulp closed 3 years ago
Cannot reproduce this in BBB 2.2.31. Which version did you encounter this issue on? Also, what do bbb-conf --status and bbb-conf --check show?
We have the same problem here, but it's not the default state; it only happens under load.
We have 7 BBB nodes hosted on VMware 6.7. Each VM is equipped with 16 CPU cores and 16 GB RAM, with no over-provisioning and dedicated resources for each VM. All of this is managed by a Scalelite node. Our load is about 2000 concurrent users across all nodes, so about 300 users per node.
On Monday this week we had only 4 nodes and 1200 users without any issues. On Monday evening we added one more node, as we expected higher load on Tuesday. On Tuesday we peaked at 1900 users, and the first nodes loaded with over 400 users started to fail on the whiteboard feature. On Tuesday evening we added 2 more nodes to mitigate this; today our peak was 1850 users. Clients on BBB node 3 were reporting the same issue as the OP. We checked our monitoring, and the load on this node at the time of the report was about 350 active users. So I think this problem starts somewhere at 350+ users per node with the HW config mentioned above.
P.S. all VMs have VMXNET3 Adapter and ESXi Servers are connected with 10G to Uplink Switches. We also have 10G Internet Uplink.
We use 2.2.31 on all our nodes. As for the load type, if we look at node 3: of 350 active users, 20 had a webcam, the rest audio only. Classic online class scenario: the teacher is presenting, the students are listening.
Your setup sounds like it is already quite good and follows important best practices. However, to understand the issue further, could you please provide NodeJS/meteor process CPU usage statistics as well as a total CPU usage graph for one of the VMs where the problem happened? A screenshot from Grafana or similar showing these would be perfect. If you don't have anything to monitor NodeJS CPU usage, I'd suggest using this or something similar: https://gitlab.senfcall.de/senfcall-public/nodejs-cpu-monitor Also, which CPU (type number + generation) does the VM host use?
Unfortunately we had not monitored the host systems until today; we only monitored the load metrics exposed by the BBB API. So I will be able to provide the requested information only tomorrow, after the next peak. Yesterday at peak time I was randomly connecting to our hosts and monitoring resource usage with htop; CPU load on a node with 350 users was 65-70%. I can't tell how much CPU NodeJS was using, though. Today we deployed system monitoring for all BBB hosts, the Scalelite host and the NFS share host. We also installed nodejs-cpu-monitor as you suggested, so by tomorrow I can tell more.
As for the CPUs used in our ESXi hosts: 4x nodes on Intel Xeon E5-2630 v3, 1x node on Intel Xeon E5-2640 v4, 2x nodes on Intel Xeon Gold 5115. We have as many ESXi hosts as we have BBB nodes because, as I mentioned before, we do not over-provision. All ESXi hosts are dual-CPU setups. No hyperthreading enabled, only raw cores assigned.
Cannot reproduce this in BBB 2.2.31. Which version did you encounter this issue on? Also, what do bbb-conf --status and bbb-conf --check show?
Sorry, I have no information about the version running on https://demo2.bigbluebutton.org/. I would assume it is the latest version.
Demo2 is just a beefy demo server, which is being overrun this week by teachers who think they can use it for their online classes. Please do not use the demo system for production-related stuff.
For your own servers where the issue is happening, I guess your server also reached its load limit for the NodeJS-based, single-threaded meteor frontend and event handling. Add the NodeJS monitoring component mentioned above so you can see when your servers reach their concurrent user limit.
Tomorrow we should be able to see whether this also applies to rst-consulting's case or whether there is some other problem to be discovered.
As far as I know, BBB uses Node.js 8; "worker_threads" became stable with Node.js 12. BBB should really prioritize moving from Ubuntu 16.04 to something newer and make use of worker threads in Node.js 12.
So, as I understand it, if my problem is NodeJS using 100% of one core, I need to reduce the overall load on the server. As my 16 cores are not under 100% load, I assume that rather than running one big BBB node, I should split it into 2 smaller ones, for example 8 cores and 8 GB RAM each, so I can host 200 users on a smaller node without problems. Using 2x 8-core, 8 GB RAM VMs I can still host a total of 400 users without running into the NodeJS single-core limitation. Right now, with one big BBB node of 16 cores and 16 GB RAM, I can't host 400 users without NodeJS issues.
rst-consulting, the CPUs you mentioned have 8 or 10 physical cores, not 16 or 20. I'd suggest ignoring the virtual thread count.
Yes, and I have two of them in each ESXi host, so I have 16 to 20 cores per ESXi host and no more than one BBB node per ESXi host. On ESXi hosts with 16 cores, only the BBB node is hosted; on servers with 20 cores, there is one BBB node and some small test VMs that do not generate any CPU load. As I mentioned before, we don't have hyperthreading enabled due to security concerns.
As suggested, we installed https://gitlab.senfcall.de/senfcall-public/nodejs-cpu-monitor on each BBB host and imported the data into our Grafana. If I understand it correctly, all values > 1.0 mean that NodeJS loaded one core to 100%, although I don't really understand how it can go over 1. If my assumption is not correct, then it means that NodeJS is using 1% of one core at top peak, and the problem with the whiteboard must originate from somewhere else.
But anyway, on all nodes where the NodeJS load stayed under 1, I did not have any whiteboard-related problems; on nodes that went over 1.0, problems were reported.
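To illustrate how a per-process CPU value can exceed 1.0, here is a minimal Python sketch. It is not how nodejs-cpu-monitor is implemented (I have not checked that); it only assumes the value is "CPU time consumed per second of wall-clock time", the same way top reports it. The function names are made up for illustration. On Linux, the utime and stime counters in /proc/<pid>/stat are summed over all threads of the process, so a process with one busy main thread plus some helper threads can use more than one core's worth of time.

```python
def parse_stat_ticks(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name (field 2) may contain spaces, so split after the last ')'.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state), so utime (field 14) is index 11
    # and stime (field 15) is index 12.
    return int(fields[11]) + int(fields[12])


def cores_used(ticks_before, ticks_after, interval_s, ticks_per_s=100):
    """Cores' worth of CPU time consumed between two samples.

    ticks_per_s is assumed to be 100 (the usual Linux value); on a real
    host, read it with os.sysconf("SC_CLK_TCK") instead.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (ticks_after - ticks_before) / (ticks_per_s * interval_s)


if __name__ == "__main__":
    # Hypothetical samples taken one second apart for a node process.
    before = parse_stat_ticks("1234 (node) S 1 1234 1234 0 -1 4194304 100 0 0 0 550 150 0 0 20 0 11 0")
    after = before + 115
    print(cores_used(before, after, 1.0))
```

A value around 1.0 therefore means the single-threaded part is saturated, while anything above it comes from additional threads of the same process.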
Today we replaced one 16-core + 16 GB RAM node with an 8-core, 8 GB RAM node, but we left its load multiplier in Scalelite the same as for the 16-core + 16 GB RAM nodes, to test how far we can go on smaller nodes. The results: everything up to 200 users is perfectly fine, 200-230 is the red zone, and above 230 you can expect audio and video stuttering as well as whiteboard malfunctions. So in our case, to handle a load of approximately 2000 users, I will split my 16-core + 16 GB RAM nodes into 2 nodes with 8 cores and 8 GB RAM each. That will give me a total of 12 servers with 8 cores and 8 GB RAM, 200*12=2400, with an average load of 166 users per node.
Some things, like IO operations, encryption and similar, are done multithreaded in NodeJS, thus ~110 to 120% seems to be the maximum, at which point a server starts to become unusable.
TL;DR: please read the last post at https://github.com/bigbluebutton/bigbluebutton/issues/10739#issuecomment-748410682
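The sizing arithmetic above can be written down as a small Python sketch. The function names are hypothetical, and the 200 users per 8-core node is only the limit observed in this particular test, not a general BBB figure.

```python
import math


def nodes_needed(total_users, safe_users_per_node):
    """Smallest number of nodes that keeps every node at or under its safe limit."""
    return math.ceil(total_users / safe_users_per_node)


def total_capacity(node_count, safe_users_per_node):
    """Total users the cluster can take before any node enters the red zone."""
    return node_count * safe_users_per_node


def average_load(total_users, node_count):
    """Average users per node, assuming the load balancer spreads users evenly."""
    return total_users / node_count


if __name__ == "__main__":
    print(nodes_needed(2000, 200))          # minimum node count for 2000 users
    print(total_capacity(12, 200))          # capacity of the planned 12 nodes
    print(round(average_load(2000, 12), 1)) # expected average users per node
```

With 12 nodes instead of the bare minimum, there is some headroom left if the load is not distributed perfectly evenly.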
Describe the bug Annotations only appear after refreshing screen (F5)
To Reproduce Steps to reproduce the behavior:
Expected behavior Other users in conference session should be able to see annotations (perhaps after a short delay of a few seconds at the most)
Actual behavior Other users can see, for example, the dot with the user name that represents the position of the pencil tool moving around on the screen, but the actual annotation only appears after refreshing the web page with F5
Screenshots None
BBB version (optional): This happens with the demo https://demo2.bigbluebutton.org/
Desktop (please complete the following information):
Additional context None