semzor closed this issue 3 years ago.
Is this happening a lot? Is there any specific scenario that causes this to occur?
Well, I couldn't find anything in the logs that might suggest the source of the problem. It happened 3 or 4 times a day, usually but not always during busy hours.
We are using Scalelite with the Moodle BBB plugin. Around the time the problem occurred, teachers were recording lessons without students because of exam week, so there were usually 45-50 simultaneous meetings with only one moderator in each. I am also counting unique IP addresses that request "playback.html" within a 3-minute window; that number is 500-600 during peak times.
Anyway, I've modified the /etc/systemd/system/scalelite-api.service unit file and added:
Restart=on-failure
RestartSec=1s
That minimized the downtime. After that I increased the Puma worker count to 4 (it was 1 before). The API service hasn't crashed for 2 days now.
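For reference, a minimal sketch of what that change looks like (the restart directives go in the [Service] section of the unit; the commands assume a standard systemd setup):

[Service]
Restart=on-failure
RestartSec=1s

# reload systemd and restart the service so the new settings take effect
sudo systemctl daemon-reload
sudo systemctl restart scalelite-api.service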
Hi,
I have the same problem as you, @semzor. In fact, for me it seems that the Docker upgrade process (via apt) is the culprit.
When apt is updating Docker, I see this in /var/log/syslog:
Dec 9 04:51:43 bbb-lb systemd[1]: scalelite-api.service: Main process exited, code=exited, status=143/n/a
Dec 9 04:51:43 bbb-lb systemd[1]: scalelite-api.service: Failed with result 'exit-code'.
I attached the whole log in case it helps anyone.
I think I'll try the Restart and RestartSec systemd settings proposed by semzor.
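For anyone trying to confirm the same correlation, a rough way to check it (assuming a standard Ubuntu/Debian host) is to compare the service exit timestamps with apt's package history:

# when did the scalelite-api service exit?
journalctl -u scalelite-api.service | grep 'Main process exited'
# when did apt upgrade Docker packages?
grep -i docker /var/log/apt/history.log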
We have the same problem as described by @semzor - the scalelite-api container also dies with Ruby segmentation faults. I analyzed the log and expected the segmentation faults to always have the same cause, but instead I found the following reasons for today's 5 api-container crashes:
Dec 17 07:36:23 scale001 docker[2896]: /usr/lib/ruby/2.6.0/ipaddr.rb:570: [BUG] Segmentation fault at 0x00007fac8aea8000
Dec 17 09:16:11 scale001 docker[16723]: /srv/scalelite/vendor/bundle/ruby/2.6.0/gems/activesupport-6.0.3.3/lib/active_support/parameter_filter.rb:62: [BUG] Segmentation fault at 0x0000000000000000
Dec 17 09:52:00 scale001 docker[1515]: /srv/scalelite/vendor/bundle/ruby/2.6.0/gems/activesupport-6.0.3.3/lib/active_support/parameter_filter.rb:78: [BUG] Segmentation fault at 0x00007fe0c4db0000
Dec 17 10:15:25 scale001 docker[1470]: /srv/scalelite/vendor/bundle/ruby/2.6.0/gems/puma-4.3.5/lib/puma/server.rb:791: [BUG] Segmentation fault at 0x0000000000000080
Dec 17 14:01:20 scale001 docker[1470]: /usr/lib/ruby/2.6.0/ipaddr.rb:617: [BUG] Segmentation fault at 0x00007f6dce410000
The consequence of each crash is of course the same - the api-container dies.
We are also running Scalelite v1.0.8.
Can you try setting the environment variable WEB_CONCURRENCY to 6 to give Puma more concurrent workers?
We have tested this setting. In the last 10 hours our Scalelite did not show any further crashes of the scalelite-api container. Before that we had 5 to 10 crashes a day.
Just add the line
WEB_CONCURRENCY=6
to your /etc/default/scalelite (setup without docker-compose) and restart the scalelite target. With docker-compose, put it in the environment section of scalelite-api, as sketched below.
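For the docker-compose case, a minimal sketch of the relevant part of docker-compose.yml (the image tag is only an example; keep whatever your file already uses):

scalelite-api:
  image: blindsidenetwks/scalelite:v1-api   # example tag, adjust to your deployment
  environment:
    - WEB_CONCURRENCY=6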
Thanks for the suggestion, ffdixon. As I said in my previous reply, I've changed WEB_CONCURRENCY to 4 and haven't had any issues since then. That said, the server is not as busy now as it was when the crashes happened.
Hi guys, since the issue is concurrency in Puma, just keep in mind that this is not a magic number. You want to align the number of workers with the number of CPUs in your machine, the same way you would when adding workers to Nginx.
But I am glad you got it to work.
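To make the sizing advice concrete, a small sketch (the CPU count here is just an example):

# check how many CPUs the Scalelite host has
nproc
# e.g. if it prints 4, set at most that many workers in /etc/default/scalelite
WEB_CONCURRENCY=4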
Looks like we've been able to find a solution - thank you all for your input.
I am using systemd to run the Docker containers, and I am on v1.0.8. Below are the logs I got from the host's syslog. I've had to clip some portions due to the character limit.
Ruby specific:
After the Ruby error, there are a lot of error messages like the one shown below:
Nov 27 18:59:45 bbb docker[29443]: 2020/11/27 15:59:45 [error] 609#609: *39516305 connect() failed (111 ....
Finally: