*Closed — ptrm closed this 4 years ago*
@ptrm thanks for looking into this! What if we make `gotty` the primary process in the container? Most of the devices on the fleet will be using the web UI so it would make sense, and if we do that the container engine should look after the process if it was ever to stop. We could also look at using Docker health checks: https://forums.balena.io/t/demo-of-docker-healthcheck-for-a-service/3133
> What if we make `gotty` the primary process in the container? Most of the devices on the fleet will be using the web UI so it would make sense, and if we do that the container engine should look after the process if it was ever to stop.
Sounds good.
> We could also look at using Docker health checks: https://forums.balena.io/t/demo-of-docker-healthcheck-for-a-service/3133
Here I got discouraged yesterday by the fact that an unhealthy container has to be restarted manually from outside, but then today I noticed the balena healthcheck image popping up in the stats:
```
root@d9f4023:~# balena images
REPOSITORY                  TAG       IMAGE ID       CREATED        SIZE
[...]
balena/aarch64-supervisor   v10.8.0   110fa2afa572   2 months ago   67MB
balena-healthcheck-image    latest    a29f45ccde2a   3 months ago   9.14kB
```
which looks like it takes care of unhealthy containers. So yeah, that sounds much better than manually curling in a loop :)
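For reference, a compose-level healthcheck for the ui could be sketched like this. The service name, image path, port, and curl probe below are assumptions for illustration, not the PR's actual file:

```yaml
# docker-compose fragment — a sketch, assuming gotty serves HTTP on port 8080
services:
  ui:
    build: ./ui
    healthcheck:
      test: ["CMD-SHELL", "curl -fs http://localhost:8080/ || exit 1"]
      interval: 1m
      timeout: 10s
      retries: 3
```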
Then there's a way to set up a tty-bound app at the config / systemd level, which might also delegate restarts outside the start script. This works on my Odroid Go Advance console, which boots straight into htop, but then it runs Debian with systemd on board; the containers here are much thinner.
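On a systemd distro, that kind of tty-bound setup is usually a unit attached to a console. A hypothetical sketch (the unit name, binary path, and tty are all assumptions, not taken from the Odroid image):

```
# /etc/systemd/system/boinctui-tty.service — hypothetical example
[Unit]
Description=boinctui on the local console
After=multi-user.target

[Service]
ExecStart=/usr/bin/boinctui
StandardInput=tty
StandardOutput=tty
TTYPath=/dev/tty1
Restart=always

[Install]
WantedBy=multi-user.target
```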
On the other hand, the ui with the updated start script still got stalled on my rpi4/4GB (this seems to be more common). What's curious is it never reached a memory load higher than 3GB, but the ui service still refused to run (both through the dashboard and by running `balena run <id>` in the balena OS), and the load averages were above 5, so I suspect some CPU or I/O resource gets locked. I tried fiddling with the `cpu_shares` compose param to lower the boinc processes' priority during high CPU loads, and will see what comes of it, as the stall usually happens after around ~8h of uptime.
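As a sketch, lowering the boinc service's CPU weight could look like this in the compose file (the service name and the value 256 are assumptions; Docker's default weight is 1024, and the setting only matters under CPU contention):

```yaml
services:
  boinc:
    cpu_shares: 256   # relative CPU weight under contention; default is 1024
```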
I still strive to utilise as many cores as possible on <2.5GB pis, and hope the ui issue is connected with the reboots of the lower-RAM devices, so that all the troubles can go away with one fix ;>
The remaining trouble after the changes is that boinctui runs at 100% CPU for no apparent reason:
After many commits the change has become a simple one. I moved the tty-attached boinctui to the exec part, because I was not able to investigate the 100% core utilisation when it was executed otherwise.
Also, the healthcheck part will restart the ui anyway; I put it in the compose file to avoid duplicating it across the two ui dockerfiles.
There's also some fiddling with io and cpu shares through the kernel's native scheduler. They should only affect the boinc processes when balena or another container has something weighty to do; otherwise performance should not be affected. This did not help with the multitask reboots on <2.5GB devices, but might help with the overall responsiveness of the device.
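For illustration, the same idea can be applied per-process from a script using the kernel's scheduler knobs. This is a sketch, not the PR's code; the `boinc` process name is an assumption:

```shell
#!/bin/sh
# Sketch: push boinc processes to the lowest CPU priority and the idle
# I/O scheduling class, so they only run when nothing else wants the core/disk.
lower_priority() {
  pid=$1
  renice -n 19 -p "$pid" >/dev/null   # nice 19 = lowest CPU priority
  if command -v ionice >/dev/null; then
    ionice -c 3 -p "$pid"             # class 3 = idle I/O scheduling class
  fi
}

# "boinc" as the process name pattern is an assumption
for pid in $(pgrep boinc); do
  lower_priority "$pid"
done
```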
The tty-bound boinctui seems to be much calmer again; I did not have a spare lcd to check what could be the matter with the high utilisation.
As for the many-task-induced reboots, I tried saving dmesg logs through the uart console, but nothing appeared between the docker network devices settling down and the bootloader messages.
@ptrm thanks for all the efforts and documentation as you go, as always! Would you mind squashing all these commits? I'll give this branch a test as well.
> Would you mind squashing all these commits? I'll give this branch a test as well.
Here you go :)
@balena-ci retest
I encountered that on inaccessible but otherwise working machines the ui service shows up this way in `balena stats`, in at least some cases:

![Screenshot from 2020-04-30 17-37-07](https://user-images.githubusercontent.com/366258/80730097-70b38080-8b09-11ea-8c16-0ec1d0db9fe3.png)

The first fix was to prepend the ui's last `start.sh` line with `exec`, the second to check if gotty is still running and restart it if not. These changes build and start, hence the draft to see what you think, but I still have not caught any actual restarts ;)
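The two fixes can be sketched roughly as follows. The `gotty -w boinctui` invocation is an assumption for illustration, not taken from the PR; the first fix is just `exec` replacing the shell so the container engine supervises gotty directly, the second is a watchdog loop:

```shell
#!/bin/sh
# Sketch of the second fix: restart the process whenever it dies.
run_supervised() {
  while true; do
    "$@" && break                        # leave the loop on a clean exit
    echo "process died, restarting..." >&2
    sleep 1
  done
}

# In start.sh this would replace a plain last line such as:
#   exec gotty -w boinctui        # first fix: exec, engine supervises gotty
# with:
#   run_supervised gotty -w boinctui
```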