balena-labs-projects / rosetta-at-home


Attempt a better ui failure detection for both gotty and tty #38

Closed ptrm closed 4 years ago

ptrm commented 4 years ago

On machines that are still working but whose ui is inaccessible, the ui service shows up like this in balena stats, at least in some cases: [Screenshot from 2020-04-30 17-36-57] [Screenshot from 2020-04-30 17-37-07]

The first fix was to prepend exec to the last line of the ui's start.sh; the second was to check whether gotty is still running and restart it if not.
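As a rough illustration of the second fix, the check-and-restart step could look like the sketch below. The helper names (`is_alive`, `ensure_running`) and the PID-file approach are assumptions made for this example, not code from this PR.

```shell
#!/bin/sh
# Hypothetical watchdog helpers; the names are illustrative only.
#
# The first fix needs no helper: ending start.sh with `exec <command>`
# makes the shell replace itself with that command, so the command
# becomes the process the container engine watches.

# is_alive PID: succeed if PID is non-empty and the process can be signalled.
is_alive() {
    [ -n "$1" ] && kill -0 "$1" 2>/dev/null
}

# ensure_running PIDFILE CMD...: start CMD in the background unless the
# PID recorded in PIDFILE still belongs to a live process.
ensure_running() {
    pidfile="$1"; shift
    pid=$(cat "$pidfile" 2>/dev/null)
    if is_alive "$pid"; then
        return 0
    fi
    "$@" &
    echo $! > "$pidfile"
}
```

Calling `ensure_running` periodically from a loop in the start script would relaunch the process only after it has actually exited; while it is alive, the call is a no-op.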

These changes build and start, hence the draft to see what you think, but I still have not caught any actual restarts ;)

chrisys commented 4 years ago

@ptrm thanks for looking into this! What if we make gotty the primary process in the container? Most of the devices on the fleet will be using the web UI so it would make sense, and if we do that the container engine should look after the process if it was ever to stop. We could also look at using Docker health checks: https://forums.balena.io/t/demo-of-docker-healthcheck-for-a-service/3133

ptrm commented 4 years ago

> What if we make gotty the primary process in the container? Most of the devices on the fleet will be using the web UI so it would make sense, and if we do that the container engine should look after the process if it was ever to stop.

Sounds good.

> We could also look at using Docker health checks: https://forums.balena.io/t/demo-of-docker-healthcheck-for-a-service/3133

Here I got discouraged yesterday by the fact that an unhealthy container has to be restarted manually from outside the container, but then today I noticed a balena healthcheck image popping up in the image list:

root@d9f4023:~# balena images
REPOSITORY                                                       TAG                      IMAGE ID            CREATED             SIZE
[...]
balena/aarch64-supervisor                                        v10.8.0                  110fa2afa572        2 months ago        67MB
balena-healthcheck-image                                         latest                   a29f45ccde2a        3 months ago        9.14kB

which looks like it does the job for unhealthy containers. So yeah, that sounds much better than manually curling in a loop :)

Then there's a way to set a tty-bound app at the config / systemd level, which might also delegate restarts outside the start script. This works on my Odroid Go Advance console, which boots straight into htop, but that one runs Debian with systemd on board, whereas the containers are much thinner.


On the other hand, the ui with the updated start script still stalled on my rpi4/4GB (seems to be more common). Curiously, memory use never went above 3GB, yet the ui service refused to run (both through the dashboard and by running balena run <id> in balenaOS), and the load averages were above 5, so I suspect something CPU- or IO-related is getting locked up. I tried fiddling with the cpu_shares compose param to lower the boinc processes' priority under high cpu load, and will see what comes of it, as the stall usually happens after about 8h of uptime.

I still strive to utilise as many cores as possible on <2.5GB pis, and hope the ui issue is connected with the reboots of lower-RAM devices, so that all the troubles can go away with one fix ;>

ptrm commented 4 years ago

The trouble after the changes is that boinctui is running at 100% cpu for no apparent reason: [image]

ptrm commented 4 years ago

After many commits the change has become a simple one. I moved the tty-attached boinctui to the exec part, because I was not able to investigate the 100% core utilisation when it was executed otherwise.

Also, the healthcheck part will restart the ui anyway. I put it in the compose file to avoid duplicating it across the two ui dockerfiles.
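For reference, a healthcheck boils down to a command whose exit status reports health: 0 for healthy, 1 for unhealthy. A minimal probe for the web ui might look like the sketch below. It assumes curl is available in the image and that gotty listens on port 8080 (its default); the function name and URL are illustrative and not taken from the compose file in this PR.

```shell
#!/bin/sh
# Hypothetical healthcheck probe; name, URL and timeout are assumptions.

# check_ui URL: return 0 if URL answers with a non-error HTTP status
# within 5 seconds, 1 otherwise (connection refused, timeout, HTTP >= 400).
check_ui() {
    if curl -fsS --max-time 5 -o /dev/null "$1" 2>/dev/null; then
        return 0
    fi
    return 1
}
```

The healthcheck command would then be something along the lines of `check_ui http://localhost:8080/ || exit 1`.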

There's also some fiddling with io and cpu shares through the kernel's native scheduler. They should only affect the boinc processes when balena or another container has something weighty to do; otherwise performance should not be affected. This did not help with the multitask reboots on <2.5GB devices, but it might help with the overall responsiveness of the device.
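To make the "only under contention" point concrete: CPU shares are relative weights, so the slice a container is guaranteed when every container is busy is its weight divided by the sum of all weights. A small sketch of that arithmetic, with hypothetical values (1024 is Docker's default weight; the 256 for boinc is made up for this example):

```shell
#!/bin/sh
# cpu_slice_percent MINE OTHER...: integer percentage of CPU time a
# container with weight MINE is guaranteed while competing with busy
# containers of weights OTHER. Illustrative helper, not part of the PR.
cpu_slice_percent() {
    mine="$1"
    total=0
    for w in "$@"; do
        total=$((total + w))
    done
    echo $((mine * 100 / total))
}

cpu_slice_percent 256 1024   # prints 20: boinc vs. one busy default-weight container
cpu_slice_percent 256        # prints 100: nothing else is competing
```

When the other containers are idle, the weights are irrelevant and the low-weight container can still use all available CPU time.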

ptrm commented 4 years ago

image

Tty-bound boinctui seems to be much calmer again. I did not have a spare lcd to check what could be the matter with the high utilisation.

As for the reboots induced by running many tasks, I tried saving dmesg logs through the uart console, but nothing appeared between the docker network devices settling down and the bootloader messages.

chrisys commented 4 years ago

@ptrm thanks for all the efforts and documentation as you go, as always! Would you mind squashing all these commits? I'll give this branch a test as well.

ptrm commented 4 years ago

> Would you mind squashing all these commits? I'll give this branch a test as well.

Here you go :)

chrisys commented 4 years ago

@balena-ci retest