balena-os / balena-allwinner

Apache License 2.0

Watchdog triggered during high load #39

Open splitice opened 6 years ago

splitice commented 6 years ago

We have observed that a watchdog-triggered device reset can occur when devices are stressed. Under heavy disk usage in particular, the default watchdog ping timeout is easily overrun, resulting in unexpected device restarts.

There will likely be multiple steps required in order to fully rectify this issue. Some possibilities include:

If it is possible to replicate this on other devices it may be more appropriate to raise this on meta-resin instead. I'll leave that to you to decide.

splitice commented 6 years ago

Here is a test which you can replicate easily. It can also be used to replicate the issue on Armbian if the systemd watchdog timeout is set to 10s as per the configuration here.

First, waste 350 MB of memory by filling a tmpfs mount:

mkdir /tmp/test
mount -t tmpfs none /tmp/test -o size=350m
dd if=/dev/zero of=/tmp/test/zero.txt bs=1024k count=350

Then start up a large number of sleep processes (copy and paste into a terminal):

sleep 100 & 
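
A loop is one way to start many sleep processes without pasting a long command line. This is only a sketch: the function name `spawn_sleepers` and the variable `SLEEPER_PIDS` are hypothetical, and the number of processes needed to trigger the reset is an assumption, since no count is given here.

```shell
# Hypothetical helper: start COUNT background sleep processes,
# each sleeping for DURATION seconds. PIDs are collected in
# SLEEPER_PIDS so the processes can be killed afterwards.
SLEEPER_PIDS=""

spawn_sleepers() {
    count=$1
    duration=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        sleep "$duration" &
        SLEEPER_PIDS="$SLEEPER_PIDS $!"
        i=$((i + 1))
    done
}

# Example call (the count of 1000 is an assumption, not a measured value):
# spawn_sleepers 1000 100
```

Running `kill $SLEEPER_PIDS` afterwards cleans up any processes that are still sleeping.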

This is quite a harsh test and does not reflect real-world loads. However, we have seen this issue occur with real software (we suspect during the application of delta updates).

splitice commented 6 years ago

It's worth noting that the Allwinner H3 has a maximum watchdog timeout of 16 s. Current systemd main loop times are far in excess of this value.
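
To compare the configured timeout against the hardware limit on a given device, the watchdog sysfs attributes can be read directly. This is a minimal sketch, assuming a kernel built with `CONFIG_WATCHDOG_SYSFS` that exposes `timeout` and `max_timeout` under `/sys/class/watchdog/watchdog0`. The function name `show_watchdog_timeouts` and the default device path are assumptions, and attributes that are missing are reported as unavailable.

```shell
# Hypothetical helper: print the configured and maximum timeout
# (in seconds) for a watchdog device directory in sysfs.
# The default path is an assumption; pass another directory if needed.
show_watchdog_timeouts() {
    dir=${1:-/sys/class/watchdog/watchdog0}
    for attr in timeout max_timeout; do
        if [ -r "$dir/$attr" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$dir/$attr")"
        else
            # Attribute not exposed by this kernel or driver
            printf '%s=unavailable\n' "$attr"
        fi
    done
}

# Example call:
# show_watchdog_timeouts
```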