balena-os / leviathan

A distributed hardware testing framework
https://balena-os.github.io/leviathan
Apache License 2.0

Container pushing related tests timeout on RPi0 #418

Open vipulgupta2048 opened 3 years ago

vipulgupta2048 commented 3 years ago

Example failure: https://jenkins.dev.resin.io/job/leviathan-raspberry-pi/674/console (there are more instances of it).

rcooke-warwick commented 3 years ago

https://github.com/balena-os/leviathan/issues/414

vipulgupta2048 commented 3 years ago

Update: The container tests on the RPi0 were timing out consistently. After a recent in-depth investigation into the container healthcheck tests timing out, we have found a potential cause for these timeouts.

Some PRs slipped through automated meta-balena testing because Jenkins was building master instead of branches for almost a week. Those same PRs also appeared to pass some manual testing, since the symptoms aren't obvious. We identified the problem when we saw an increase in the number of balena engine systemd watchdog failures on the RPi0.

We narrowed it down to this PR: https://github.com/balena-os/meta-balena/pull/2175, and the work to restore this is in progress in https://github.com/balena-os/meta-balena/pull/2245. More updates are on the thread below.

This is a step in the right direction toward making the tests more resilient, so that they don't flake on the RPi0 because of how resource-constrained that device is. The issue can be closed once we no longer see consistent container test failures on the RPi0.

Meta-balena issue: https://github.com/balena-os/meta-balena/issues/2248
Thread: https://www.flowdock.com/app/rulemotion/p-testbot/threads/0Sqx0v6bu1HRKUvu3XbYQMHveA4
cc: @rcooke-warwick