balena-os / meta-balena

A collection of Yocto layers used to build balenaOS images
https://www.balena.io/os

OS tests: ECONNRESET when fetching target state from SV #3510

Closed klutchell closed 3 days ago

klutchell commented 1 week ago

During our OS and HUP test suites, the DUT is unmanaged and in development mode. However, local requests to the Supervisor API often fail with ECONNRESET.

This happens much more often on underpowered devices like Raspberry Pi Zero, Raspberry Pi Zero 2 W, and similar. We almost never encounter the issue on emulated devices, or those with faster processors.

Known endpoints to encounter this issue:

_Originally posted by @klutchell in https://github.com/balena-os/meta-balena/pull/3420#discussion_r1742505915_

Zulip topic discussing this issue is here.

In the above thread we found that the presence of apiEndpoint in config.json causes the Supervisor to treat the device as managed and enter a crash loop. On faster devices this crash loop is difficult to notice without inspecting the logs, but on slower devices we often see the connection terminated when making local requests.

If the SV treats the device as managed and we hit a local endpoint, we should get an Unauthorized response. However, when local mode is enabled, the Unauthorized response changes to ECONNRESET.
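As a rough illustration of the apiEndpoint finding above, a test could inspect config.json before making local Supervisor requests. This is a minimal Python sketch, not the suite's actual code: the function names are hypothetical, the /mnt/boot/config.json path is an assumption about where the file lives on the DUT, and only the apiEndpoint key comes from this thread.

```python
import json
from pathlib import Path


def has_api_endpoint(config: dict) -> bool:
    """Return True if the parsed config.json contains an apiEndpoint key.

    Per the thread, the presence of this key is what makes the Supervisor
    treat the device as managed. Whether an empty value also counts is not
    stated there, so this checks for presence only.
    """
    return "apiEndpoint" in config


def load_config(path: str = "/mnt/boot/config.json") -> dict:
    """Read and parse config.json (the default path is an assumption)."""
    return json.loads(Path(path).read_text())


if __name__ == "__main__":
    # Intended to be run on the DUT itself, where config.json is readable.
    if has_api_endpoint(load_config()):
        print("apiEndpoint present: Supervisor may treat this device as managed")
    else:
        print("apiEndpoint absent: device should be treated as unmanaged")
```

Checking this up front would let a test fail with a clear message instead of surfacing the problem later as a reset connection.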

klutchell commented 1 week ago

https://github.com/balena-os/balena-raspberrypi/actions/runs/10655796340/job/29611644031?pr=1158 Balena RaspberryPi logs (1).zip

klutchell commented 1 week ago

https://github.com/balena-os/balena-raspberrypi/actions/runs/10655796337/job/29611637986?pr=1158 Balena RaspberryPi logs (2).zip

klutchell commented 1 week ago

https://github.com/balena-os/balena-raspberrypi/actions/runs/10655796334/job/29611628271?pr=1158 Balena RaspberryPi logs (3).zip

klutchell commented 1 week ago

Started a topic on Zulip here: https://balena.zulipchat.com/#narrow/stream/345889-balena-io.2Fos/topic/ECONNRESET.20when.20fetching.20target.20state.20from.20SV/near/467338834

klutchell commented 1 week ago

Testbot RaspberryPi0 2W 64 OS.zip Balena RaspberryPi Logs (4).zip

Attached host OS logs and job run logs for RPi Zero 2 W failing to get SV v2/local/target-state

Order of events from here:

  1. Create an application lock ✅
  2. Safe reboot waiting on application locks ✅
  3. Should not reboot until application lock is removed ✅
  4. Should reboot when application lock is removed ✅
  5. Wait for supervisor API to start (via ping endpoint) ✅
  6. Create an application lock ✅
  7. Safe reboot waiting on application locks ✅
  8. Get current v2/local/target-state so we can patch it in the next step ❌
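The failing step 8 above is a single GET against v2/local/target-state. A test helper could tolerate a transient reset by retrying, as in the Python sketch below. This is only an illustration under stated assumptions: the helper names are hypothetical, the base URL (including the Supervisor port) must be supplied by the caller, and the suite's real implementation may differ.

```python
import errno
import json
import time
import urllib.error
import urllib.request
from typing import Callable, TypeVar

T = TypeVar("T")


def is_connection_reset(exc: BaseException) -> bool:
    """True if exc is, or wraps, a connection reset (ECONNRESET)."""
    # urllib wraps socket errors raised while sending the request in URLError.
    if isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, BaseException):
        exc = exc.reason
    return (
        isinstance(exc, ConnectionResetError)
        or getattr(exc, "errno", None) == errno.ECONNRESET
    )


def with_reset_retry(
    call: Callable[[], T],
    attempts: int = 5,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying only on connection resets with a linear backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception as exc:
            # Anything other than a reset, or a reset on the last attempt,
            # is re-raised unchanged.
            if not is_connection_reset(exc) or attempt == attempts:
                raise
            sleep(delay_s * attempt)
    raise ValueError("attempts must be at least 1")


def fetch_target_state(base_url: str) -> dict:
    """GET <base_url>/v2/local/target-state and parse the JSON body."""
    url = f"{base_url}/v2/local/target-state"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

A caller would wrap the request as `with_reset_retry(lambda: fetch_target_state(base_url))`. A retry only papers over the symptom: if the Supervisor is in a crash loop, the final attempt still fails and the reset is re-raised.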

klutchell commented 1 week ago

Testbot Raspberry Pi OS.zip Balena RaspberryPi logs (5).zip

Attached host OS logs and job logs for Raspberry Pi Zero. https://github.com/balena-os/balena-raspberrypi/actions/runs/10739294088/job/29786753019?pr=1158

Order of events from here.

  1. Waiting for supervisor to be reachable before local push ✅
  2. Pushing container to DUT... ✅
  3. Starting builds... ✅
  4. Setting device state... ✅ (failed last few runs, but worked this time)
  5. Get containerId endpoint

klutchell commented 1 week ago

Balena RaspberryPi logs (6).zip Testbot RaspberryPi4 OS.zip

Attached host OS logs and job logs for Raspberry Pi 4. https://github.com/balena-os/balena-raspberrypi/actions/runs/10739294112/job/29786299406?pr=1158

Order of events from here.

  1. Waiting for supervisor to be reachable before local push ✅
  2. Pushing container to DUT... ✅
  3. Starting builds... ✅
  4. Setting device state... ✅
  5. Get containerId endpoint
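The first step in the sequence above, waiting for the Supervisor to be reachable, amounts to polling its ping endpoint until it answers or a deadline passes. The Python sketch below shows one way to write that. The function names, timeouts, and base URL are hypothetical and are not taken from the test suite.

```python
import http.client
import time
import urllib.request
from typing import Callable


def ping_supervisor(base_url: str, timeout_s: float = 5.0) -> bool:
    """Return True if GET <base_url>/ping answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/ping", timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Refused, reset, or timed-out connections and HTTP error statuses
        # all count as "not reachable yet".
        return False


def wait_until(
    probe: Callable[[], bool],
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() every interval_s seconds until it returns True.

    Returns False once timeout_s has elapsed without a successful probe.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller would use `wait_until(lambda: ping_supervisor(base_url))`. As the runs in this thread show, a successful ping does not guarantee that later requests succeed, so it is a precondition check rather than a fix.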

klutchell commented 3 days ago

Resolved by https://github.com/balena-os/meta-balena/pull/3512