balena-os / balena-supervisor

Balena Supervisor: balena's agent on devices.
https://balena.io

Device problem pattern matching #1569

Open 20k-ultra opened 3 years ago

20k-ultra commented 3 years ago

This issue tracks patterns that emerge when a device is experiencing a problem. These patterns are questions asked by device owners and by the people helping to resolve the issue (forum members or Balena engineers).

The goal is to document these questions so that we can work towards automating the answers, removing the need for someone specialized in the Supervisor to answer them.

20k-ultra commented 3 years ago

Regarding "why isn't the Supervisor starting my container that is stuck in Downloaded?": I believe I documented a cause for that here: https://github.com/balena-io/balena-supervisor/issues/1533#issuecomment-754721617

20k-ultra commented 3 years ago

Regarding "why did the Supervisor get killed?": that question was added from the internal discussion in this FD thread, and https://github.com/balena-io/balena-supervisor/issues/1575 was opened as well.

20k-ultra commented 3 years ago

Regarding "why does the Supervisor say it is on an old release commit when all the containers are running the target release?": I think https://github.com/balena-io/balena-supervisor/issues/1579 is related.

20k-ultra commented 3 years ago

re: Why isn't the Supervisor applying the target state?

On one of my own personal development devices I've done a lot of things to break it, like deleting the database file, manually deleting containers/images via the balena command, and even running database files from different devices/applications. As a result, when I tried to bring the device back to normal operation, I saw that it wasn't applying my target state. Only after stepping through the code that applies the target state did I realize that it kept executing https://github.com/balena-io/balena-supervisor/blob/432b1dbcc59a3a2c72a78d40e2a8782208438d0a/src/compose/application-manager.ts#L200-L208 which somehow did nothing. There weren't any errors; the steps to execute just kept containing one action: noop.

I wanted to see whether a network created by the Supervisor existed but somehow wasn't being detected, so that the Supervisor kept trying to create it before applying any other steps. I stopped all my containers and ran balena system prune -a to remove everything on the device. When I started the Supervisor again, it applied my target state.

tl;dr: something to do with whether the Supervisor network exists prevented the device from applying the target state. If the logs said "Supervisor network not ready", followed by what the Supervisor is going to try to do about it, that would provide huge insight into the fact that it's in a loop trying to recreate the network.

edit: solved this exact scenario with https://github.com/balena-io/balena-supervisor/issues/1594
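
The logging suggestion in the tl;dr above could look roughly like the sketch below. This is not the Supervisor's actual code: the Step shape, the NoopLoopDetector class, and the threshold are all hypothetical names made up for illustration, and the real step types in application-manager.ts differ. The idea is only to count consecutive apply cycles that produce nothing but noop steps and to log the reason once that count crosses a threshold.

```typescript
// Hypothetical step shape for illustration; not the Supervisor's real type.
interface Step {
  action: string;
  reason?: string;
}

// Counts consecutive apply cycles whose steps were all noops and logs a
// message once the count reaches the configured threshold.
class NoopLoopDetector {
  private consecutiveNoops = 0;
  private readonly threshold: number;
  private readonly log: (msg: string) => void;

  constructor(threshold: number, log: (msg: string) => void = console.log) {
    this.threshold = threshold;
    this.log = log;
  }

  // Call once per apply cycle. Returns true if a warning was logged.
  observe(steps: Step[]): boolean {
    const onlyNoops =
      steps.length > 0 && steps.every((s) => s.action === 'noop');
    if (!onlyNoops) {
      // Any real work (or an empty step list) resets the counter.
      this.consecutiveNoops = 0;
      return false;
    }
    this.consecutiveNoops += 1;
    if (this.consecutiveNoops < this.threshold) {
      return false;
    }
    const reasons = steps
      .map((s) => s.reason ?? 'no reason given')
      .join('; ');
    this.log(
      `Target state not progressing after ${this.consecutiveNoops} noop cycles: ${reasons}`,
    );
    return true;
  }
}
```

Attaching a reason such as "Supervisor network not ready" to the noop step is what makes the log line useful; without it the detector can only say that a loop exists, not why.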

cywang117 commented 3 years ago

@20k-ultra Regarding your last comment, I definitely overlooked this during our last call, but this is potentially the same problem I've been experiencing. I pruned some images and volumes independently of the state engine, which resulted in a loop-like scenario. I recall saying during the check-in call that I wasn't sure whether your experience was similar to mine, but it looks like it is! Thanks for the note about how to fix this.

20k-ultra commented 3 years ago

That issue should be resolved now: https://github.com/balena-io/balena-supervisor/issues/1594 @cywang117. I've updated my comment to mention that issue.

cywang117 commented 3 years ago

@20k-ultra Awesome! Thank you 👍🏼

cywang117 commented 3 years ago

Regarding the 404 errors above: while 404 errors can be very broad, this could fall into the broader pattern of "Communicate error reason and origin clearly when errors occur." A recent issue that falls into this category and isn't too difficult to fix: https://github.com/balena-os/balena-supervisor/issues/1654
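
As a rough illustration of "communicate error reason and origin clearly", an error type can carry its origin alongside the reason so that a bare 404 never reaches the logs on its own. Everything here is an assumption for the sake of the example: OriginError, describeNotFound, and the origin string are hypothetical and do not exist in the Supervisor codebase.

```typescript
// Hypothetical error type that records which component produced the error.
class OriginError extends Error {
  readonly origin: string;
  readonly statusCode?: number;

  constructor(origin: string, reason: string, statusCode?: number) {
    const suffix = statusCode !== undefined ? ` (status ${statusCode})` : '';
    super(`[${origin}] ${reason}${suffix}`);
    this.name = 'OriginError';
    this.origin = origin;
    this.statusCode = statusCode;
  }
}

// Hypothetical helper: turns a bare 404 into a message that names both
// the component that returned it and the resource that was missing.
function describeNotFound(origin: string, resource: string): OriginError {
  return new OriginError(origin, `${resource} was not found`, 404);
}
```

With something like this, a log line would read "[engine] image abc was not found (status 404)" rather than just "404", which answers the "where did this come from?" question without needing someone who knows the Supervisor internals.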