balena-os / balena-supervisor

Balena Supervisor: balena's agent on devices.
https://balena.io
Other
148 stars 63 forks source link

Healthchecks fail - applyInProgress never false #1600

Open 20k-ultra opened 3 years ago

20k-ultra commented 3 years ago

Still a theory but cycleTimeWithinInterval is not within the needed interval and I think it's because the cloud API returned a 503 which prevents the Supervisor from updating its value for cycleTimeWithinInterval.

This basically means that if the cloud API goes down then a lot devices will begin to restart their Supervisor.


Feb 17 23:05:27 54acdbd resin-supervisor[544923]: [api]     GET /v1/healthy 200 - 4.215 ms
Feb 17 23:06:52 54acdbd resin-supervisor[544923]: [error]   Error from the API: 503
Feb 17 23:06:52 54acdbd resin-supervisor[544923]: [error]   Non-200 response from the API! Status code: 503 - message: Error
Feb 17 23:06:52 54acdbd resin-supervisor[544923]: [error]         at /usr/src/app/dist/app.js:22:554480
Feb 17 23:06:52 54acdbd resin-supervisor[544923]: [error]       at runMicrotasks (<anonymous>)
Feb 17 23:06:52 54acdbd resin-supervisor[544923]: [error]       at processTicksAndRejections (internal/process/task_queues.js:97:5)
Feb 17 23:06:52 54acdbd resin-supervisor[544923]: [error]       at async /usr/src/app/dist/app.js:22:553788
Feb 17 23:10:26 54acdbd resin-supervisor[544923]: [debug]   Attempting container log timestamp flush...
Feb 17 23:10:26 54acdbd resin-supervisor[544923]: [debug]   Container log timestamp flush complete
Feb 17 23:10:28 54acdbd resin-supervisor[544923]: [info]    Healthcheck failure - At least ONE of the following conditions must be true:
Feb 17 23:10:28 54acdbd resin-supervisor[544923]: [info]              - No applyInProgress      ? false
Feb 17 23:10:28 54acdbd resin-supervisor[544923]: [info]            - fetchesInProgress       ? false
Feb 17 23:10:28 54acdbd resin-supervisor[544923]: [info]            - cycleTimeWithinInterval ? false
Feb 17 23:10:28 54acdbd resin-supervisor[544923]: [error]   Healthcheck failed
Feb 17 23:10:28 54acdbd resin-supervisor[544923]: [api]     GET /v1/healthy 500 - 13.101 ms
Feb 17 23:11:37 54acdbd resin-supervisor[544923]: [api]     GET /v1/device 200 - 19.761 ms
Feb 17 23:11:43 54acdbd resin-supervisor[544923]: [api]     GET /v1/device 200 - 15.966 ms
Feb 17 23:15:28 54acdbd resin-supervisor[544923]: [info]    Healthcheck failure - At least ONE of the following conditions must be true:
Feb 17 23:15:28 54acdbd resin-supervisor[544923]: [info]              - No applyInProgress      ? false
Feb 17 23:15:28 54acdbd resin-supervisor[544923]: [info]            - fetchesInProgress       ? false
Feb 17 23:15:28 54acdbd resin-supervisor[544923]: [info]            - cycleTimeWithinInterval ? false
Feb 17 23:15:28 54acdbd resin-supervisor[544923]: [error]   Healthcheck failed
Feb 17 23:15:28 54acdbd resin-supervisor[544923]: [api]     GET /v1/healthy 500 - 10.452 ms
Feb 17 23:20:26 54acdbd resin-supervisor[544923]: [debug]   Attempting container log timestamp flush...
Feb 17 23:20:26 54acdbd resin-supervisor[544923]: [debug]   Container log timestamp flush complete
Feb 17 23:20:29 54acdbd resin-supervisor[544923]: [info]    Healthcheck failure - At least ONE of the following conditions must be true:
Feb 17 23:20:29 54acdbd resin-supervisor[544923]: [info]              - No applyInProgress      ? false
Feb 17 23:20:29 54acdbd resin-supervisor[544923]: [info]            - fetchesInProgress       ? false
Feb 17 23:20:29 54acdbd resin-supervisor[544923]: [info]            - cycleTimeWithinInterval ? false
Feb 17 23:20:29 54acdbd resin-supervisor[544923]: [error]   Healthcheck failed
Feb 17 23:20:29 54acdbd resin-supervisor[544923]: [api]     GET /v1/healthy 500 - 12.006 ms
Feb 17 23:20:40 54acdbd systemd[1]: resin-supervisor.service: Main process exited, code=exited, status=137/n/a
Feb 17 23:20:40 54acdbd systemd[1]: resin-supervisor.service: Failed with result 'exit-code'.
```a
20k-ultra commented 3 years ago

While investigating this issue I saw at least 3 devices from separate customers that would not apply a new target state. The Supervisor state showed update_pending as true but the logs would only show applying target state and then nothing. This is in addition to the above 503 error followed by healthcheck fail and restart of Supervisor which all 3 devices have shown.

jellyfish-bot commented 3 years ago

[20k-ultra] This issue has attached support thread https://jel.ly.fish/a83c538a-e82f-434b-bb50-e3f98065731c

jellyfish-bot commented 3 years ago

[20k-ultra] This issue has attached support thread https://jel.ly.fish/4ff915f6-f9ce-405b-a071-701853ef6349

jellyfish-bot commented 3 years ago

[20k-ultra] This issue has attached support thread https://jel.ly.fish/44cbdeef-93c4-4941-92fa-e47c0eda6c8f

20k-ultra commented 3 years ago

we should be able to make this call throw an exception to see if the healthcheck fails and new release is not installed

20k-ultra commented 3 years ago

I tested the above on my device where I made the Supervisor think it was getting only 503s from the API and that did not cause the Supervisor the fail healthchecks and restart.

This theory was flawed because of my understanding how the healthcheck variable cycleTimeWithinInterval is updated.

20k-ultra commented 3 years ago

https://jel.ly.fish/a83c538a-e82f-434b-bb50-e3f98065731c was resolved by removing the preloaded application config file and allowing the device to pull all the images from the cloud.