Open 20k-ultra opened 3 years ago
[pipex] This issue has attached support thread https://jel.ly.fish/ee410451-f87c-4cc9-a051-7f291288253b
I closed the issue because of the PR but the PR doesn't really cover all cases. So reopened
I think we missed an opportunity on https://github.com/balena-os/balena-supervisor/pull/1692https://github.com/balena-os/balena-supervisor/pull/1692. We should've made https://github.com/balena-os/balena-supervisor/blob/853acae4a8e5b4ef8f67a16fedd1d550611477bb/src/device-state/current-state.ts#L94 log that there was an error updating the current device state. In addition to logging what is in the response's body we should explain what the failure means. For example, a lot of times people think the 503 error is causing their application to not work but that's just the device cannot update the dashboard.
[20k-ultra] This issue has attached support thread https://jel.ly.fish/e0591dc4-8474-427b-bcfe-4715e36d795a
We have merged in https://github.com/balena-os/balena-supervisor/pull/1692 which has added some improvements but I think we can further improve API error messages because the following example is pretty verbose with a lot of unnecessary info
Jun 02 14:40:07 203db23 resin-supervisor[4770]: [error] Error from the API: 503
Jun 02 14:40:07 203db23 resin-supervisor[4770]: [error] Non-200 response from the API! Status code: 503 - message: Error: "The API cannot handle your re>
Jun 02 14:40:07 203db23 resin-supervisor[4770]: [error] at /usr/src/app/dist/app.js:22:554753
Jun 02 14:40:07 203db23 resin-supervisor[4770]: [error] at runMicrotasks (<anonymous>)
Jun 02 14:40:07 203db23 resin-supervisor[4770]: [error] at processTicksAndRejections (internal/process/task_queues.js:97:5)
Jun 02 14:40:07 203db23 resin-supervisor[4770]: [error] at async /usr/src/app/dist/app.js:22:554042
Jun 02 14:40:07 203db23 resin-supervisor[4770]: [error] at async /usr/src/app/dist/app.js:22:555620
Additionally, sometimes the error message is Non-200 response from the API! Status code: xxx - message: Error
without anything else. I'm wondering if message-less errors are being sent from the cloud in some instances.
There are a few status codes which an empty body is expected, 500s..504..503..204..304..etc. We could better handle those but it's meh. Removing the stack trace and including why the device wanted to make the call and if it will retry will be more useful.
203db23 [error] Error from the API: 503
203db23 [error] Device was unable to fetch the latest target state.
203db23 [error] Non-200 response from the API! Status code: 503 - message: Error: "The API cannot handle your request"
203db23 [error] Will retry in 5 seconds.
following a Go idiomatic error is also an idea that was suggested which will explain what was attempted and why it failed in a single line.
error while doing X: failed to do Y: no disk space
I think the above example would translate to:
203db23 [error] Unable to fetch latest target state: Network call to cloud API returned 503 - The API cannot handle your request
203db23 [error] Will retry in 5 seconds
If the action does not have a retry attempt then the 2nd log is omitted. The template would be
203db23 [error] ${action}: ${failure}
for network calls it's important to include the status code and body as it could contain an explanation for the status code. Therefore, ${failure} would be Network call to cloud API returned ${response.status} - ${response.body}
Even the additional information now available isnt particularly helpful. This is something I started seeing today (unsure of start time).
[error] Error from the API: 503
[error] Non-200 response from the API! Status code: 503 - message: Error: "The API cannot handle your request right now"
[error] at sendReportPatch (/usr/src/app/dist/app.js:22:643411)
[error] at processTicksAndRejections (internal/process/task_queues.js:97:5)
[error] at async /usr/src/app/dist/app.js:22:644497
[error] at async /usr/src/app/dist/app.js:22:645405
[error] Error from the API: 503
[error] Non-200 response from the API! Status code: 503 - message: Error: "The API cannot handle your request right now"
[error] at sendReportPatch (/usr/src/app/dist/app.js:22:643411)
[error] at runMicrotasks (<anonymous>)
[error] at processTicksAndRejections (internal/process/task_queues.js:97:5)
[error] at async /usr/src/app/dist/app.js:22:644497
[error] Error from the API: 503
[error] Non-200 response from the API! Status code: 503 - message: Error: "The API cannot handle your request right now"
[error] at sendReportPatch (/usr/src/app/dist/app.js:22:643411)
[error] at runMicrotasks (<anonymous>)
[error] at processTicksAndRejections (internal/process/task_queues.js:97:5)
[error] at async /usr/src/app/dist/app.js:22:644497
Also shouldnt balena-cloud be monitored for 503 rate?
@splitice How often have you noticed these logs and when did you start seeing them? balenaCloud is indeed monitored for 503 rate and there haven't been any noticeable increases in 503 in the last 7 days. 503 logging is expected so I wouldn't worry too much about this error message -- from the Supervisor's POV it just means that the Supervisor will retry the last request.
@cywang117 Some time between 3 months ago and today. I only check supervisor as part of our release QA process and I would have noticed that during the last QA. Doesnt appear to affect anything other than creating a bunch of Journal entries (potentially flushing out other useful messages, but thats minor).
I can confirm that updating to the latest balena-supervisor has no effect (as expected it's a server side error). We are moving from 12.1.5 to 12.10.10 in this update. Both show similar messages.
I'm not sure what the frequency of these requests are but I see those messages every few minutes.
@splitice We're looking at an improvement to the Supervisor code that will set a backoff timer when it encounters 503 in a response, so that the API isn't repeatedly hammered despite sending 503. This will hopefully reduce the frequency of 503's in the journal logs.
Another change that's beginner-friendly is to remove the stack trace from being logged (i.e., starting 2nd line with at sendReportPatch (/usr/src...)
since it doesn't add more clarity and only causes user alarm. It also clogs journal entries, potentially flushing out useful logs as you say. Anyone who sees this, feel free to open a PR ;)
https://github.com/balena-os/balena-supervisor/pull/1798 has made some improvements to current state reporting errors. For example, whenever there was a network issue the Supervisor logged [event] Event: Device state report failure {"error":{"message":""}}
but now logs [event] Event: Device state report failure {"error":"getaddrinfo ENOTFOUND api.domain.com"}
Piggybacking off of @20k-ultra's comment above, you may find this more verbose error log by upgrading Supervisor via self-serve to v12.10.13 or later. See v12.10.13 changelog and the linked PR in the comment above.
@20k-ultra it's good timing too. Saw empty error messages yesterday due to a preference for IPv6 (which was down).
I've logged for hostos work to explore patching gai to prefer v4.
[mpous] This has attached https://jel.ly.fish/4927d500-29d0-4a35-9632-913250502c2b
Whenever the device interacts with the cloud API we should log what we are trying to communicate and log more of the response. For example, if we try to patch the target state we should say that when we get an error so that user knows there was a problem patching the target state....another place to do this is when trying to stream logs to the dashboard.
For example you will see something like:
Should explain that there's an authentication error when trying to update the current state in the cloud.