balena-os / balena-supervisor

Balena Supervisor: balena's agent on devices.
https://balena.io
Other
148 stars 63 forks source link

Log better messages for cloud API error responses #1680

Open 20k-ultra opened 3 years ago

20k-ultra commented 3 years ago

Whenever the device interacts with the cloud API we should log what we are trying to communicate and log more of the response. For example, if we try to patch the target state we should say that when we get an error so that user knows there was a problem patching the target state....another place to do this is when trying to stream logs to the dashboard.

For example you will see something like:

f7402c2  [error]   Error from the API: 401
f7402c2  [error]   Non-200 response from the API! Status code: 401 - message: Error
f7402c2  [error]         at /usr/src/app/dist/app.js:22:554770
f7402c2  [error]       at runMicrotasks (<anonymous>)
f7402c2  [error]       at processTicksAndRejections (internal/process/task_queues.js:97:5)
f7402c2  [error]       at async /usr/src/app/dist/app.js:22:554078
f7402c2  [error]       at async /usr/src/app/dist/app.js:22:555600

Should explain that there's an authentication error when trying to update the current state in the cloud.

jellyfish-bot commented 3 years ago

[pipex] This issue has attached support thread https://jel.ly.fish/ee410451-f87c-4cc9-a051-7f291288253b

pipex commented 3 years ago

I closed the issue because of the PR but the PR doesn't really cover all cases. So reopened

20k-ultra commented 3 years ago

I think we missed an opportunity on https://github.com/balena-os/balena-supervisor/pull/1692https://github.com/balena-os/balena-supervisor/pull/1692. We should've made https://github.com/balena-os/balena-supervisor/blob/853acae4a8e5b4ef8f67a16fedd1d550611477bb/src/device-state/current-state.ts#L94 log that there was an error updating the current device state. In addition to logging what is in the response's body we should explain what the failure means. For example, a lot of times people think the 503 error is causing their application to not work but that's just the device cannot update the dashboard.

20k-ultra commented 3 years ago

related: https://github.com/balena-os/balena-supervisor/issues/1685

jellyfish-bot commented 3 years ago

[20k-ultra] This issue has attached support thread https://jel.ly.fish/e0591dc4-8474-427b-bcfe-4715e36d795a

20k-ultra commented 3 years ago

We have merged in https://github.com/balena-os/balena-supervisor/pull/1692 which has added some improvements but I think we can further improve API error messages because the following example is pretty verbose with a lot of unnecessary info

Jun 02 14:40:07 203db23 resin-supervisor[4770]: [error]   Error from the API: 503
Jun 02 14:40:07 203db23 resin-supervisor[4770]: [error]   Non-200 response from the API! Status code: 503 - message: Error: "The API cannot handle your re>
Jun 02 14:40:07 203db23 resin-supervisor[4770]: [error]         at /usr/src/app/dist/app.js:22:554753
Jun 02 14:40:07 203db23 resin-supervisor[4770]: [error]       at runMicrotasks (<anonymous>)
Jun 02 14:40:07 203db23 resin-supervisor[4770]: [error]       at processTicksAndRejections (internal/process/task_queues.js:97:5)
Jun 02 14:40:07 203db23 resin-supervisor[4770]: [error]       at async /usr/src/app/dist/app.js:22:554042
Jun 02 14:40:07 203db23 resin-supervisor[4770]: [error]       at async /usr/src/app/dist/app.js:22:555620
cywang117 commented 3 years ago

Additionally, sometimes the error message is Non-200 response from the API! Status code: xxx - message: Error without anything else. I'm wondering if message-less errors are being sent from the cloud in some instances.

20k-ultra commented 3 years ago

There are a few status codes which an empty body is expected, 500s..504..503..204..304..etc. We could better handle those but it's meh. Removing the stack trace and including why the device wanted to make the call and if it will retry will be more useful.

203db23 [error]   Error from the API: 503
203db23 [error]   Device was unable to fetch the latest target state.
203db23 [error]   Non-200 response from the API! Status code: 503 - message: Error: "The API cannot handle your request"
203db23 [error]   Will retry in 5 seconds.
20k-ultra commented 3 years ago

following a Go idiomatic error is also an idea that was suggested which will explain what was attempted and why it failed in a single line.

error while doing X: failed to do Y: no disk space

I think the above example would translate to:

203db23 [error]   Unable to fetch latest target state: Network call to cloud API returned 503 - The API cannot handle your request
203db23 [error]   Will retry in 5 seconds

If the action does not have a retry attempt then the 2nd log is omitted. The template would be

203db23 [error]   ${action}: ${failure}

for network calls it's important to include the status code and body as it could contain an explanation for the status code. Therefore, ${failure} would be Network call to cloud API returned ${response.status} - ${response.body}

splitice commented 3 years ago

Even the additional information now available isnt particularly helpful. This is something I started seeing today (unsure of start time).

[error]   Error from the API: 503
[error]   Non-200 response from the API! Status code: 503 - message: Error: "The API cannot handle your request right now"
[error]         at sendReportPatch (/usr/src/app/dist/app.js:22:643411)
[error]       at processTicksAndRejections (internal/process/task_queues.js:97:5)
[error]       at async /usr/src/app/dist/app.js:22:644497
[error]       at async /usr/src/app/dist/app.js:22:645405
[error]   Error from the API: 503
[error]   Non-200 response from the API! Status code: 503 - message: Error: "The API cannot handle your request right now"
[error]         at sendReportPatch (/usr/src/app/dist/app.js:22:643411)
[error]       at runMicrotasks (<anonymous>)
[error]       at processTicksAndRejections (internal/process/task_queues.js:97:5)
[error]       at async /usr/src/app/dist/app.js:22:644497
[error]   Error from the API: 503
[error]   Non-200 response from the API! Status code: 503 - message: Error: "The API cannot handle your request right now"
[error]         at sendReportPatch (/usr/src/app/dist/app.js:22:643411)
[error]       at runMicrotasks (<anonymous>)
[error]       at processTicksAndRejections (internal/process/task_queues.js:97:5)
[error]       at async /usr/src/app/dist/app.js:22:644497

Also shouldnt balena-cloud be monitored for 503 rate?

cywang117 commented 3 years ago

@splitice How often have you noticed these logs and when did you start seeing them? balenaCloud is indeed monitored for 503 rate and there haven't been any noticeable increases in 503 in the last 7 days. 503 logging is expected so I wouldn't worry too much about this error message -- from the Supervisor's POV it just means that the Supervisor will retry the last request.

splitice commented 3 years ago

@cywang117 Some time between 3 months ago and today. I only check supervisor as part of our release QA process and I would have noticed that during the last QA. Doesnt appear to affect anything other than creating a bunch of Journal entries (potentially flushing out other useful messages, but thats minor).

I can confirm that updating to the latest balena-supervisor has no effect (as expected it's a server side error). We are moving from 12.1.5 to 12.10.10 in this update. Both show similar messages.

I'm not sure what the frequency of these requests are but I see those messages every few minutes.

cywang117 commented 3 years ago

@splitice We're looking at an improvement to the Supervisor code that will set a backoff timer when it encounters 503 in a response, so that the API isn't repeatedly hammered despite sending 503. This will hopefully reduce the frequency of 503's in the journal logs.

Another change that's beginner-friendly is to remove the stack trace from being logged (i.e., starting 2nd line with at sendReportPatch (/usr/src...) since it doesn't add more clarity and only causes user alarm. It also clogs journal entries, potentially flushing out useful logs as you say. Anyone who sees this, feel free to open a PR ;)

20k-ultra commented 3 years ago

https://github.com/balena-os/balena-supervisor/pull/1798 has made some improvements to current state reporting errors. For example, whenever there was a network issue the Supervisor logged [event] Event: Device state report failure {"error":{"message":""}} but now logs [event] Event: Device state report failure {"error":"getaddrinfo ENOTFOUND api.domain.com"}

cywang117 commented 2 years ago

Piggybacking off of @20k-ultra's comment above, you may find this more verbose error log by upgrading Supervisor via self-serve to v12.10.13 or later. See v12.10.13 changelog and the linked PR in the comment above.

splitice commented 2 years ago

@20k-ultra it's good timing too. Saw empty error messages yesterday due to a preference for IPv6 (which was down).

I've logged for hostos work to explore patching gai to prefer v4.

jellyfish-bot commented 1 year ago

[mpous] This has attached https://jel.ly.fish/4927d500-29d0-4a35-9632-913250502c2b