dgreif / ring

Unofficial packages for Ring Doorbells, Cameras, Alarm System, and Smart Lighting

Request to location API failed (502) and Homebridge crashed entirely #1160

Open · tbaur opened this issue 1 year ago

tbaur commented 1 year ago

Bug Report

Describe the Bug

An API call hit a 502, and it crashed homebridge entirely. Logs below. This happened 3-4 times last night. Clearly the Ring API was having issues, but it shouldn't crash homebridge altogether.

Screenshots/Logs

[2/17/2023, 4:54:44 AM] [Ring] Request to https://app.ring.com/api/v1/mode/location/0950258e-XXX failed with status 502. Response body:

502 Bad Gateway

[2/17/2023, 4:54:44 AM] HTTPError: Response code 502 (Bad Gateway)
    at Request. (/usr/local/lib/node_modules/homebridge-ring/node_modules/got/dist/source/as-promise/index.js:118:42)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
[2/17/2023, 4:54:44 AM] Got SIGTERM, shutting down Homebridge...

Environment

OSX and current versions of node, npm, homebridge, and homebridge-ring

tbaur commented 1 year ago

More of the same crashing.

[2/23/2023, 7:27:27 AM] [Ring] Request to https://app.ring.com/api/v1/mode/location/0950258e-XXX failed with status 400. Response body: envproxy unknown host

[2/23/2023, 7:27:27 AM] HTTPError: Response code 400 (Bad Request)
    at Request. (/usr/local/lib/node_modules/homebridge-ring/node_modules/got/dist/source/as-promise/index.js:118:42)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
[2/23/2023, 7:27:27 AM] Got SIGTERM, shutting down Homebridge...

tbaur commented 1 year ago

And then, after repeated crashes and homebridge restarts, the API starts rejecting requests, which causes even further spinning.

[2/23/2023, 7:27:40 AM] [Ring] HTTPError: Response code 429 (Too Many Requests)

dgreif commented 1 year ago

@tbaur sorry I didn't see this issue sooner. The 429 is a rate-limiting response, likely because you restarted many times, which caused a large number of requests. Are you still experiencing the 502 and 400 errors which originally caused you trouble?

tbaur commented 1 year ago

Yeah, I get the 429. Handling of the 502/400s seems to crash homebridge repeatedly, though, and with all the automatic service restarts it then hits the 429 and doesn't recover well. It's unclear why the 502/400s should crash the whole service. Some improvements in handling upstream API failures might help.

dgreif commented 1 year ago

Thanks for the update. In your second comment, I see envproxy unknown host as the error. Do you have some sort of proxy or VPN set up? I haven't seen errors like this before, so I'm guessing it's something related to your network.

tbaur commented 1 year ago

There is no proxy or VPN in line with homebridge on my side; envproxy unknown host is in the response body. I'd guess that app.ring.com is a front-end API gateway for backend services. My original point in raising this issue wasn't that app.ring.com had a problem; it was the handling of the API call errors: they shouldn't crash homebridge entirely. The 502 server-side error triggers the crash, the restart/crash loop repeats, and things go downhill with various 4xx codes.

tbaur commented 1 year ago

Here is another one from this morning: a server-side error, and the plugin crashed homebridge.

[5/15/2023, 9:34:56 AM] [Ring] Request to https://app.ring.com/api/v1/mode/location/0950258e-XXXX failed with status 503. Response body: {"errors":[4007]}
[5/15/2023, 9:34:56 AM] HTTPError: Response code 503 (Service Unavailable)
    at Request. (/usr/local/lib/node_modules/homebridge-ring/node_modules/got/dist/source/as-promise/index.js:118:42)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
[5/15/2023, 9:34:56 AM] Got SIGTERM, shutting down Homebridge...

dgreif commented 1 year ago

@tbaur just to make sure I understand your intent with this issue - is the ring plugin generally working for you, but occasionally crashing or crash-looping if an unexpected server error is encountered?

tbaur commented 1 year ago

[5/26/2023, 3:16:50 PM] [Ring] Request to https://app.ring.com/api/v1/mode/location/0950258e-XXXX failed with status 503. Response body: {"errors":[4007]}
[5/26/2023, 3:16:50 PM] HTTPError: Response code 503 (Service Unavailable)
    at Request. (/usr/local/lib/node_modules/homebridge-ring/node_modules/got/dist/source/as-promise/index.js:118:42)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
[5/26/2023, 3:16:50 PM] Got SIGTERM, shutting down Homebridge...

==> restart and crash again:

[5/26/2023, 3:27:19 PM] [Ring] Request to https://app.ring.com/api/v1/mode/location/0950258e-XXXX failed with status 500. Response body:
[5/26/2023, 3:27:19 PM] HTTPError: Response code 500 (Server Error)
    at Request. (/usr/local/lib/node_modules/homebridge-ring/node_modules/got/dist/source/as-promise/index.js:118:42)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
[5/26/2023, 3:27:19 PM] Got SIGTERM, shutting down Homebridge...

tbaur commented 1 year ago

Yes, the plugin works until server-side errors from Ring's API crash homebridge.

tsightler commented 1 year ago

So this has been out here a long time and it doesn't seem to happen often enough to be critical, but I had a little time tonight and did some digging. I think this is simply because the request() function in rest-client.ts looks for specific status codes (401, 404, 504) and retries them, but throws the error for any return code that is not explicitly handled. However, as far as I can see, none of the callers of request() actually attempt to catch an error, so the unhandled promise rejection leads to the node process exiting, since that has been the default for modern node versions since v16 (maybe v15 during development). Prior to that, this probably was less of an issue.
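For illustration, here is a minimal sketch of that failure mode. It is not the actual rest-client.ts code; it uses Node 18+'s global fetch instead of got, and the URL is just a placeholder:

```ts
// Minimal sketch (not the actual rest-client.ts code) of the failure mode
// described above, using Node 18+'s global fetch instead of got.
async function request(url: string): Promise<unknown> {
  const response = await fetch(url)
  // The real code explicitly handles a few statuses (e.g. 401, 404, 504)
  // by refreshing auth or retrying; everything else falls through and throws.
  if (!response.ok) {
    throw new Error(`Response code ${response.status} (${response.statusText})`)
  }
  return response.json()
}

// If no caller catches, a 502 here becomes an unhandled promise rejection.
// Since Node 15 the default --unhandled-rejections mode is "throw", which
// terminates the process and takes Homebridge down with it.
void request('https://app.ring.com/api/v1/mode/location/XXX')
```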

I don't see this in ring-mqtt because I use process.on('unhandledRejection') to catch such events for a few other "normal" cases, such as an invalid token, and to make the error messages slightly cleaner by not outputting the stack trace (otherwise users open cases about them). It also has the advantage of catching these crashing cases. However, I don't think this approach is appropriate for a library.
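For reference, that kind of application-level safety net looks roughly like this (a sketch of the approach described, not the actual ring-mqtt code):

```ts
// Application-level safety net: catch unhandled rejections, log a cleaned-up
// message without the stack trace, and keep the process running.
process.on('unhandledRejection', (reason) => {
  const message = reason instanceof Error ? reason.message : String(reason)
  console.error(`Unhandled rejection: ${message}`)
})
```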

I believe the proper fix is to implement error handling via a catch for every call to request(), since it is certainly possible that any given request may actually fail. However, in ring-client-api alone there are around 70 callers of request(), and certainly some library users will call request() directly for some things as well (I have at least a half-dozen such calls in ring-mqtt).
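As a sketch, a per-caller catch would look something like this; getLocationMode, the fallback value, and the reuse of the request() sketch above are all hypothetical, for illustration only:

```ts
// Hypothetical caller of the request() sketch above, handling its own failure
// instead of letting the rejection bubble up unhandled.
async function getLocationMode(locationId: string) {
  try {
    return await request(`https://app.ring.com/api/v1/mode/location/${locationId}`)
  } catch (e) {
    console.error(`Failed to fetch mode for location ${locationId}:`, e)
    return undefined // the caller decides on a sensible fallback
  }
}
```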

I don't have time to try to fix all of these right now, so that leaves a few alternatives. The working assumption of most of the code has always been that a call to request() will retry automatically until it succeeds, so my lean is to handle all 500-level responses the same as a gateway timeout.

This still leaves 400 errors. In general I don't think "bad request" errors should be retried, so I'm not sure what to do with them. Maybe just log them but don't throw the error? I don't think that behavior would be worse than what the code did in Node v14 and earlier, when it generated promise rejections but the node process didn't exit.
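Sketched out, the combined behavior floated in the previous two paragraphs might look like this inside the shared helper. The requestWithRetry name, the 20-second delay, and the use of fetch are assumptions for illustration, not the real implementation:

```ts
// Sketch of the proposed handling: retry any 5xx like the existing gateway
// timeout case, and log 4xx responses without throwing, so a bad response
// can never become an unhandled rejection that crashes the host process.
const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))

async function requestWithRetry(url: string): Promise<unknown | undefined> {
  for (;;) {
    const response = await fetch(url)
    if (response.ok) {
      return response.json()
    }
    if (response.status >= 500) {
      console.error(
        `Request to ${url} failed with status ${response.status}. Retrying in 20 seconds`,
      )
      await delay(20000)
      continue
    }
    // 4xx and anything else unexpected: log but do not retry or throw
    console.error(`Request to ${url} failed with status ${response.status}. Not retrying`)
    return undefined
  }
}
```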

Fully open to other suggestions; is there some cleaner option I am missing?

tbaur commented 1 year ago

Yeah, it's your call whether and how to solve it, and I respect that. As a general rule, I don't feel that upstream errors should crash an entire service, and when it happens repeatedly here you also start hitting upstream API lockouts on other services (via other plugins). I addressed it by moving the Ring plugin to a child bridge, so it can now crash all it wants without impacting the broader service. Thanks for looking into it more!

tsightler commented 1 year ago

To be fair, it's more because Node changed behavior that it crashes now: unhandled rejections exit the process by default starting in Node 15. You can always run node with --unhandled-rejections=warn, which was the behavior when the code was originally written many years ago.

Not that I'm disagreeing with you, but I just don't have time to go hunt down every consumer and put in proper error handling. I'm currently testing out not throwing the error at all and just logging it instead.

tbaur commented 1 year ago

Makes sense! Thanks.
