Closed jinpingh closed 5 years ago
If we're talking about failures that could potentially be recovered from by connection retries, or retries with backoff, we could consider using that instead.
We have this behavior between other microservices of ST2, so it would be nice to have the same behavior in st2chatops. But code-wise, it would be frustrating to keep the semantics of "retry with backoff" in st2chatops up-to-date with how it works in the rest of ST2.
But if nothing helps, always exit with an error message and a non-zero exit code.
This is the easiest thing for us to do. At this point, I think this behavior is a good compromise between trying to behave like the other ST2 services and the increased complexity of this codebase/PR.
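The "exit with an error message and a non-zero exit code" behavior could be sketched roughly like this. The helper name `logErrorAndExit` is mentioned later in this thread, but the signature here (including the injectable `exit` parameter, which exists only so the sketch can run safely) is an assumption, not the actual PR code:

```javascript
// Sketch of a single graceful-shutdown helper (names/signature assumed).
// In the real code this would simply call process.exit(1) after logging.
function logErrorAndExit(logger, err, exit = process.exit) {
  logger.error('st2chatops: unrecoverable error: ' +
    (err && err.message ? err.message : err));
  // Non-zero exit code so k8s / the process manager knows to restart us.
  exit(1);
}

// Run the sketch with an injected exit function instead of process.exit:
let exitCode = null;
logErrorAndExit(console, new Error('cannot reach st2 API'),
  function (code) { exitCode = code; });
```

Funneling every unrecoverable failure through one helper like this is what keeps the shutdown path singular instead of scattering `process.exit` calls around the codebase.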
Should close #77 if st2chatops gets automatically restarted, which I think it will in a kubernetes environment.
Thanks @jinpingh and @blag for jumping in! I'll test the packages from this PR in K8s/Docker environment once again to verify it and will get back to you.
@armab
we significantly changed the logic here
Yep.
and overengineered it
I disagree. The part of this PR that is fixing the issue is the proper way to implement this change according to hubot's documentation on error handling. The code churn that has happened as a result is necessary to ensure that places where connection issues could happen will all cause hubot to shutdown. Additionally, we only wanted to have one way to gracefully shutdown instead of spaghetti code shutting it down in multiple places.
removing existing self-healing mechanisms
Yep. In a k8s environment, gracefully exiting and relying on k8s or the process manager to restart the process from scratch is a self-healing mechanism. ChatOps is not part of the rest of ST2 microservices, and it has to re-authenticate when it starts up.
to re-load and recover hubot with ST2_COMMANDS_RELOAD_INTERVAL setting.
As long as the connection to StackStorm is fine, this behavior is unchanged: st2chatops will continue to reload commands on a configurable interval. If it cannot connect to StackStorm, the process will die, hopefully be respawned, and the new instance will connect again.
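That unchanged behavior can be pictured as a periodic loop driven by ST2_COMMANDS_RELOAD_INTERVAL. Everything below is an illustrative assumption (the interval parsing, the 120-second default matching the "every 2 mins" mentioned later in this thread, and the shape of `loadCommands`), not the actual st2chatops code:

```javascript
// Assumed sketch: reload commands every N seconds; if StackStorm is
// unreachable, exit non-zero and let the environment respawn us.
function parseReloadInterval(env) {
  const seconds = parseInt(env.ST2_COMMANDS_RELOAD_INTERVAL, 10);
  // 120s default is an assumption based on the "every 2 mins" in this thread.
  return (Number.isFinite(seconds) && seconds > 0 ? seconds : 120) * 1000;
}

// loadCommands is assumed to return a Promise that rejects when the
// StackStorm API cannot be reached.
function startReloadLoop(loadCommands, env, exit = process.exit) {
  return setInterval(function () {
    loadCommands().catch(function (err) {
      console.error('st2chatops: reload failed:', err.message);
      exit(1); // die, get respawned, reconnect from scratch
    });
  }, parseReloadInterval(env));
}
```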
The other option, which is what we originally had, was to make the shutdown/respawn/reconnect behavior configurable via an environment variable, but you didn't like that solution either, and that's why we refactored it to this behavior.
I really meant to exit the app on unrecoverable failures (when reload stopped working and hubot was just hanging in a no-op state) by adding some kind of handler for it.
Yes, that's precisely what this PR implements, in a modern, proper manner, following Hubot documentation, including tests, and a minimum amount of refactoring to ensure that errors are not swallowed and that they do propagate up to the shutdown-on-error logic.
If it cannot connect to StackStorm, the process will die, hopefully be respawned, and the new instance will connect again.
Previously, if hubot couldn't connect to st2 the process would continue to re-connect in a loop. That's the correct behavior as networking can go up/down.
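That reconnect-in-a-loop behavior, combined with the retry-with-backoff idea from earlier in the thread, might look like the sketch below. The delays and cap are illustrative numbers, not values from st2chatops:

```javascript
// Exponential backoff delay with a cap -- illustrative numbers only.
function backoffDelayMs(attempt, baseMs = 500, capMs = 30000) {
  return Math.min(baseMs * Math.pow(2, attempt), capMs);
}

// Keep retrying a connect function until it succeeds. Networking can go
// up and down, so there is deliberately no attempt limit here.
async function connectForever(connect) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await connect();
    } catch (err) {
      await new Promise(function (resolve) {
        setTimeout(resolve, backoffDelayMs(attempt));
      });
    }
  }
}
```

The trade-off the thread lands on is exactly this: a loop like `connectForever` keeps the process alive through network flaps, while exiting non-zero delegates the same resilience to k8s or the process manager.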
This PR changes the original behavior. It adds logErrorAndExit in places where, as far as I can see, it wasn't ideal to log & exit, eliminating the existing re-load capability that happens on ST2_COMMANDS_RELOAD_INTERVAL.
@armab Sorry, I wasn't clear on what you wanted earlier. I have fixed the code where you have pointed it out. Please re-review.
This code needs to handle reconnecting better, but that is outside the scope of this PR.
I ran the test case posted in StackStorm/st2chatops#124 (comment). With yesterday's fix, for an invalid ST2_API_URL, ST2_STREAM_URL, ST2_AUTH_TOKEN, or ST2_API_KEY, st2chatops logs an error message and keeps waiting. The error messages for ST2_STREAM_URL, ST2_AUTH_TOKEN, and ST2_API_KEY are the same. Is this the result that we wanted?
@jinpingh To understand exactly what you mean, can you please paste the commands to reproduce? I'm also looking at the test cases in this PR, which helped everyone to be on the same page.
Also, "keep waiting" turns out to mean different things. In the reported https://github.com/StackStorm/st2chatops/issues/124 the original issue was our process hanging in an unrecoverable, waiting state. If the reload loop still runs (every 2 mins), that's a different state.
@armab The way to run the test cases is to modify the /opt/stackstorm/chatops/st2chatops.env file with StackStorm settings, then run /opt/stackstorm/chatops/bin/hubot.
I think st2chatops should at least exit for invalid ST2 configurations. If ST2Stream goes up and down for other reasons, then st2chatops should make a best effort to recover, not just quit.
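That distinction, exit on invalid configuration, retry when the stream merely flaps, could be expressed with a classifier like this. The error shapes (HTTP status codes, Node error codes) are assumptions for illustration; real st2client.js errors may carry different fields:

```javascript
// Illustrative only: which failures are unrecoverable (bad config)
// versus transient (network). Field names are assumed, not taken from
// the actual st2client.js error objects.
function isFatalConfigError(err) {
  // 401/403 suggest a bad ST2_AUTH_TOKEN / ST2_API_KEY;
  // a malformed URL suggests a bad ST2_API_URL / ST2_STREAM_URL.
  if (err.statusCode === 401 || err.statusCode === 403) return true;
  if (err.code === 'ERR_INVALID_URL') return true;
  return false;
}

function isTransientError(err) {
  // Connection refused / reset / timeout / DNS hiccup: the service
  // may come back, so these are worth retrying.
  return ['ECONNREFUSED', 'ECONNRESET', 'ETIMEDOUT', 'ENOTFOUND']
    .includes(err.code);
}
```

A handler could then exit non-zero for `isFatalConfigError` cases and fall back to the reconnect path for `isTransientError` cases.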
@armab Sorry to keep pulling you back to this, but I figured out how to test this all, and I think it's finally in a mergeable state!
Coverage from current master:
------------------------|----------|----------|----------|----------|-------------------|
File | % Stmts | % Branch | % Funcs | % Lines | Uncovered Line #s |
------------------------|----------|----------|----------|----------|-------------------|
All files | 82.73 | 79.59 | 83.76 | 82.24 | |
lib | 96.88 | 96.68 | 97.75 | 96.75 | |
command_factory.js | 100 | 100 | 100 | 100 | |
format_command.js | 100 | 100 | 100 | 100 | |
format_data.js | 100 | 100 | 100 | 100 | |
post_data.js | 100 | 100 | 100 | 100 | |
slack-messages.js | 88.89 | 81.58 | 91.67 | 87.64 |... 34,35,37,38,43 |
slack_monkey_patch.js | 28.57 | 80 | 50 | 28.57 | 17,18,19,28,29 |
utils.js | 100 | 100 | 100 | 100 | |
scripts | 41.81 | 37.11 | 39.29 | 41.81 | |
stackstorm.js | 41.81 | 37.11 | 39.29 | 41.81 |... 04,405,406,431 |
------------------------|----------|----------|----------|----------|-------------------|
Coverage from this branch:
------------------------|----------|----------|----------|----------|-------------------|
File | % Stmts | % Branch | % Funcs | % Lines | Uncovered Line #s |
------------------------|----------|----------|----------|----------|-------------------|
All files | 87.9 | 86.34 | 86.67 | 87.56 | |
lib | 99 | 99.57 | 98.86 | 98.96 | |
command_factory.js | 100 | 100 | 100 | 100 | |
format_command.js | 100 | 100 | 100 | 100 | |
format_data.js | 100 | 100 | 100 | 100 | |
post_data.js | 100 | 100 | 100 | 100 | |
slack-messages.js | 100 | 100 | 100 | 100 | |
slack_monkey_patch.js | 28.57 | 80 | 50 | 28.57 | 17,18,19,40,41 |
utils.js | 100 | 100 | 100 | 100 | |
scripts | 59.28 | 58.56 | 53.13 | 59.28 | |
stackstorm.js | 59.28 | 58.56 | 53.13 | 59.28 |... 96,425,461,462 |
------------------------|----------|----------|----------|----------|-------------------|
👍
Fixes https://github.com/StackStorm/st2chatops/issues/124. st2chatops now uses robot.error to capture uncaught errors from hubot. Closes stackstorm/st2tests#170