echoboomer / incident-bot

The Open Source Incident Management Framework
MIT License

RCA confluence failure to create #326

Closed LanceSandino closed 11 months ago

LanceSandino commented 1 year ago

We've run into this several times recently, seemingly at random, and haven't been able to determine the cause. If I had to guess, it's related to this other issue, since it might be happening when images are pinned.

incident-bot 09-08 15:30:03 INFO:incident.actions:Creating rca channel: inc-20238241622-user-replay-deleted-companyusers-that-should-still-exist-rca
incident-bot 09-08 15:30:04 INFO:slack.client:User already in channel or is one of ['api', 'web']. Skipping invite.
incident-bot 09-08 15:30:04 INFO:confluence:Creating RCA 2023-09-08 - inc-20238241622-user-replay-deleted-companyusers-that-should-still-exist - User Replay Deleted Companyusers That Should Still Exist in Confluence space IR under parent 2023...
incident-bot 09-08 15:30:09 INFO:atlassian.confluence:Creating page "IR" -> "2023-09-08 - inc-20238241622-user-replay-deleted-companyusers-that-should-still-exist - User Replay Deleted Companyusers That Should Still Exist"
incident-bot 09-08 15:30:09 ERROR:confluence:com.atlassian.confluence.api.service.exceptions.BadRequestException: Error parsing xhtml: Unexpected character '@' (code 64) in content after '<' (malformed start element?).
incident-bot 09-08 15:30:09  at [row,col {unknown-source}]: [167,73]
incident-bot 09-08 15:30:09 ERROR:incident.actions:Error sending RCA update to RCA channel: The request to the Slack API failed. (url: https://www.slack.com/api/chat.postMessage)
incident-bot 09-08 15:30:09 The server responded with: {'ok': False, 'error': 'invalid_blocks', 'errors': ['must provide a string [json-pointer:/blocks/4/elements/0/url]'], 'response_metadata': {'messages': ['[ERROR] must provide a string [json-pointer:/blocks/4/elements/0/url]']}}
incident-bot 09-08 15:30:09 INFO:incident.actions:Sent resolution info to inc-20238241622-user-replay-deleted-companyusers-that-should-still-exist.
incident-bot 09-08 15:30:10 INFO:incident.actions:Updating incident record in database with new status for inc-20238241622-user-replay-deleted-companyusers-that-should-still-exist
incident-bot 09-08 15:30:11 INFO:incident.actions:Updated incident status for inc-20238241622-user-replay-deleted-companyusers-that-should-still-exist to resolved.
incident-bot 09-08 15:31:23 INFO:incident.actions:Sending chat transcript to inc-20238241622-user-replay-deleted-companyusers-that-should-still-exist.
incident-bot 09-08 15:31:30 WARNING:slack_bolt.App:Unhandled request ({'type': 'block_actions', 'block_id': 'resolution_buttons', 'action_id': 'n8u'})
incident-bot 09-08 15:31:30 ---
incident-bot 09-08 15:31:30 [Suggestion] You can handle this type of event with the following listener function:
incident-bot 09-08 15:31:30
incident-bot 09-08 15:31:30 @app.action("n8u")
incident-bot 09-08 15:31:30 def handle_some_action(ack, body, logger):
incident-bot 09-08 15:31:30     ack()
incident-bot 09-08 15:31:30     logger.info(body)
incident-bot 09-08 15:31:30
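
One plausible reading of the log above: the Confluence page was never created, so the RCA link was empty, and Slack rejected the message because the button's `url` field was not a string (`/blocks/4/elements/0/url`). The sketch below shows one way to guard against that. `build_rca_blocks` and its block layout are hypothetical and are not taken from the incident-bot source; only the Block Kit field names (`type`, `text`, `elements`, `url`) are real.

```python
from typing import Any, Optional


def build_rca_blocks(channel_name: str, rca_url: Optional[str]) -> list[dict[str, Any]]:
    """Build Block Kit blocks for the RCA channel message (hypothetical helper)."""
    blocks: list[dict[str, Any]] = [
        {
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"RCA channel created: #{channel_name}"},
        }
    ]
    if isinstance(rca_url, str) and rca_url:
        # Only add the link button when there is a real URL to point at.
        blocks.append(
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "View RCA"},
                        "url": rca_url,
                    }
                ],
            }
        )
    else:
        # Fall back to plain text so chat.postMessage still succeeds.
        blocks.append(
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": "The RCA document could not be created automatically.",
                },
            }
        )
    return blocks
```

With a guard like this, a failed page creation would still produce a valid message, and the channel would be told that the document is missing instead of the post failing with `invalid_blocks`.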
LanceSandino commented 1 year ago

I tried to "Resolve" the incident again and then I got the following error:

incident-bot 09-08 15:56:35 ERROR:incident.actions:Error creating rca channel: The request to the Slack API failed. (url: https://www.slack.com/api/conversations.create)
incident-bot 09-08 15:56:35 The server responded with: {'ok': False, 'error': 'name_taken'}
incident-bot 09-08 15:56:35 ERROR:slack_bolt.App:Error: cannot access local variable 'rca_channel' where it is not associated with a value
incident-bot 09-08 15:56:35 Traceback (most recent call last):
incident-bot 09-08 15:56:35   File "/usr/local/lib/python3.11/site-packages/slack_bolt/listener/thread_runner.py", line 120, in run_ack_function_asynchronously
incident-bot 09-08 15:56:35     listener.run_ack_function(request=request, response=response)
incident-bot 09-08 15:56:35   File "/usr/local/lib/python3.11/site-packages/slack_bolt/listener/custom_listener.py", line 50, in run_ack_function
incident-bot 09-08 15:56:35     return self.ack_function(
incident-bot 09-08 15:56:35            ^^^^^^^^^^^^^^^^^^
incident-bot 09-08 15:56:35   File "/incident-bot/bot/slack/handler.py", line 173, in handle_incident_set_status
incident-bot 09-08 15:56:35     asyncio.run(inc_actions.set_status(action_parameters=parse_action(body)))
incident-bot 09-08 15:56:35   File "/usr/local/lib/python3.11/asyncio/runners.py", line 190, in run
incident-bot 09-08 15:56:35     return runner.run(main)
incident-bot 09-08 15:56:35            ^^^^^^^^^^^^^^^^
incident-bot 09-08 15:56:35   File "/usr/local/lib/python3.11/asyncio/runners.py", line 118, in run
incident-bot 09-08 15:56:35     return self._loop.run_until_complete(task)
incident-bot 09-08 15:56:35            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
incident-bot 09-08 15:56:35   File "/usr/local/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
incident-bot 09-08 15:56:35     return future.result()
incident-bot 09-08 15:56:35            ^^^^^^^^^^^^^^^
incident-bot 09-08 15:56:35   File "/incident-bot/bot/incident/actions.py", line 380, in set_status
incident-bot 09-08 15:56:35     "id": rca_channel["channel"]["id"],
incident-bot 09-08 15:56:35           ^^^^^^^^^^^
incident-bot 09-08 15:56:35 UnboundLocalError: cannot access local variable 'rca_channel' where it is not associated with a value

I also just created a test incident, and the text and image I uploaded came through fine, so I'm not sure. I still think it may be related, though: all the RCAs that failed had images. Just thought I'd throw that out there.

Is there some sort of script I can run to force-create an RCA, or will it have to be done manually?
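
The `UnboundLocalError` in the traceback above is the usual result of assigning a variable inside a `try` block, logging the exception, and then reading the variable anyway. A minimal sketch of one way to avoid it follows. All function names here are hypothetical stand-ins and are not the actual code in `bot/incident/actions.py`; the `create` callable stands in for the Slack `conversations.create` call.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("incident.actions")


def create_rca_channel(
    create: Callable[[str], dict[str, Any]], name: str
) -> Optional[dict[str, Any]]:
    """Return the API response, or None if channel creation failed."""
    try:
        return create(name)
    except Exception as error:  # real code would catch the Slack SDK's error type
        logger.error("Error creating rca channel: %s", error)
        return None


def rca_channel_details(
    create: Callable[[str], dict[str, Any]], name: str
) -> Optional[dict[str, str]]:
    """Build the channel details dict, stopping early when creation failed."""
    rca_channel = create_rca_channel(create, name)
    if rca_channel is None:
        # Nothing to read from; the caller can decide how to report this.
        return None
    return {
        "id": rca_channel["channel"]["id"],
        "name": rca_channel["channel"]["name"],
    }


# Stand-ins for the Slack call, for illustration only.
def fake_create_ok(name: str) -> dict[str, Any]:
    return {"ok": True, "channel": {"id": "C123", "name": name}}


def fake_create_name_taken(name: str) -> dict[str, Any]:
    raise RuntimeError("name_taken")
```

Here `rca_channel_details(fake_create_name_taken, "inc-1-rca")` returns `None` instead of raising, so a second "Resolve" attempt on an incident whose RCA channel already exists would fail cleanly.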

LanceSandino commented 1 year ago

Had this issue again today on a new incident:

incident-bot 09-11 12:43:53 WARNING:atlassian.jira:Creating issue "CMS returning 500 error when attempting to sign in"

incident-bot 09-11 12:53:31 INFO:incident.actions:Creating rca channel: inc-20239111555-cms-throwing-500-errors-rca
incident-bot 09-11 12:53:31 INFO:slack.client:User already in channel or is one of ['api', 'web']. Skipping invite.
incident-bot 09-11 12:53:32 INFO:confluence:Creating RCA 2023-09-11 - inc-20239111555-cms-throwing-500-errors - Cms Throwing 500 Errors in Confluence space IR under parent 2023...
incident-bot 09-11 12:53:36 INFO:atlassian.confluence:Creating page "IR" -> "2023-09-11 - inc-20239111555-cms-throwing-500-errors - Cms Throwing 500 Errors"
incident-bot 09-11 12:53:36 ERROR:confluence:com.atlassian.confluence.api.service.exceptions.BadRequestException: Error parsing xhtml: Unexpected character '/' (code 47) (expected a name start character)
incident-bot 09-11 12:53:36  at [row,col {unknown-source}]: [177,8]
incident-bot 09-11 12:53:37 ERROR:incident.actions:Error sending RCA update to RCA channel: The request to the Slack API failed. (url: https://www.slack.com/api/chat.postMessage)
incident-bot 09-11 12:53:37 The server responded with: {'ok': False, 'error': 'invalid_blocks', 'errors': ['must provide a string [json-pointer:/blocks/4/elements/0/url]'], 'response_metadata': {'messages': ['[ERROR] must provide a string [json-pointer:/blocks/4/elements/0/url]']}}
incident-bot 09-11 12:53:37 INFO:incident.actions:Sent resolution info to inc-20239111555-cms-throwing-500-errors.
incident-bot 09-11 12:53:38 INFO:incident.actions:Updating incident record in database with new status for inc-20239111555-cms-throwing-500-errors
incident-bot 09-11 12:53:38 INFO:incident.actions:Updated incident status for inc-20239111555-cms-throwing-500-errors to resolved.
incident-bot 09-11 12:53:43 INFO:incident.actions:Sending chat transcript to inc-20239111555-cms-throwing-500-errors.
incident-bot 09-11 12:53:43 /usr/local/lib/python3.11/site-packages/slack_sdk/web/client.py:3074: UserWarning: Although the channels parameter is still supported for smooth migration from legacy files.upload, we recommend using the new channel parameter with a single str value instead for more clarity.
incident-bot 09-11 12:53:43   warnings.warn(
incident-bot 09-11 12:53:43 /usr/local/lib/python3.11/site-packages/slack_sdk/web/client.py:3086: UserWarning: The filetype parameter is no longer supported. Please remove it from the arguments.
incident-bot 09-11 12:53:43   warnings.warn("The filetype parameter is no longer supported. Please remove it from the arguments.")
incident-bot 09-11 12:53:53 WARNING:slack_bolt.App:Unhandled request ({'type': 'block_actions', 'block_id': 'resolution_buttons', 'action_id': '2lBn'})
incident-bot 09-11 12:53:53 ---
incident-bot 09-11 12:53:53 [Suggestion] You can handle this type of event with the following listener function:
incident-bot 09-11 12:53:53
incident-bot 09-11 12:53:53 @app.action("2lBn")
incident-bot 09-11 12:53:53 def handle_some_action(ack, body, logger):
incident-bot 09-11 12:53:53     ack()
incident-bot 09-11 12:53:53     logger.info(body)
incident-bot 09-11 12:53:53

No images this time. The pinned messages were all links with link previews (Datadog logs, to be specific) and quoted text blocks.

ImDevinC commented 1 year ago

Can you enable debug logging by setting the environment variable LOGLEVEL to DEBUG and then reproduce and post logs?
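
For context, a Python application typically honours a variable like `LOGLEVEL` along these lines. This is a generic sketch, not the bot's actual startup code; `configure_logging` is a hypothetical name, and the format string only imitates the log lines in this thread.

```python
import logging
import os


def configure_logging(default: str = "INFO") -> int:
    """Configure root logging from the LOGLEVEL environment variable."""
    name = os.environ.get("LOGLEVEL", default).upper()
    level = getattr(logging, name, None)
    if not isinstance(level, int):
        # Unknown level name: fall back to INFO rather than failing at startup.
        level = logging.INFO
    logging.basicConfig(
        level=level,
        format="%(levelname)s:%(name)s:%(message)s",
        force=True,  # replace any handlers configured earlier
    )
    return level
```

With `LOGLEVEL=DEBUG` set, third-party loggers under the root logger also emit debug output, which is how messages such as `DEBUG:atlassian.rest_client` and `DEBUG:urllib3.connectionpool` end up in the log.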

LanceSandino commented 1 year ago

Umm… well, I haven’t been able to reproduce it yet… and I’d have to turn off Datadog logging first so it doesn’t get spammed with debug logs.

I’ll look into doing that today 😀

LanceSandino commented 11 months ago

Okay, I was able to get debug logs. Can I share them with you or @echoboomer privately? They may contain some sensitive data that I can't really remove, since it's likely the cause 😅

LanceSandino commented 11 months ago

AH!!! I think I found it... It's caused by tagging someone in a message and then pinning it...

The mention shows up as <@AC123>, and Confluence is likely treating it as invalid HTML:

incident-bot 09-20 12:17:24 INFO:atlassian.confluence:Creating page "IncBotDev" -> "2023-09-20 - inc-20239201616-test-rca-username - Test Rca Username"
incident-bot 09-20 12:17:24 DEBUG:atlassian.rest_client:curl --silent -X POST -H 'Content-Type: application/json' -H 'Accept: application/json' --data '"{\"type\": \"page\", \"title\": \"2023-09-20 - inc-20239201616-test-rca-username - Test Rca Username\", \"space\": {\"key\": \"IncBotDev\"}, \"body\": {\"storage\": {\"value\": \"\\n<table data-layout=\\\"default\\\" ac:local-id=\\\"32625732-f824-4919-afd4-4492029881c4\\\">\\n  <colgroup>\\n    <col style=\\\"width: 340.0px;\\\" />\\n    <col style=\\\"width: 340.0px;\\\" />\\n  </colgroup>\\n  <tbody>\\n    <tr>\\n      <td data-highlight-colour=\\\"#f4f5f7\\\">\\n        <p><strong>Role</strong></p>\\n      </td>\\n      <td data-highlight-colour=\\\"#f4f5f7\\\">\\n        <p><strong>Participants</strong></p>\\n      </td>\\n    </tr>\\n    <tr>\\n      <td>\\n        <p>Incident Commander</p>\\n      </td>\\n      <td>\\n        Lance Sandino\\n      </td>\\n    </tr>\\n    <tr>\\n      <td>\\n        <p>Contributors</p>\\n      </td>\\n      <td>\\n        <p>Tag other people that participated in the resolution of the incident here.</p>\\n      </td>\\n    </tr>\\n  </tbody>\\n</table>\\n\\n<h2>Summary</h2>\\n\\n<ac:structured-macro ac:name=\\\"info\\\" ac:schema-version=\\\"1\\\" ac:macro-id=\\\"42e3bea3-c2d1-4c7e-a040-0dfae2139367\\\">\\n  <ac:rich-text-body>\\n    <p>This incident was classified as a <b>sev4</b> incident.</p>\\n    <p>Incident impacting one customer: Single user - would be that 1 person would be impacted</p>\\n  </ac:rich-text-body>\\n</ac:structured-macro>\\n\\n<ac:structured-macro ac:name=\\\"note\\\" ac:schema-version=\\\"1\\\" ac:macro-id=\\\"04623358-bff8-4d5a-8a3c-3a7f5d23f394\\\">\\n  <ac:rich-text-body>\\n    <p>A summary of the impact of this incident should go here.</p>\\n  </ac:rich-text-body>\\n</ac:structured-macro>\\n\\n<h2>User Impact</h2>\\n<ac:structured-macro ac:name=\\\"note\\\" ac:schema-version=\\\"1\\\" ac:macro-id=\\\"7eaa42fd-7637-4cf8-ab27-dc2294299535\\\">\\n  
<ac:rich-text-body>\\n    <p>Describe how this incident affected users. Summarize answers to these two questions:</p>\\n    <ul>\\n      <li>\\n        <p>Was the service from the point of view of the user running in a degraded state?</p>\\n      </li>\\n      <li>\\n        <p>What else?</p>\\n      </li>\\n    </ul>\\n    <p>Full details can be added to the incident description.</p>\\n  </ac:rich-text-body>\\n</ac:structured-macro>\\n\\n<h1>Timeline</h1>\\n<table data-layout=\\\"default\\\" ac:local-id=\\\"2e552c4f-e702-4ac8-aab9-5c60497123ba\\\">\\n  <colgroup>\\n    <col style=\\\"width: 340.0px;\\\" />\\n    <col style=\\\"width: 340.0px;\\\" />\\n  </colgroup>\\n  <tbody>\\n    <tr>\\n      <td data-highlight-colour=\\\"#f4f5f7\\\">\\n        <p><strong>Time</strong></p>\\n      </td>\\n      <td data-highlight-colour=\\\"#f4f5f7\\\">\\n        <p><strong>Event</strong></p>\\n      </td>\\n    </tr>\\n    \\n    <tr>\\n        <td>\\n            <p>2023-09-20T16:16:52 UTC</p>\\n        </td>\\n        <td>\\n            <p>Incident created.</p>\\n        </td>\\n    </tr>\\n    \\n    <tr>\\n        <td>\\n            <p>2023-09-20T16:17:18 UTC</p>\\n        </td>\\n        <td>\\n            <p>Status was changed to resolved.</p>\\n        </td>\\n    </tr>\\n    \\n    <tr>\\n        <td>\\n            <p>2023-09-20T16:17:18 UTC</p>\\n        </td>\\n        <td>\\n            <p>RCA channel was created.</p>\\n        </td>\\n    </tr>\\n    \\n    <tr>\\n        <td>\\n            <p>&hellip;</p>\\n        </td>\\n        <td>\\n            <p>&hellip;</p>\\n        </td>\\n    </tr>\\n    \\n  </tbody>\\n</table>\\n\\n<h1>Incident Description</h1>\\n<ac:structured-macro ac:name=\\\"note\\\" ac:schema-version=\\\"1\\\" ac:macro-id=\\\"4671536b-efcb-49a9-a06a-373c3ac054d0\\\">\\n  <ac:rich-text-body>\\n    <p>Longer description of the problem with screenshots/links to help readers understand the entire incident.</p>\\n  
</ac:rich-text-body>\\n</ac:structured-macro>\\n\\n<h1>Root Cause</h1>\\n<ac:structured-macro ac:name=\\\"note\\\" ac:schema-version=\\\"1\\\" ac:macro-id=\\\"ccc56469-8954-4844-a678-dd4e77a9d285\\\">\\n  <ac:rich-text-body>\\n    <p>Explain the root cause of the issue.</p>\\n  </ac:rich-text-body>\\n</ac:structured-macro>\\n\\n<h1>Actions</h1>\\n\\n<h2>Immediate Actions</h2>\\n<ac:structured-macro ac:name=\\\"note\\\" ac:schema-version=\\\"1\\\" ac:macro-id=\\\"794384f4-c6e5-4222-bc81-1ca208e70044\\\">\\n  <ac:rich-text-body>\\n    <p>Actions to mitigate the impact of the incident directly following declaration should be listed here.</p>\\n  </ac:rich-text-body>\\n</ac:structured-macro>\\n\\n<h2>Preventive Actions</h2>\\n<ac:structured-macro ac:name=\\\"note\\\" ac:schema-version=\\\"1\\\" ac:macro-id=\\\"71e51f0d-3d85-4201-ae3c-1324c6a02be8\\\">\\n  <ac:rich-text-body>\\n    <p>What can be implemented to avoid this condition in the future?</p>\\n  </ac:rich-text-body>\\n</ac:structured-macro>\\n\\n<h1>Pinned Messages</h1>\\n<ac:structured-macro ac:name=\\\"info\\\" ac:schema-version=\\\"1\\\" ac:macro-id=\\\"51c41231-da8e-49ef-a512-097b28cd2bd3\\\">\\n  <ac:rich-text-body>\\n    <p>These messages were pinned during the incident by users in Slack.</p>\\n    <p>This information is useful for establishing the incident timeline and providing diagnostic data.</p>\\n  </ac:rich-text-body>\\n</ac:structured-macro>\\n<blockquote><p><strong>Lance Sandino @ 20/09/2023 16:17:12 UTC - </strong> hello testing <@U01BN20PPEG></p></blockquote><p /><blockquote><p><strong>Lance Sandino @ 20/09/2023 16:17:14 UTC - </strong> this is a test of pinging someone and tagging 2 <@U04SDQ1AQKH></p></blockquote><p />\\n<ac:structured-macro ac:name=\\\"attachments\\\" ac:schema-version=\\\"1\\\" data-layout=\\\"wide\\\"\\n  ac:local-id=\\\"<REDACTED>\\\" ac:macro-id=\\\"<REDACTED>\\\" />\\n\", \"representation\": \"storage\"}}, \"metadata\": {\"properties\": {\"editor\": {\"value\": 
\"v2\"}}}, \"ancestors\": [{\"type\": \"page\", \"id\": \"8688304444\"}]}"' 'https://<REDACTED>.atlassian.net/wiki/rest/api/content'
incident-bot 09-20 12:17:24 DEBUG:urllib3.connectionpool:https://<REDACTED>.atlassian.net:443 "POST /wiki/rest/api/content HTTP/1.1" 400 None
incident-bot 09-20 12:17:24 DEBUG:atlassian.rest_client:HTTP: POST rest/api/content/ -> 400 Bad Request
incident-bot 09-20 12:17:24 DEBUG:atlassian.rest_client:HTTP: Response text -> {"statusCode":400,"data":{"authorized":true,"valid":true,"errors":[],"successful":true},"message":"com.atlassian.confluence.api.service.exceptions.BadRequestException: Error parsing xhtml: Unexpected character '@' (code 64) in content after '<' (malformed start element?).\n at [row,col {unknown-source}]: [158,91]"}
incident-bot 09-20 12:17:24 ERROR:confluence:com.atlassian.confluence.api.service.exceptions.BadRequestException: Error parsing xhtml: Unexpected character '@' (code 64) in content after '<' (malformed start element?).
incident-bot 09-20 12:17:24  at [row,col {unknown-source}]: [158,91]
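
The request body above contains raw Slack mentions such as `<@U01BN20PPEG>` inside the storage-format XHTML, which the parser reads as a malformed start tag. One way to avoid this, sketched below, is to escape message text before embedding it. This is an illustration only and is not necessarily how the project fixed it; `pinned_message_to_storage` and `is_well_formed` are hypothetical helpers, and the markup mirrors the `<blockquote>` lines in the log.

```python
import html
import xml.etree.ElementTree as ET


def pinned_message_to_storage(author: str, timestamp: str, text: str) -> str:
    """Render one pinned message as storage-format markup with escaped text."""
    # html.escape turns <, > and & into entities, so "<@U123>" becomes "&lt;@U123&gt;".
    safe_author = html.escape(author, quote=False)
    safe_text = html.escape(text, quote=False)
    return (
        f"<blockquote><p><strong>{safe_author} @ {timestamp} - </strong> "
        f"{safe_text}</p></blockquote><p />"
    )


def is_well_formed(fragment: str) -> bool:
    """Check that a fragment parses as XML when wrapped in a single root."""
    try:
        ET.fromstring(f"<root>{fragment}</root>")
        return True
    except ET.ParseError:
        return False
```

Escaping keeps the mention visible as literal text in the page. Resolving user IDs to display names would be a separate step and is not covered here.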
echoboomer commented 11 months ago

@LanceSandino Just pushed out release v1.4.23 to address the RCA create issue. If you bump to this version, you can test both the pager fix and the RCA doc fix. Thanks again!

LanceSandino commented 11 months ago

Hi.

Thank you for this work! It seems to create the RCA now, but the text has no spaces.
