
RFD 124 discussion #86

Open · jordanhendricks opened this issue 6 years ago

jordanhendricks commented 6 years ago

This issue represents an opportunity for discussion of RFD 124 Manta Incident Response Guide while it remains in a pre-published state.

jmwski commented 6 years ago

Thanks for writing this up! I have a couple of questions/wishes(?).

> For problems of excessive latency in the data path, we have timers in webapi logs that can help track down the source of latency.

These are useful, but I would really like a way to sample this data from multiple muskie instances and filter by API operation as well as the shards that the request touches. I think it would be great to be able to quickly answer questions like:

  1. How many putobject requests hit shard X in the last time interval?
  2. How many putobject requests spent more than T ms streaming data to shark S?
  3. How much time did the last N uploadpart requests spend in 'saveMetadata'?
  4. Which timer had the largest value, on average, over the last N getobject requests?
  5. What does the 'enforceDirectoryCounts' time-series look like for the last N uploadpart operations?

If we could answer those questions quickly, we could also correlate the answers with the increased 500s we already have in Grafana, for example. I think engineers who have experience debugging incidents have commands and scripts they can use to quickly answer these questions, but that's not necessarily the case for everyone. Is our monitoring today capable of answering these questions? If not, could it be?
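To make this concrete, here is a minimal sketch of the kind of aggregation I have in mind, assuming the muskie audit records are newline-delimited bunyan-style JSON with an operation name under "route" and a per-request timers object; those field names are guesses, not the confirmed muskie schema. It answers a question like (4) above for a stream of log records:

```python
#!/usr/bin/env python3
"""Average each request timer over the last N records of a given operation.

Sketch only: assumes newline-delimited JSON audit records with a "route"
field naming the API operation and a timers object under "req.timers" (or
req -> timers). These field names are assumptions, not the verified schema.
"""
import json
import sys
from collections import defaultdict, deque


def timer_averages(log_stream, operation, n=1000):
    window = deque(maxlen=n)          # keep only the last n matching records
    for line in log_stream:
        try:
            rec = json.loads(line)
        except ValueError:
            continue                  # skip truncated or non-JSON lines
        if rec.get('route') != operation:
            continue
        timers = rec.get('req.timers') or (rec.get('req') or {}).get('timers') or {}
        window.append(timers)

    totals, counts = defaultdict(float), defaultdict(int)
    for timers in window:
        for name, value in timers.items():
            totals[name] += value
            counts[name] += 1
    return {name: totals[name] / counts[name] for name in totals}


if __name__ == '__main__':
    op = sys.argv[1] if len(sys.argv) > 1 else 'getobject'
    for name, avg in sorted(timer_averages(sys.stdin, op).items(),
                            key=lambda kv: -kv[1]):
        print('%-30s %12.1f' % (name, avg))
```

Fed a concatenation of logs from multiple muskie instances, the same loop could be extended with shard or shark filters to answer questions like (1) and (2).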

> We propose that the Investigation Coordinator report this every half hour while the incident is ongoing

Such an update, along with raw data, could be posted in ~incidents-spc or ~incidents-jpc, but it will quickly get lost and require scroll-back because those channels tend to be really busy during incidents. One could find oneself thrashing between what people are adding to the channel now and what was added five pages ago. Could this report also live in some fixed location (that could be linked to in the heading of the incident chatroom) that the incident manager is responsible for keeping up to date?

I think what I'm getting at is a general solution to the problem of distilling important hypotheses, symptoms, and data (gist links, links to screenshots that should be readily accessible to all responders) from incomplete developments that are still being formulated in chat. In this sense the chat would serve as a staging area for hypotheses, symptoms, and data that are interesting but that the team has not necessarily confirmed are correlated with the problem yet. Having a set of important results that all responders agree on can also help justify other investigations. It can also make it easy for new responders to harden those important results with supporting evidence, or question them with contradictory results (at which point they might be removed from the list).

jordanhendricks commented 6 years ago

@IanWyszynski: Thanks for taking a look and for your thoughtful feedback. Sorry for the response delay.

> These are useful, but I would really like a way to sample this data from multiple muskie instances and filter by API operation as well as the shards that the request touches. I think it would be great to be able to quickly answer questions like...

We've talked about this some offline. It sounds like you were working on a tool to answer these questions. We also discussed dragnet, which I haven't used much myself but seems to offer a lot of that functionality as well.

This is outside the scope of this RFD, but as a sibling project to it, @davepacheco and I have discussed creating a debugging guide for Manta, with a particular eye toward incident response. I think answering questions like the ones you list is definitely in scope for that document, though we would need to take care to make it general enough that we can keep it reasonably up to date, yet specific enough to be useful during incidents.

> Such an update, along with raw data, could be posted in ~incidents-spc or ~incidents-jpc, but it will quickly get lost and require scroll-back because those channels tend to be really busy during incidents. One could find oneself thrashing between what people are adding to the channel now and what was added five pages ago. Could this report also live in some fixed location (that could be linked to in the heading of the incident chatroom) that the incident manager is responsible for keeping up to date?

Yeah, this is a difficult thing to balance. The important part about the periodic summaries, as I see it, is to provide a baseline for everyone involved: new people joining the investigation, re-calibration for those already investigating, and meaningful information for people who need to notify customers about what's going on. Ideally, a new person joining an incident shouldn't need to do much scrollback beyond the last summary to understand what's going on and how they can jump in.

That said, I also don't think it's unreasonable to paste the 30-minute summaries into the relevant JIRA ticket so that anyone can go back and read a concise record of what's been investigated so far; I think that might address the scrollback issue you mention. Right now the open incidents are linked in the channel header, so the ticket, and thus the summaries, would be easy to find. Having the summaries distilled in a single place might also make it easier to reconstruct the investigation when doing post-mortems. The tradeoff is that updating the ticket along with chat is one more thing for the incident coordinator to remember to do, and it could easily be forgotten.

As a side note, I wanted to clarify that the role of "incident manager" already exists and is used at Joyent, and I believe is managed by someone in support. The role proposed in this RFD has a similar title, but a slightly different purpose.

princesspretzel commented 5 years ago

One thing that came up during the SRE book meeting for chapter 14, "Managing Incidents", was the idea of running monthly PagerDuty fire drills to make sure that everyone is receiving pages. @dekobon suggested a spreadsheet with people's names and the date of the page, where people check off that they heard the noise. I wonder if there might be a way to pull this information from the PagerDuty API: https://v2.developer.pagerduty.com/page/api-reference#!/Incidents/get_incidents_id (which might be useful during a real incident). @numericillustration suggested that we would need @psteinbachs to weigh in before proceeding, and then reach out to the NOC to run the drills.
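For what it's worth, a rough sketch of how the drill spreadsheet could be cross-checked against PagerDuty is below, assuming an API token and the drill incident's id. The endpoint and headers follow the v2 REST API conventions, but the log-entry type and field names are assumptions to verify against the API reference linked above, and pagination is ignored:

```python
#!/usr/bin/env python3
"""List who acknowledged a given PagerDuty incident.

Sketch only: the endpoint and headers follow the PagerDuty v2 REST API
conventions, but the "acknowledge_log_entry" type and the "agent" fields are
assumptions to be checked against the API reference.
"""
import sys
import requests

PAGERDUTY_API = 'https://api.pagerduty.com'


def who_acknowledged(incident_id, api_token):
    headers = {
        'Authorization': 'Token token=%s' % api_token,
        'Accept': 'application/vnd.pagerduty+json;version=2',
    }
    resp = requests.get('%s/incidents/%s/log_entries' % (PAGERDUTY_API, incident_id),
                        headers=headers)
    resp.raise_for_status()

    acked = set()
    for entry in resp.json().get('log_entries', []):
        if entry.get('type') == 'acknowledge_log_entry':
            agent = entry.get('agent') or {}
            acked.add(agent.get('summary', 'unknown'))
    return sorted(acked)


if __name__ == '__main__':
    for name in who_acknowledged(sys.argv[1], sys.argv[2]):
        print(name)
```

Comparing that list against the names in the spreadsheet would show who actually received and acknowledged the drill page without anyone having to self-report.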