PostHog / meta

This is a place to discuss non-product issues in public.

RFC: Should we auto-escalate tickets that near an SLA breach? #238

Closed joethreepwood closed 2 months ago

joethreepwood commented 3 months ago

Context

Yesterday, @raquelmsmith suggested we should consider auto-escalating tickets if an SLA breach nears. There's some light resistance to this, but also some agreement. This conversation arose from the fact that currently there is widespread confusion about how engineers should handle tickets, which views they should look at, whether they should look at unescalated tickets, etc. This should be mostly explained in the handbook, but it can be confusing -- an automatic solution could help cut through this.

I honestly can't make my mind up. So, it's time to think out loud.

What do SLAs look like currently?

SLAs are currently in a good place for most teams - with only feature success and product analytics being cause for concern over the last 30 days.

Screenshot 2024-08-22 at 10 04 15

Even then, product analytics has recently changed their support hero process to improve, and I've recently spoken with @Phanatic to uncover why FS had a high breach %. We've made some changes and are now testing an SLA notifier for the FS team which we're confident will get us to a better place.

Right now we have an 81% achievement rate for SLAs as a company. If we bring FS in line with other teams at a comparable volume this will rise to 85%. We're also hiring to help spread the load and improve further, plus thinking of some more out-there ideas to improve ticket volumes on the whole.

In short, SLAs are not currently an emergency - but we can always do better. Auto-escalations could help with that.

Reasons to consider this

Reasons not to consider this

Are there alternatives we could consider?

The root cause for this generally seems to be that engineers are not always clear on which tickets they should be looking at. This has led us to having a large number of Zendesk views and needing engineers to constantly juggle between (Open) and (Escalated) - as well as other views for edge cases, such as (Batch Exports), Your Unsolved Tickets, (On-hold), Open High Priority Customers, and so on.

We've offset some of this complexity by outlining prioritization in the handbook and by reinforcing the correct approach with training - but this also inflates the complexity a little. What if we tackled the root cause and simplified the way engineers look at tickets?

Two ways we could do this:

A final alternative would be to roll the SLA notifier we're testing with the FS team out to all teams. The advantage there is that it's more flexible for people and less reliant on process. The disadvantage is that it is noisy.

There are some +s and -s to the above alternatives too, but I think they should speak for themselves (and this is a long issue anyway). So, throwing it out for feedback as is.

MarconLP commented 3 months ago

> We could end up escalating tickets which don't actually need engineer involvement (60% of tickets are one-touch)

I feel like this is a pretty major reason against it. If we look back at the last 3 weeks (including this one / focussing on un-escalated product analytics tickets):

> The root cause for this generally seems to be that engineers are not always clear on which tickets they should be looking at.

This is probably what we should solve in the first place. Every team should have a single escalated queue for the support hero and an un-escalated queue for the support engineers.

> We go a step further and give every new hire an individual view which collects tickets for their groups, which would help address the situation where an engineer may spread over multiple groups (e.g. @benjackwhite has to be in CDP and Security). Everyone only has to look in one place and it's tailored to them.

I don't think this is a great idea. From a support engineer's view, I want to focus on one queue at a time, without having to context switch after each ticket.

joethreepwood commented 3 months ago

> I don't think this is a great idea. From a support engineer's view, I want to focus on one queue at a time, without having to context switch after each ticket.

In this case though we could have separate queues for support engineers? Although then we're basically taking a step back to the current situation so 🤷

abigailbramble commented 3 months ago

> What if we tackled the root cause and simplified the way engineers look at tickets?

I agree with this. The main issue raised from engineering was that it's unclear what they should be looking at. I think we should be looking to simplify the engineer's workflow as much as possible, so that they are less likely to miss important things and less likely to spend time context switching and searching around for what they need to be working on next. (related to your point on less shipping).

> We go a step further and give every new hire an individual view which collects tickets for their groups, which would help address the situation where an engineer may spread over multiple groups (e.g. @benjackwhite has to be in CDP and Security). Everyone only has to look in one place and it's tailored to them.

I don't know if this is related to my previous comment about "I think we could come up with a way to have a single view tailored to the agent", but this isn't exactly what I meant.

What I would suggest:

  1. We create skills in Zendesk which exactly match the product areas i.e. one skill is CDP, another is Security, another is Product Analytics etc.
  2. These skills are set on the tickets when they are created, and changed when the tickets are reassigned to a different group.
  3. We set the skills on the agent groups and add agents into those groups, or we set skills directly on agents.
  4. We create a single view for engineering. One. This view is a skills-based view which updates depending on the agent who is currently viewing it and what skills they have assigned to their profile. i.e. if I went to the view it would only show Product Analytics related things, but if @benjackwhite goes to that view, he sees both CDP and Security related things, but not Product Analytics things etc.
  5. We should discuss what the formula is for things that should appear on that view i.e. if they should only be escalated things, if there should also be all sla_warning_sent things, or if we create some other way of determining if engineers need to look at any tickets which aren't escalated.
  6. We create the means for any required groupings, so that the engineering view can be grouped and sorted and engineers can simply move to the next highest ticket on the view.
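To make the idea in steps 1 to 6 concrete, here is a minimal sketch in plain Python. It does not use the Zendesk API; the `Ticket` class, the `engineering_view` function and the sample tickets are all hypothetical, and the inclusion rule (escalated, or `sla_warning_sent`) is only one candidate for the formula step 5 leaves open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    id: int
    skill: str              # product area set on the ticket, e.g. "CDP"
    escalated: bool
    sla_warning_sent: bool


def engineering_view(tickets, agent_skills, include_sla_warnings=True):
    """Return the tickets one agent would see in the single shared view."""
    visible = []
    for ticket in tickets:
        # Step 4: only show product areas the viewing agent has skills for.
        if ticket.skill not in agent_skills:
            continue
        # Step 5 (assumed rule): escalated tickets, plus SLA warnings if enabled.
        if ticket.escalated or (include_sla_warnings and ticket.sla_warning_sent):
            visible.append(ticket)
    # Step 6: escalated tickets first, then by id as a stand-in for age.
    return sorted(visible, key=lambda t: (not t.escalated, t.id))


# Made-up sample data for illustration only.
tickets = [
    Ticket(1, "Product Analytics", escalated=True, sla_warning_sent=False),
    Ticket(2, "CDP", escalated=False, sla_warning_sent=True),
    Ticket(3, "Security", escalated=True, sla_warning_sent=False),
    Ticket(4, "CDP", escalated=False, sla_warning_sent=False),
]

# An agent with CDP and Security skills sees only those areas.
ben_view = engineering_view(tickets, {"CDP", "Security"})
```

The same function called with `{"Product Analytics"}` would return a different list from the same ticket pool, which is the point of having one view instead of many.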

We could then create a similar view for support where support agents are assigned to particular areas of the product.

Bear in mind I'm writing this off the top of my head based on what I've observed in Zendesk previously and the above will probably require some refinement and further design. But it does have the potential for there to be 2 views in Zendesk instead of 17. It also has the potential to make the workflows much cleaner and clearer.

Twixes commented 3 months ago

Problem?

> engineers are not always clear on which tickets they should be looking at

I don't think there's a lack of clarity about this, as the handbook's Support Hero outline makes it explicit: product teams should be looking at escalated tickets, and only those.

Screenshot 2024-08-22 at 16 39 03

Side-note: being in multiple teams

It's just Ben being in more than one engineering team at a time, and even then Ben is definitely not being Support Hero for two teams at once, so that doesn't seem like an edge case worth spending time on specifically.

As far as I understood @Phanatic and @pauldambra in the team leads meeting where this came up, they prefer going through all tickets, to have a broader view of issues with the product. So what's the actual problem we're trying to solve?

It seems like some engineering teams would like to own more of support than they get to currently… while others (the most ticket-heavy one) a bit less.

Solution?

Perhaps we should be more flexible and more explicit about what gets escalated per product area.

For early-stage products (like e.g. web analytics has been), full ownership of support is a go-to-market must, and we're already doing that. "Do things that don't scale."

Then, if it's feasible for the support heroes of Feature Success and Replay to own more of support, that's actually great for customers – e.g. perhaps Feature Success should ask Customer Success to only triage tickets for them.

But things that don't scale… don't scale: Product Analytics gets 4 times as many escalated tickets as the next team, and we can't be a team of 4 support heroes (that'd be the whole team, as Julian and Thomas are going on parental leave). Triaging and first-line support by Customer Comms is invaluable for us. In fact, we'd love for Customer Comms to be able to investigate and describe issues a bit deeper, with two benefits: 1. support engineers having a stronger understanding of the product, 2. product engineers being able to solve the root causes quicker thanks to reduced context-switching. (This of course relies on Customer Comms having the capacity too.)

Screenshot 2024-08-22 at 14 22 42

Everyone: would the flexibility I describe make sense for you, i.e. Customer Comms explicitly being more or less involved for some products than others?

joethreepwood commented 3 months ago

> we'd love for Customer Comms to be able to investigate and describe issues a bit deeper, with two benefits: 1. support engineers having a stronger understanding of the product

I'm out next week, so I'll leave this point with the rest of you but it would be great if we could take steps here to get closer to this!

Phanatic commented 3 months ago

Re: Should we auto-escalate tickets near SLA breach?

I think the answer here is yes: if a ticket is going to drop on the floor and we can't meet our SLA guarantees, then whoever is on support hero duty needs to get involved. Currently, the SLA breach notifications are helping, and they avoid the need for the engineering team to triage every ticket opened in feature success. The engineering team gets to jump in only when necessary.
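As a rough illustration of what "near an SLA breach" could mean in practice, here is a minimal sketch. The `needs_escalation` function and the two-hour `warning_window` are assumptions for illustration, not how the notifier being tested actually works.

```python
from datetime import datetime, timedelta, timezone


def needs_escalation(sla_deadline, now,
                     warning_window=timedelta(hours=2),
                     already_escalated=False):
    """True if an unescalated ticket is inside the warning window before its
    SLA deadline (or already past it)."""
    if already_escalated:
        return False  # nothing to do, an engineer already owns it
    return now >= sla_deadline - warning_window


# Made-up timestamps for illustration only.
now = datetime(2024, 8, 22, 12, 0, tzinfo=timezone.utc)
deadline_soon = now + timedelta(hours=1)    # inside the assumed 2h window
deadline_later = now + timedelta(hours=6)   # still comfortably within SLA
```

The same check could drive either a notification or an automatic escalation; which of the two it triggers is the decision this RFC is about.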

Personally, I like looking at all tickets, but this can't be a feasible standard operating procedure for teams where the tickets vastly outnumber the team size.

> The root cause for this generally seems to be that engineers are not always clear on which tickets they should be looking at.

TBH, the confusion for me stemmed from the fact that the engineering team looks only at escalated tickets, but the SLA metrics reported include all opened tickets. Which meant that while the team was on top of all the escalated tickets, the metrics still made us look really bad.

I'd posit that we split our internal SLA metrics into escalated and non-escalated. Measuring this difference in behavior might help us pinpoint areas where we need more investment.
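A minimal sketch of what that split could look like, assuming each ticket can be reduced to two flags (was it escalated, did it meet its SLA). The function name and the sample numbers are made up for illustration and are not real PostHog data.

```python
from collections import defaultdict


def sla_achievement_by_escalation(tickets):
    """tickets: iterable of (escalated, met_sla) boolean pairs.

    Returns a dict mapping "escalated" / "non_escalated" to the share of
    tickets in that bucket that met their SLA. Empty buckets are omitted.
    """
    met = defaultdict(int)
    total = defaultdict(int)
    for escalated, met_sla in tickets:
        bucket = "escalated" if escalated else "non_escalated"
        total[bucket] += 1
        met[bucket] += int(met_sla)
    return {bucket: met[bucket] / total[bucket] for bucket in total}


# Hypothetical sample: 4 escalated tickets (3 met SLA), 4 non-escalated (1 met SLA).
sample = [
    (True, True), (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False), (False, False),
]
rates = sla_achievement_by_escalation(sample)
```

In this made-up sample the combined rate would be 50%, which hides that the escalated bucket sits at 75% and the non-escalated one at 25%.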

pauldambra commented 3 months ago

> Which meant that while the team was on top of all the escalated tickets, the metrics still made us look really bad.

What's the quote? "When a metric becomes a target it stops being a useful metric", or something to that effect at least.

I think the metric did the right thing here

metric changes

does a specific team have to change -> probably, and change has been made

does everyone have to change -> probably not - the support loads aren't the same and it looks like everyone has a slight variation on the "official" approach anyway.

> made us look really bad.

FWIW I think you've reacted well here @Phanatic. local changes immediately, raised the concern so we can decide what to do more widely.


if we change something - and i am a broken clock and think talking to each other is enough here...

> What if we tackled the root cause and simplified the way engineers look at tickets?

So "escalated" stops being a different place and starts being a ticket status.

Now you have a queue of "escalated" then "high" then "normal" then "low" priority. By local agreement you work through that queue in some way (e.g. team A engineers only look at escalated, team B engineers look at escalated and high... etc)

voila no longer two places :)
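A minimal sketch of that single queue, assuming "escalated" simply ranks above the existing priorities and each team agrees locally on how far down the queue it works. The names `PRIORITY_ORDER`, `team_queue` and the sample tickets are hypothetical.

```python
# Queue order from top to bottom; "escalated" is just the highest status.
PRIORITY_ORDER = ["escalated", "high", "normal", "low"]
RANK = {status: i for i, status in enumerate(PRIORITY_ORDER)}


def team_queue(tickets, lowest_status_worked):
    """tickets: list of (ticket_id, status) pairs.

    Returns ticket ids in queue order, down to and including the lowest
    status the team has agreed to work.
    """
    cutoff = RANK[lowest_status_worked]
    in_scope = [t for t in tickets if RANK[t[1]] <= cutoff]
    # sorted() is stable, so tickets with the same status keep their input order.
    ordered = sorted(in_scope, key=lambda t: RANK[t[1]])
    return [ticket_id for ticket_id, _status in ordered]


# Made-up sample data for illustration only.
tickets = [(101, "normal"), (102, "escalated"), (103, "high"),
           (104, "low"), (105, "escalated")]

team_a = team_queue(tickets, "escalated")  # team A: escalated only
team_b = team_queue(tickets, "high")       # team B: escalated and high
```

Both teams read from the same list; only the cutoff differs, so there is one place to look rather than two.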

pauldambra commented 3 months ago

Goodhart's law is an adage often stated as, "When a measure becomes a target, it ceases to be a good measure". It is named after British economist Charles Goodhart, who is credited with expressing the core idea of the adage in a 1975 article on monetary policy in the United Kingdom:

Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.

It was used to criticize the British Thatcher government for trying to conduct monetary policy on the basis of targets for broad and narrow money, but the law reflects a much more general phenomenon.

https://en.wikipedia.org/wiki/Goodhart%27s_law

slshults commented 3 months ago

Currently the ticket volumes and team sizes vary so much that it seems like we may not be able to find a one-size-fits-all solution, yet.

Reading through the discussion so far, I'm wondering if it might work for us to define some aspects that can be applied to all teams (e.g. adding alerts for SLA breaches), and then let each team define what works best for them as far as views, auto-assignments, auto-escalations etc. go? ("Trust and feedback over process")

(Comms can help set up changes for each team, of course.)

joethreepwood commented 3 months ago

> criticize the British Thatcher government

Well, I'm sold!

pauldambra commented 3 months ago

Screenshot 2024-08-23 at 11 07 04

🤣 as I pasted that in I thought "Joe, for sure, will be swayed by this"