DFurnes commented 6 years ago

INCIDENT

What's gone wrong?

We've been receiving support tickets from users who are unable to report back on Phoenix. They get an "Unauthenticated" message in the uploader when they try to submit:

[Screenshot: screen_shot_2018-11-11_at_10 24 16_pm]

Timeline

- Deployed Rogue on Thursday at 10:47am EST (diff).
- Deployed Northstar on Friday at 2:49pm EST (diff).
- Deployed Phoenix on Friday at 1:22pm EST (diff), and at 2:49pm EST (diff).
- Help tickets began coming in Friday at ~12:31am EST~ (unrelated scholarship question) 9:33pm EST and continued through the weekend. Hannah found and compiled them, and raised the issue in #team-product Monday at 1:50am EST.
- Matt CC'd the issue in #dev-phoenix at 7:21am.
- Mendel jumped in and started digging into Rogue errors at 9:46am.
- Dave saw Hannah's message in #team-product at 10:10am & created this issue. 🤓
- Dave rolled back Phoenix to v206 at 10:35am, resolving the issue in production.
- Mendel figured out the underlying issue at 10:58am, and pushed up a fix in DoSomething/phoenix#1182 at 11:17am. This fixed things up on QA.
- We re-deployed master with that fix at 2:20pm, and ran through manual testing of signup, photo/text/share post, quiz, and article flows on production to make sure no new issues appeared.
- We reached out to members who were affected by the bug via email on Monday at 6:08pm.

Relevant Screenshots + Links

DFurnes commented 6 years ago

Filling in the timeline as best I can, and Slack thread here!

DFurnes commented 6 years ago

Confirming that I'm able to request a new authentication token from Northstar & upload a photo to Rogue via Paw, so this seems to be an issue that's isolated to Phoenix.

DFurnes commented 6 years ago

Yup, a bunch of 401s are coming in on Phoenix's v2/campaigns/:id/posts route (Papertrail), e.g.:

Nov 12 10:27:19 dosomething-phoenix heroku/router: at=info method=POST path="/api/v2/campaigns/3ZbXS7fAXS8uqmGCmQ8Eu4/posts" host=www.dosomething.org request_id=4618895c-80b2-4b19-8b6f-f1b34842f370 fwd="204.169.220.182, 204.169.220.182,104.156.83.24" dyno=web.1 connect=0ms service=1533ms status=401 bytes=306 protocol=https 
Nov 12 10:27:22 dosomething-phoenix heroku/router: at=info method=POST path="/api/v2/campaigns/79UhtzU6u4m80AYcUayYUU/posts" host=www.dosomething.org request_id=33fa0fb3-2d84-4c22-bdea-9c30cb496603 fwd="129.130.18.97, 129.130.18.97,157.52.93.34" dyno=web.1 connect=0ms service=28ms status=401 bytes=306 protocol=https 
Nov 12 10:27:46 dosomething-phoenix heroku/router: at=info method=POST path="/api/v2/campaigns/79UhtzU6u4m80AYcUayYUU/posts" host=www.dosomething.org request_id=7afd1ed4-a765-45d8-9556-755eb9a7227d fwd="165.29.50.189, 165.29.50.189,157.52.86.45" dyno=web.1 connect=0ms service=29ms status=401 bytes=306 protocol=https

DFurnes commented 6 years ago

Rolled Phoenix back to v206 (reverting v207 and v208), and the photo uploader seems to be working once again. We can now dig into what went wrong in those releases at a more leisurely pace. 😅

DFurnes commented 6 years ago

I'm still seeing this issue on Preview & QA (which are both running the buggy v208). It's a bummer we didn't catch this when testing post-deploy. I could've sworn I ran through the uploader flow. Uff.

This is a good reminder that we need to set up Ghost Inspector monitoring for this app now that it's serving the majority of our production web traffic (and soon to be all)! 🚥

DFurnes commented 6 years ago

Mendel figured it out! We refactored how authentication tokens are passed to Gateway's RestApiClient and were sending { Authorization: …} instead of the expected { headers: { Authorization: … }}
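
For anyone skimming later, here is a minimal sketch of that kind of shape mismatch, written in Python rather than taken from Phoenix. `build_request` is a hypothetical stand-in for a client that only reads headers from a nested `headers` key; it is not Gateway's actual RestApiClient, and the path and token are placeholders.

```python
def build_request(method, path, options=None):
    """Hypothetical client helper: only the nested 'headers' key is honored."""
    options = options or {}
    return {
        "method": method,
        "path": path,
        # Top-level keys other than 'headers' are silently ignored.
        "headers": dict(options.get("headers", {})),
    }


token = "Bearer example-token"  # placeholder, not a real credential

# Shape that was being sent: Authorization at the top level, so it is dropped.
broken = build_request("POST", "posts", {"Authorization": token})

# Shape the client expects: Authorization nested under 'headers'.
fixed = build_request("POST", "posts", {"headers": {"Authorization": token}})

print("broken headers:", broken["headers"])
print("fixed headers:", fixed["headers"])
```

Under these assumptions the first request goes out with no Authorization header at all, which would be consistent with the API answering 401 without any error being raised on the client side.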

mshmsh5000 commented 6 years ago

OOooooohhhhhhhh 🕵️

DFurnes commented 6 years ago

Updated the timeline with everything that's happened till now! Mendel's fix (above) is live on QA and we've confirmed that photo uploading works once again. :v:

DFurnes commented 6 years ago

I pulled down Phoenix's logs from Friday-Monday and filtered for 401s on the post route – we had 5120 total errors for this over the course of the weekend (although we don't log enough details to be able to know how many of these were from repeat attempts by the same user).
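
A rough sketch of how that kind of filter could look in Python, assuming the logs are Heroku router lines in the `key=value` format shown earlier in this thread. The function names are made up for illustration, the first two sample lines are trimmed copies of the Papertrail lines above, and the third is synthetic; this is not the exact command that produced the 5120 figure.

```python
import re

# key=value pairs, where the value is either quoted or a run of non-spaces.
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')
# The post route: /api/v2/campaigns/<id>/posts
POST_ROUTE = re.compile(r"^/api/v2/campaigns/[^/]+/posts$")


def parse_fields(line):
    """Return the key=value fields of one router log line as a dict."""
    return {key: value.strip('"') for key, value in FIELD.findall(line)}


def count_post_401s(lines):
    """Count POST requests to the post route that were answered with a 401."""
    count = 0
    for line in lines:
        fields = parse_fields(line)
        if (
            fields.get("method") == "POST"
            and fields.get("status") == "401"
            and POST_ROUTE.match(fields.get("path", ""))
        ):
            count += 1
    return count


sample = [
    'Nov 12 10:27:19 dosomething-phoenix heroku/router: at=info method=POST path="/api/v2/campaigns/3ZbXS7fAXS8uqmGCmQ8Eu4/posts" host=www.dosomething.org status=401 bytes=306 protocol=https',
    'Nov 12 10:27:22 dosomething-phoenix heroku/router: at=info method=POST path="/api/v2/campaigns/79UhtzU6u4m80AYcUayYUU/posts" host=www.dosomething.org status=401 bytes=306 protocol=https',
    # Synthetic line for contrast: a successful GET should not be counted.
    'Nov 12 10:28:00 dosomething-phoenix heroku/router: at=info method=GET path="/api/v2/campaigns/79UhtzU6u4m80AYcUayYUU/posts" host=www.dosomething.org status=200 bytes=306 protocol=https',
]

print(count_post_401s(sample))
```

Since these router lines carry no user identifier, a count like this cannot separate distinct users from repeat attempts.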

mshmsh5000 commented 6 years ago

Thanks @DFurnes. We can't say that we lost 5120 reportbacks, but that's still a big number that's worth sharing with the org, along with a plan of what we're going to do differently. This is worth a post-mortem first.

DFurnes commented 6 years ago

Totally agreed! Mendel just queried Keen for the number of unique users who received a "failed reportback" event and got ~700, which is already much less of a bummer.

mendelB commented 6 years ago

Ran a more refined Keen query based on the exact times we deployed and rolled back (Fri 1pm - Mon 11am), which yields a total of 4.93k failed reportbacks by 898 unique users (full count query; uniqued by Northstar ID query).

mshmsh5000 commented 6 years ago

Huh. So, do we interpret that to say that, during this period, the average user tried to submit 5 1/2 times before giving up?

mendelB commented 6 years ago

I believe so. We can check the full CSV to confirm there's no outlier here, but generally we do see people try again (and again?) when there's an RB failure. Our members don't give up 💪

mendelB commented 6 years ago

Yeah, perusing the CSV export of this data definitely gives the impression that certain folks try repeatedly (caught one user with 24 consecutive requests!) and some abandon ship after one or two tries.
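
For reference, the figures above work out to roughly 4,930 / 898 ≈ 5.5 failed attempts per user on average. A mean can be pulled up by a few heavy retriers, so a quick way to check for outliers is to compare it with the median and the max. The sketch below assumes one user ID per failed event; the IDs are synthetic and the helper names are hypothetical, not the actual export or query.

```python
from collections import Counter
from statistics import mean, median


def attempts_per_user(user_ids):
    """Count failed-post events per user ID."""
    return Counter(user_ids)


def summarize(counts):
    """Summarize how failed attempts are spread across users."""
    values = list(counts.values())
    return {
        "users": len(values),
        "events": sum(values),
        "mean": mean(values),
        "median": median(values),
        "max": max(values),
    }


# Synthetic events: one heavy retrier plus a few users who gave up quickly.
events = ["a"] * 24 + ["b"] * 2 + ["c"] + ["d"]
summary = summarize(attempts_per_user(events))
print(summary)
```

In this made-up sample the mean is 7 attempts while the median is 1.5, which is the kind of gap that would indicate a skewed distribution rather than everyone retrying five or six times.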

DFurnes commented 6 years ago

Hannah had the fabulous idea to reach out to affected members (since we know their user IDs via the Puck phoenix_failed_post_request events in Keen). I've provided Anthony with a select_unique version of Mendel's query above, and we'll use that to let people know they can still report back!

mshmsh5000 commented 6 years ago

👍 Excellent!

DFurnes commented 6 years ago

Published this and closing it out.

DoSomething / infrastructure

Incident: Users unable to create posts on Phoenix. #64
