Closed DFurnes closed 6 years ago
Filling in the timeline as best I can, and Slack thread here!
Confirming that I'm able to request a new authentication token from Northstar & upload a photo to Rogue via Paw, so this seems to be an issue that's isolated to Phoenix.
Yup, a bunch of 401s are coming in on Phoenix's v2/campaigns/:id/posts route (Papertrail), e.g.:
Nov 12 10:27:19 dosomething-phoenix heroku/router: at=info method=POST path="/api/v2/campaigns/3ZbXS7fAXS8uqmGCmQ8Eu4/posts" host=www.dosomething.org request_id=4618895c-80b2-4b19-8b6f-f1b34842f370 fwd="204.169.220.182, 204.169.220.182,104.156.83.24" dyno=web.1 connect=0ms service=1533ms status=401 bytes=306 protocol=https
Nov 12 10:27:22 dosomething-phoenix heroku/router: at=info method=POST path="/api/v2/campaigns/79UhtzU6u4m80AYcUayYUU/posts" host=www.dosomething.org request_id=33fa0fb3-2d84-4c22-bdea-9c30cb496603 fwd="129.130.18.97, 129.130.18.97,157.52.93.34" dyno=web.1 connect=0ms service=28ms status=401 bytes=306 protocol=https
Nov 12 10:27:46 dosomething-phoenix heroku/router: at=info method=POST path="/api/v2/campaigns/79UhtzU6u4m80AYcUayYUU/posts" host=www.dosomething.org request_id=7afd1ed4-a765-45d8-9556-755eb9a7227d fwd="165.29.50.189, 165.29.50.189,157.52.86.45" dyno=web.1 connect=0ms service=29ms status=401 bytes=306 protocol=https
I'm still seeing this issue on Preview & QA (which are both running the buggy v208). It's a bummer we didn't catch this when testing post-deploy. I could've sworn I ran through the uploader flow. Uff.
This is a good reminder that we need to set up Ghost Inspector monitoring for this app now that it's serving the majority of our production web traffic (and soon to be all)!
Mendel figured it out! We refactored how authentication tokens are passed to Gateway's RestApiClient and were sending { Authorization: … } instead of the expected { headers: { Authorization: … }}
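To illustrate why a mistake like this fails quietly, here is a minimal sketch in Python. It is not Gateway's actual code: `build_request`, its `options` argument, the `/posts` path, and the token value are all hypothetical stand-ins for a client that only forwards what it finds under a `headers` key. A top-level `Authorization` entry is ignored, so the request goes out without credentials and the API responds with a 401.

```python
def build_request(method, path, options=None):
    """Hypothetical client helper: only options["headers"] becomes HTTP headers."""
    options = options or {}
    return {
        "method": method,
        "path": path,
        # Anything outside the "headers" key is never sent as a header.
        "headers": dict(options.get("headers", {})),
    }


token = "Bearer example-token"  # placeholder value, not a real token

# Buggy call shape: Authorization sits at the top level of the options.
buggy = build_request("POST", "/posts", {"Authorization": token})

# Expected call shape: Authorization is nested under "headers".
fixed = build_request("POST", "/posts", {"headers": {"Authorization": token}})

print("Authorization" in buggy["headers"])  # False
print("Authorization" in fixed["headers"])  # True
```

Because nothing raises an error in the buggy case, this kind of regression only shows up as 401s from the downstream service, which is why an end-to-end check of the upload flow would likely have caught it.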
OOooooohhhhhhhh
Updated the timeline with everything that's happened till now! Mendel's fix (above) is live on QA and we've confirmed that photo uploading works once again. :v:
I pulled down Phoenix's logs from Friday-Monday and filtered for 401s on the post route; we had 5120 total errors for this over the course of the weekend (although we don't log enough details to be able to know how many of these were from repeat attempts by the same user).
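For reference, a filter like this can be sketched in a few lines of Python. This is an assumption about how one might do it, not the script that was actually used: the regular expressions and the `count_post_401s` helper are illustrative, and the sample lines are trimmed copies (without the `request_id` and `fwd` fields) of the three router lines quoted earlier in this thread.

```python
import re
from collections import Counter

# Pulls the method, path and status fields out of a Heroku router log line.
ROUTER_LINE = re.compile(
    r'method=(?P<method>\S+) path="(?P<path>[^"]*)".*?\bstatus=(?P<status>\d{3})'
)
# The campaign posts route: /api/v2/campaigns/:id/posts
POST_ROUTE = re.compile(r"^/api/v2/campaigns/[^/]+/posts$")


def count_post_401s(lines):
    """Count 401 responses to POSTs on the campaign posts route, keyed by path."""
    counts = Counter()
    for line in lines:
        match = ROUTER_LINE.search(line)
        if not match:
            continue  # not a router line, or missing the fields we need
        if (
            match["method"] == "POST"
            and match["status"] == "401"
            and POST_ROUTE.match(match["path"])
        ):
            counts[match["path"]] += 1
    return counts


sample = [
    'Nov 12 10:27:19 dosomething-phoenix heroku/router: at=info method=POST path="/api/v2/campaigns/3ZbXS7fAXS8uqmGCmQ8Eu4/posts" host=www.dosomething.org dyno=web.1 connect=0ms service=1533ms status=401 bytes=306 protocol=https',
    'Nov 12 10:27:22 dosomething-phoenix heroku/router: at=info method=POST path="/api/v2/campaigns/79UhtzU6u4m80AYcUayYUU/posts" host=www.dosomething.org dyno=web.1 connect=0ms service=28ms status=401 bytes=306 protocol=https',
    'Nov 12 10:27:46 dosomething-phoenix heroku/router: at=info method=POST path="/api/v2/campaigns/79UhtzU6u4m80AYcUayYUU/posts" host=www.dosomething.org dyno=web.1 connect=0ms service=29ms status=401 bytes=306 protocol=https',
]

counts = count_post_401s(sample)
print(sum(counts.values()))  # 3
```

Router lines carry no user identifier, so a count like this measures failed requests rather than affected users.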
Thanks @DFurnes. We can't say that we lost 5120 reportbacks, but that's still a big number that's worth sharing with the org, along with a plan of what we're going to do differently. This is worth a post-mortem first.
Totally agreed! Mendel just queried Keen for the number of unique users who received a "failed reportback" event and got ~700, which is already much less of a bummer.
Ran a more refined Keen query based on the exact times we deployed and rolled back (Fri 1pm - Mon 11am), which yields a total of 4.93k failed reportbacks by 898 unique users. (Full count query, Uniqued by Northstar ID query)
Huh. So, do we interpret that to say that, during this period, the average user tried to submit 5 1/2 times before giving up?
I believe so. We can check the full CSV to confirm there's no outlier here, but generally we do see people try again (and again?) when there's a RB failure. Our Members don't give up!
Yeah, perusing the CSV export of this data definitely gives the impression that certain folks try repeatedly (caught one user with 24 consecutive requests!) and some abandon ship after one or two tries.
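To check whether a few heavy retriers skew the average, one could summarize attempts per user along these lines. This is only a sketch: the real export's column names aren't shown here, so `attempts_summary` takes a plain list with one user ID per failed event, and the IDs below are made up for illustration.

```python
from collections import Counter
from statistics import mean, median


def attempts_summary(user_ids):
    """Summarize failed attempts per user, given one user ID per failed event."""
    per_user = Counter(user_ids)
    attempts = sorted(per_user.values())
    return {
        "users": len(per_user),
        "events": len(user_ids),
        "mean": mean(attempts),
        "median": median(attempts),
        "max": attempts[-1],
    }


# Made-up IDs: one heavy retrier and two users who gave up quickly.
events = ["user-a"] * 24 + ["user-b"] * 2 + ["user-c"]
summary = attempts_summary(events)
print(summary)

# Totals reported from Keen: roughly 4.93k failed events by 898 unique users.
print(round(4930 / 898, 2))  # 5.49
```

A median well below the mean would suggest that most members stopped after a try or two, with a small number of persistent retriers pulling the average up.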
Hannah had the fabulous idea to reach out to affected members (since we know their user IDs via the Puck phoenix_failed_post_request events in Keen). I've provided Anthony with a select_unique version of Mendel's query above, and we'll use that to let people know they can still report back!
Excellent!
Published this and closing it out.
INCIDENT
What's gone wrong?
We've been receiving support tickets from users who are unable to report back on Phoenix. They see an "Unauthenticated" message in the uploader when they try to submit.
Timeline
master with that fix at 2:20pm, and ran through manual testing of signup, photo/text/share post, quiz, and article flows on production to make sure no new issues appeared.
Relevant Screenshots + Links