codeforamerica / snap-it-up

Super-simple dashboard showing the status of SNAP-related web services.
http://status.citizenonboard.com/
BSD 3-Clause "New" or "Revised" License

Write an awesome article #6

Open lippytak opened 9 years ago

lippytak commented 9 years ago

@bengolder let's start dropping notes here and build out an outline.

bengolder commented 9 years ago

I made a Google Doc here. I'll copy and paste it into this issue thread.

This is about getting reliable services in the US. It should help people work toward monitoring solutions for services and understand the potential impact of unexpected service interruptions.

Goals for Audiences:

bengolder commented 9 years ago

Article Outline:

daguar commented 9 years ago

The meme I've always liked for this is:

"When HealthCare.gov crashed for middle class Americans it was a crisis. But in the social safety net, the status quo is crisis."

lippytak commented 9 years ago

Part of an old email I wrote to myself in the middle of the night:

More and more I'm starting to feel that our entire social safety net is failing just like HC.gov, except nobody's watching and those who are failed have no voice. What's worse, it's been failing so badly and for so long that what should be considered an emergency has become the status quo. The fact that California's main benefits website doesn't support mobile AT ALL is an emergency when your target demo doesn't have computers. The fact that veterans in SF jails get released with 4 nights of shelter housing and <$100 cash is an emergency. I watched a CoveredCA webinar last week and jotted down this note: "Far and away, the worst user experience is for those who need it most and are least empowered (MediCal eligibles). This is our opportunity from a social justice/advocacy standpoint. Concretely, there is no coherent way to enroll in MediCal through CoveredCA, and if anything starting with CoveredCA is a long, arduous, confusing detour that just ends with a cold referral to a phone number that begins the process anew. People don’t know what MediCal is, why they are suddenly not allowed to “shop” for plans, and left wondering why they went through this process to begin with." These are emergencies, but there is no urgency anywhere...

alanjosephwilliams commented 9 years ago

I love the phrase "emergencies with no urgency"

bensheldon commented 9 years ago

Sorry to butt in (damn GitHub auto subscribe to new repos) but could you use/reuse/inspire Open311status for this? Http://open311status.org ?

I'm also working on a Ruby version. It's probably a little more reliable to use a 3rd party service, but ETLing uptime time series data can suck (I wrote that for my day job for platform uptime and it starts to suck at the 500+ site/minute granularity level).

RebeccaCoelius commented 9 years ago

Love this. In medicine the corollary is that our society has decided that we have to intervene if somebody is at risk of dying that day in front of us (Emergency Rooms/EMTALA), but it's somehow OK if it happens over years, even if the path dependency is fully acknowledged.

lippytak commented 9 years ago

@bengolder is steppin up and steppin in to own this issue :) (aka v1)

bengolder commented 9 years ago

:thumbsup:

Mr0grog commented 9 years ago

@bensheldon Sorry I didn’t see any of this thread; I think it happened on the day GitHub unsubscribed me from the repo, without my realizing it, when I transferred it to CfA.

> could you use/reuse/inspire Open311status for this? Http://open311status.org ?

I actually suggested this when I first heard about the idea of monitoring SNAP services from @lippytak and @alanjosephwilliams. My understanding is that there were a few reasons they didn’t go that way:

Further, the only reason this repo really came into existence was because, just after I got back from traveling, I was chatting with @alanjosephwilliams, who noted it would be nice just to have a simple map showing what service portals were down. Incredibly straightforward and obvious. So I did that :)

lippytak commented 9 years ago

Relevant: http://www.cbpp.org/cms/?fa=view&id=618

Mr0grog commented 9 years ago

Two notes that may be worth covering in any writeup:

  1. Watching the constant up-down-up-down-up-down-up-down-good-lord-when-will-this-end-up-down-up stream of alerts for MyBenefitsCalwin this weekend and today is painful. Not only is this a huge pain for users (I wonder if sessions get reset when the system goes down and back up again…), but it’s also an example of the worst kind of deployment process (@lippytak noted they are deploying a new version of the site, so that’s probably the cause of this).
  2. Was looking at the now-autogenerated screenshots and noticed this gem from New Mexico:

    [image: nm-54bc3265be653d3f4a065dc3-2015-02-18t05-23-06 00-00]

    Where their upgrade is not going to include any applications that have been started but not submitted. The description also makes it sound like there’s a two-day window before the upgrade when actual, submitted applications won’t be valid, either. Oof.

lippytak commented 9 years ago

This is what deployments look like: [image: screen shot 2015-02-17 at 10 55 59 pm]

Mr0grog commented 9 years ago

:sob:

alanjosephwilliams commented 9 years ago

dear lord on both fronts

lippytak commented 9 years ago

We could call it One Nines (http://en.wikipedia.org/wiki/High_availability)

[image: screen shot 2015-03-05 at 5 15 18 pm]
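
For anyone unfamiliar with the "nines" shorthand behind the name, here is a minimal sketch of the arithmetic. The function name is made up for illustration and is not part of this repo, and it assumes a 365-day year. It only converts an availability percentage into the downtime that percentage allows.

```python
# Illustrative only: not code from this repo.
# Converts an availability percentage into allowed downtime per year.
DAYS_PER_YEAR = 365  # assumption: non-leap year


def downtime_days_per_year(availability_percent: float) -> float:
    """Return how many days per year a service may be down at this availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * DAYS_PER_YEAR


if __name__ == "__main__":
    for label, pct in [("one nine", 90.0), ("two nines", 99.0), ("three nines", 99.9)]:
        days = downtime_days_per_year(pct)
        print(f"{label} ({pct}%): about {days:.2f} days of downtime per year")
```

By this arithmetic a "one nine" (90%) service can be down for roughly 36.5 days a year.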

lippytak commented 9 years ago

[image: screen shot 2015-03-05 at 5 17 45 pm]

"Your website shouldn't have more days off than you do."

Mr0grog commented 9 years ago

> Your website shouldn't have more days off than you do.

That is a fantastic line.

bengolder commented 9 years ago

Where can I find screenshots from downtime?

Mr0grog commented 9 years ago

They live here: https://s3-us-west-2.amazonaws.com/snap-snapshots/

Mr0grog commented 9 years ago

The state-pages branch (in PR #29) displays snapshots on the state pages.

alanjosephwilliams commented 9 years ago

Document of anecdotes, assets, and more-meat-on-the-bones version of Ben's outline in progress here: https://docs.google.com/document/d/1JJYceEpF7_eYAr33SdsD4xawoHeNnBZsrnuxnvRpQTw/edit

Mr0grog commented 9 years ago

@bengolder FYI, I just cleared out all the broken snapshots on S3.

lippytak commented 9 years ago

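How downtime reporting works now: [image: screen shot 2015-03-21 at 2 10 45 pm] https://cloud.githubusercontent.com/assets/2533112/6766852/1e0c245e-cfd4-11e4-8ba9-d193d9d3af4a.png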

That gets embedded in an email ^. Then the email gets forwarded around. Looking at email timestamps in this particular case it didn't reach the end user (Liliana) until ~3pm, when the outages started at ~2am according to Pingometer. About a ~12 hour lag...

RebeccaCoelius commented 9 years ago

With no estimated time to resolve

lippytak commented 9 years ago

Hah, good point @RebeccaCoelius. Plain-language summary of the above email: "We don't know what's wrong or when we'll fix it or when we'll be in touch next and we are not sorry for any inconvenience."

alanjosephwilliams commented 9 years ago

Externalizing this work through an awesome article remains a key tactic to capture the value of this work. I've reassigned to myself for the time being, but we should divide labor appropriately when we discuss the scope of outstanding work necessary to realize this goal.

alanjosephwilliams commented 9 years ago

I should document here (as I have elsewhere), though, that the work to date made our presentation of this story in talks at HealthRefactored and Etsy a big success. So much tremendous work has been done by this group. It's awesome.