codeforamerica / balance

A text message system for checking one's EBT card balance (SNAP benefits and more)
MIT License

Investigate monitor-generated callback problems #302

Closed daguar closed 9 years ago

daguar commented 9 years ago

Have had 3 warnings of delayed or missing callbacks since launching more aggressive callback monitoring a few days ago. Looking into the problem and will document findings here.

daguar commented 9 years ago

It appears that a small number of calls (I see a few TX, and one PA) are not recording despite the Record directive. Because nothing is recorded, nothing gets transcribed and no callback with the transcription body is sent, so we never respond back to the user.

All of the calls exhibiting this behavior have a duration of ~20 seconds, so I'm pretty sure the call is just ending because the recording hears silence for too long before the state system can read out the balance.

Let me explain with a fake example. The TwiML requested by the call API when we initiate the phone call to the state system looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Play digits="wwww1wwwwww0000111122223333444ww"/>
    <Record transcribe="true" transcribeCallback="https://balance-production.herokuapp.com/TX/13334445555/12101112222/send_balance" maxLength="18"/>
</Response>

What I think is happening: the Play directive (which simulates the button pushes) runs, and then the Record directive starts, BUT there's a silent pause on the state line's end, so the Record directive hears silence and says "okay! nothing to record here! will just end the call."
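To put a rough number on that silent window, the pauses in the digit sequence can be tallied. This is a minimal sketch, assuming each `w` stands for a 0.5-second pause (which is how Twilio documents digit strings); `pause_seconds` and `trailing_pause_seconds` are hypothetical helper names, and the time spent playing the tones themselves is not counted.

```python
# Assumption: each "w" in the digits string is a 0.5 second pause.
SECONDS_PER_WAIT = 0.5


def pause_seconds(digits: str) -> float:
    """Total pause time contributed by every 'w' in the sequence."""
    return digits.count("w") * SECONDS_PER_WAIT


def trailing_pause_seconds(digits: str) -> float:
    """Pause time between the last button push and the end of the sequence."""
    trailing_waits = len(digits) - len(digits.rstrip("w"))
    return trailing_waits * SECONDS_PER_WAIT


# The fake sequence from the TwiML example in this thread.
example = "wwww1wwwwww0000111122223333444ww"
print(pause_seconds(example))
print(trailing_pause_seconds(example))
```

Under that assumption, the final `ww` gives the state system only about one second after the last digit before recording begins, so the trailing pause is the number to watch.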

I think there are two possible solutions here, both involving playing a bit with the button sequence for the affected states:

  1. Add more waiting at the end of the button pushing (i.e., maybe 1-2 more w's at the end), OR
  2. Figure out if the phone system wants you to push a button (like #) at the end of entering your EBT number, and add THAT as a button push at the end of the EBT # input part of the button sequence
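Both options amount to small changes to the digit string. Here is a minimal sketch of how they might be expressed, assuming the sequence splits into a menu prefix, the card number, an optional terminator, and trailing waits. `build_digits` and its parameters are hypothetical names for illustration, not the app's actual code.

```python
def build_digits(card_number: str,
                 prefix: str = "wwww1wwwwww",
                 terminator: str = "",
                 trailing_waits: int = 2) -> str:
    """Assemble menu prefix, card number, optional terminator, and trailing waits."""
    return f"{prefix}{card_number}{terminator}{'w' * trailing_waits}"


card = "0000111122223333444"  # fake card number from the example

current = build_digits(card)                     # sequence as in the fake example
option_1 = build_digits(card, trailing_waits=4)  # option 1: two more w's at the end
option_2 = build_digits(card, terminator="#")    # option 2: press # after the card number

print(current)
print(option_1)
print(option_2)
```

Keeping the terminator and the trailing waits as separate parameters would let each affected state be tuned on its own without touching the others.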

A relatively small number of users is affected, but it's still not a good experience for those few — they just never hear back after the "Thanks! Please wait..." message and that sucks.

A good solution for this involves both:

A. Implementing one of these fixes, and
B. Setting up more rigorous monitoring for this specific failure rate so we know if it's happening again
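The monitoring half could be as simple as tracking, per state, the share of calls that never produced a transcription callback. Below is a minimal sketch under stated assumptions: the call records, the `callback_received` field, and the 5% threshold are all made up for illustration and do not reflect the app's real data model.

```python
from collections import Counter


def missing_callback_rates(calls):
    """Return {state: fraction of calls with no transcription callback}."""
    totals, missing = Counter(), Counter()
    for call in calls:
        totals[call["state"]] += 1
        if not call["callback_received"]:
            missing[call["state"]] += 1
    return {state: missing[state] / totals[state] for state in totals}


def states_over_threshold(calls, threshold=0.05):
    """States whose missing-callback rate exceeds the (arbitrary) threshold."""
    rates = missing_callback_rates(calls)
    return sorted(state for state, rate in rates.items() if rate > threshold)


# Made-up sample records, for illustration only.
sample_calls = [
    {"state": "TX", "callback_received": True},
    {"state": "TX", "callback_received": False},
    {"state": "PA", "callback_received": False},
    {"state": "CA", "callback_received": True},
]
print(states_over_threshold(sample_calls))
```

Breaking the rate out by state matters here because the failures so far cluster in a couple of states, and an overall average could hide them.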

daguar commented 9 years ago

Closing this (the investigation part) and have opened #303 for the bug-fix part