codeforamerica / balance

A text message system for checking one's EBT card balance (SNAP benefits and more)
MIT License

Fix occasional problem where recording+callback doesn't happen because of silence #303

Open daguar opened 9 years ago

daguar commented 9 years ago

Copying description from the bug investigation #302:

It appears that a small number of calls (I see a few TX, and one PA) are not recording despite the "record" directive — and therefore not transcribing and sending a callback with the transcription body AKA not responding back to the user.

All of the calls exhibiting this behavior have a duration of ~20 seconds, so I'm pretty sure the call's just ending because it hears silence for too long before the system can read out the balance.

Let me explain with a fake example. The TwiML that the call API requests when we initiate the phone call to the state system looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Play digits="wwww1wwwwww0000111122223333444ww"/>
    <Record transcribe="true" transcribeCallback="https://balance-production.herokuapp.com/TX/13334445555/12101112222/send_balance" maxLength="18"/>
</Response>

What I think is happening: the Play directive (which simulates the button pushes) runs, and then the Record directive starts, BUT there's a silent pause on the state line's end. The Record directive hears that silence and says "okay! nothing to record here! will just end the call."

I think there are two possible solutions here, both involving playing a bit with the button sequence for the affected states:

  1. Add more waiting at the end of the button pushing (i.e., maybe 1-2 more w's at the end), OR
  2. Figure out if the phone system wants you to push a button (like #) after entering your EBT number, and add THAT as a button push at the end of the EBT number input part of the button sequence
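
To make the two options concrete, here is a minimal sketch of how each would change the digit string from the fake example above. The helper names (`pause_seconds`, `add_trailing_wait`, `add_terminator`) are made up for illustration and are not from the balance codebase; the only outside fact relied on is that Twilio treats each `w` in a digit string as a half-second pause.

```python
WAIT = "w"              # Twilio digit strings treat each "w" as a 0.5 second pause
SECONDS_PER_WAIT = 0.5


def pause_seconds(digits: str) -> float:
    """Total pause time encoded in a digit string."""
    return digits.count(WAIT) * SECONDS_PER_WAIT


def add_trailing_wait(digits: str, extra_waits: int = 2) -> str:
    """Option 1: pad the end of the sequence with extra pauses."""
    return digits + WAIT * extra_waits


def add_terminator(digits: str, terminator: str = "#") -> str:
    """Option 2: press a terminator key right after the last digit entered,
    keeping any trailing pauses after it."""
    stripped = digits.rstrip(WAIT)
    trailing = digits[len(stripped):]
    return stripped + terminator + trailing


# The fake button sequence from the TwiML example (hypothetical card number).
example = "wwww1wwwwww0000111122223333444ww"

print(add_trailing_wait(example))  # same sequence with two more w's at the end
print(add_terminator(example))     # same sequence with "#" after the card number
```

Whether a given state's phone system actually accepts `#` as a terminator is something that would need to be checked by calling it; the sketch only shows where the key press would go.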

A relatively small number of users is affected, but it's still not a good experience for those few — they just never hear back after the "Thanks! Please wait..." message and that sucks.

A good solution for this involves both:

  A. Implementing one of these fixes, and
  B. Setting up more rigorous monitoring for this specific failure rate so we know if it's happening again
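
As a rough idea of what the monitoring half could look like, here is a minimal sketch that computes the share of calls per state that produced no recording and flags states above a threshold. The record shape (`state`, `recorded`), the sample data, and the 5% threshold are all assumptions for illustration, not how the app stores or alerts on calls today.

```python
from collections import Counter


def failure_rates(calls):
    """Fraction of calls per state that ended without a recording."""
    totals, failures = Counter(), Counter()
    for call in calls:
        totals[call["state"]] += 1
        if not call["recorded"]:
            failures[call["state"]] += 1
    return {state: failures[state] / totals[state] for state in totals}


def states_over_threshold(calls, threshold=0.05):
    """States whose no-recording rate exceeds the (assumed) alert threshold."""
    rates = failure_rates(calls)
    return sorted(state for state, rate in rates.items() if rate > threshold)


# Made-up sample data: 1 of 4 TX calls failed to record, 0 of 10 CA calls did.
sample_calls = (
    [{"state": "TX", "recorded": True}] * 3
    + [{"state": "TX", "recorded": False}]
    + [{"state": "CA", "recorded": True}] * 10
)

print(failure_rates(sample_calls))
print(states_over_threshold(sample_calls))
```

Tracking a rate rather than a raw count matters here because call volume differs a lot between states, so a single failure means something different in each.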

daguar commented 9 years ago

I have put some Call SIDs for a sample of affected calls in the private repo here: https://github.com/codeforamerica/health-private/issues/7

The way to look for this in the logs generally is to:

  1. In Twilio dashboard, go to Logs -> Calls
  2. Look for Outgoing calls where there is no recording icon on the right, AND where there is no Incoming call right before it (these are ones where the user is just calling through our system to the state system)
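
The manual log check above could be sketched as a filter over exported call records. The field names (`sid`, `direction`, `has_recording`, `start_time`) are hypothetical stand-ins for whatever an export of the call log provides, and "right before" is interpreted here as "the immediately preceding call in chronological order", which is an assumption.

```python
def suspect_calls(calls):
    """Return SIDs of outgoing calls with no recording that were not
    immediately preceded by an incoming call (i.e. not a call-through)."""
    ordered = sorted(calls, key=lambda call: call["start_time"])
    suspects = []
    previous = None
    for call in ordered:
        preceded_by_incoming = (
            previous is not None and previous["direction"] == "incoming"
        )
        if (
            call["direction"] == "outgoing"
            and not call["has_recording"]
            and not preceded_by_incoming
        ):
            suspects.append(call["sid"])
        previous = call
    return suspects


# Made-up sample log; start_time is seconds from an arbitrary origin.
sample_log = [
    {"sid": "call-1", "direction": "incoming", "has_recording": False, "start_time": 0},
    {"sid": "call-2", "direction": "outgoing", "has_recording": False, "start_time": 1},
    {"sid": "call-3", "direction": "outgoing", "has_recording": True, "start_time": 50},
    {"sid": "call-4", "direction": "outgoing", "has_recording": False, "start_time": 90},
]

print(suspect_calls(sample_log))
```

In the sample, `call-2` is skipped because an incoming call comes right before it, `call-3` is skipped because it has a recording, and `call-4` is the only one flagged.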

daguar commented 9 years ago

Got another just now

daguar commented 9 years ago

Other alert instances of this:

I'm less concerned with the CA one because at our volume it's a real outlier to have only 1 instance of the problem. Texas and the others are problematic.

I'm considering turning off AdWords for all states except CA since we're not going deep there anyway. Thoughts @lippytak @alanjosephwilliams?

For Texas, I think I'll just throw another second or so of wait-time on there and see if it resolves, and, if not, consider turning it off.

alanjosephwilliams commented 9 years ago

I support turning off AdWords outside of California. I trust your judgement completely, but this seems like exactly the kind of juncture where even a modest investment (in this case, doing some manual-but-rigorous probing of the IVR for at least 4 states) should be evaluated against other priorities.

daguar commented 9 years ago

#310 should help with Texas at least.