Investigate response times

aaronschachter commented 8 years ago

Heroku has been sending a number of response time alerts, which seem to have started the first day we pushed the CampaignBot code to production 😿

aaronschachter commented 8 years ago

Knocking on wood, but things seem quieter today. As of this morning, we no longer have mData's posting to the legacy reportback/:campaignName endpoint (#706) -- and sending members only to the chatbot endpoint for Campaigns.

itsjoekent commented 8 years ago

Looking over Heroku metrics now,

Quite a few events are closely tied to other actions we've taken,

But quite a few aren't (Including the two errors fired this afternoon)

It also seems unrelated to throughput or cpu load

(notice the throughput on the bottom has bars in places outside the error which are much higher)

Tomorrow I'm going to do some thorough profiling, these are the steps ill take (in my head)

Create a test dyno formation w/ a custom version of Gambit. In the procfile define the command to start as node --prof server.js instead of the current value.
Generate substantial number of [requests to chatbot](https://github.com/DoSomething/gambit/blob/develop/documentation/endpoints/chatbot.md#chat.
Pull profile logs from server w/ heroku run bash --app ${testservername}
Process logs into something readable locally node --prof-process isolate-0xnnnnnnnnnnnn-v8.log > processed.txt

Hopefully this nets some valuable insight.

More info here from NodeJS docs: https://nodejs.org/en/docs/guides/simple-profiling/

aaronschachter commented 8 years ago

Thanks @deadlybutter -- looking forward to seeing the results.

The other requests that will be regular and worth looking into as well is POST /v1/signups -- Quicksilver posts to this endpoint to trigger sending the CampaignBot msg_menu_signedup_external message to a User when they sign up for a CampaignBot-enabled Campaign on the web.

The DonorsChooseBot mData / endpoint (chatbot?bot_type=donorschoose) won't be regularly used -- only when we run Science Sleuth (or any other SMS Game that should post donation as part of the end game). We are currently posting to this endpoint on prod now though -- not sure when scheduled to close.

itsjoekent commented 8 years ago

Good to know!

itsjoekent commented 8 years ago

ah crap, let me write this down before i forget. if we still have performance problems we can always try doing a nodejs cluster in the app. it looks like we have the underused resources to make it worth it (ive done this before, not that hard)

itsjoekent commented 8 years ago

loadtest -n 250 -c 25 -k -H "x-gambit-api-key:" -T "application/json" --data '{"phone": "5555555511", "keyword": "slothieboi"}' -m POST http://localhost:5000/v1/chatbot

https://www.npmjs.com/package/loadtest

itsjoekent commented 8 years ago

Only just started but found this,

     41    2.2%    2.4%  RegExp: ^(\\/?|)([\\s\\S]*?)((?:\\.{1\,2}|[^\\/]+?|)(\\.[^.\\/]*|))(?:[\\/]*)$

(Every item recorded after thereafter is less than 0.5% usage)

itsjoekent commented 8 years ago

So it seems to be referring to this line https://github.com/DoSomething/gambit/blob/32aeaba8f1a306a70a5769ee432c58c11c76db8c/lib/helpers.js#L187

The problem is even when I comment this code out, it consistently remains at 1%

     44    1.2%    1.3%  RegExp: ^(\\/?|)([\\s\\S]*?)((?:\\.{1\,2}|[^\\/]+?|)(\\.[^.\\/]*|))(?:[\\/]*)$

itsjoekent commented 8 years ago

@aaronschachter so this is where we are at a day after the fix we deployed yesterday. its obviously better but still something seems to be going wrong. Would you prefer I continued auditing or work features?

aaronschachter commented 8 years ago

I'm curious if the Quicksilver Signup imports triggers these alerts. We've had stathat on for a few hours now on production for 4 hours, the POST /v1/signups route will query Northstar GET signups then GET users for each signup that Quicksilver sends -- not sure how often it queries for /v1/campaigns endpoints, but just a few times:

cc @sergii-tkachenko @deadlybutter

itsjoekent commented 8 years ago

Could explain it, my only question would be why it happens some times & not others. (Events don't fire nearly enough for it to be everytime)

aaronschachter commented 8 years ago

@deadlybutter Just happened to have production logs opened while receiving another email alert, and the niche import was running. The niche import posts Signup IDs to the Gambit /signups endpoint -- Gambit queries Northstar for the Signup, then loads the User by the Signup's Northstar UID, in order to send the SMS to the Northstar User's mobile.

You mentioned running tests on the chatbot endpoint here https://github.com/DoSomething/gambit/issues/704#issuecomment-257994381 -- was there any insight? It might be worth running same tests on signups endpoint for any clues 🔍

itsjoekent commented 8 years ago

niche import sounds like a bulk operation? In that case we might want to pull that logic you described into a worker process & queue the signups being imported. Would be fairly simple to do if you want me to work on that at some point

aaronschachter commented 8 years ago

Yup, it's a bulk operation where Quicksilver is posting Signup events to the Gambit /signups endpoint to trigger Signup confirmation SMS messages. Just checked the logs again upon getting an email alert, and I'm beginning to think this could be the culprit.

Have you had any luck running tests on the chatbot and signups POST requests?

aaronschachter commented 8 years ago

Niche import is currently not sending us Users, but we're still seeing alerts today.. so it doesn't seem like it's the bulk import.

itsjoekent commented 8 years ago

@aaronschachter Possible there is other things happening, Niche import just one of them?

jamjensen commented 8 years ago

@deadlybutter since Niche import does not explain the long response times, there must be other unknown causes. We have no idea if these alerts are a non-issue or are seriously affecting the experience for a lot of users. It'd be super helpful to gain an understanding so that we can confidently close this issue and start on other gambit features. Could you resume auditing and continue to share the results of your tests on this issue? Thank you!

@sheyd looping you in here. We're constantly receiving Heroku alerts saying response times have exceeded 1250ms. Would greatly appreciate your help if you have any insight into what may be causing longer response times.

itsjoekent commented 8 years ago

@jamjensen as I said above, im inclined to believe multiple things could be happening which are causing this. I'll take a deeper look at the logs & try to correlate them with the errors Heroku is reporting.

In addition the Aaron seeing the link between bulk import & bad response times (which makes sense), I've seen an uptick in failed request emails. Which to me says we have something else happening causing the app to freak out.

aaronschachter commented 7 years ago

Digging into some of the transactions, and some of the Northstar Signups query (which proxy to the Phoenix API Signups query) are slow. The middleware waits on the query result to determine which message to send.. which may account for the long middleware response times too.

We query the Signups index endpoint on every request to always get the current campaign status for the given user. Even if they've signed up already and we have a document, we make the request to get the latest quantity (which we present to the User when we say "we've got you down for [quantity] [nouns] [verbed]". It might be time to reconsider making the query to get latest Signup Status for each Gambit request.

aaronschachter commented 7 years ago

My suspicion is that it's the network requests to Northstar and Phoenix that cause our slow response times -- because we don't send a chatbot HTTP response back until those requests have finished (sending either the rendered chatbot message, or an error code).

@deadlybutter and I chatted today about always sending a 200 back once we receive the message. Slow Northstar and Phoenix transactions should still show up in New Relic... but this should help avoid sending the timeout error to Mobile Commons, and potentially stop the apdex alerts.

aaronschachter commented 7 years ago

Here's an example where it doesn't look like a network request is the issue... which doesn't exactly help troubleshoot.

A nice to have here would be adding custom attributes like User ID, Campaign ID, Signup ID, etc into New Relic: https://docs.newrelic.com/docs/agents/manage-apm-agents/agent-data/collect-custom-attributes

aaronschachter commented 7 years ago

Phoenix response times are being handled in https://github.com/DoSomething/devops/issues/261. #810 and work done in #849 close this issue on the Gambit side.

DoSomethingArchive / gambit-content

Investigate response times #704