Closed aaronschachter closed 7 years ago
Knocking on wood, but things seem quieter today. As of this morning, we no longer have mDatas posting to the legacy reportback/:campaignName endpoint (#706) -- we're sending members only to the chatbot endpoint for Campaigns.
Looking over Heroku metrics now,
Quite a few events are closely tied to other actions we've taken, but quite a few aren't (including the two errors fired this afternoon).
It also seems unrelated to throughput or CPU load (notice that the throughput chart on the bottom has much higher bars at times when no error fired).
Tomorrow I'm going to do some thorough profiling. These are the steps I'll take (in my head):
1. Use node --prof server.js instead of the current value.
2. heroku run bash --app ${testservername}
3. node --prof-process isolate-0xnnnnnnnnnnnn-v8.log > processed.txt
Hopefully this nets some valuable insight.
More info here from NodeJS docs: https://nodejs.org/en/docs/guides/simple-profiling/
Thanks @deadlybutter -- looking forward to seeing the results.
The other request that will be regular and worth looking into is POST /v1/signups -- Quicksilver posts to this endpoint to trigger sending the CampaignBot msg_menu_signedup_external message to a User when they sign up for a CampaignBot-enabled Campaign on the web.
The DonorsChooseBot mData / endpoint (chatbot?bot_type=donorschoose) won't be regularly used -- only when we run Science Sleuth (or any other SMS Game that should post a donation as part of the end game). We are currently posting to this endpoint on prod, though -- not sure when it's scheduled to close.
Good to know!
Ah crap, let me write this down before I forget. If we still have performance problems, we can always try running a Node.js cluster in the app. It looks like we have the underused resources to make it worth it (I've done this before, and it's not that hard).
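For reference, a minimal sketch of what the cluster idea could look like with Node's built-in cluster module. The helper names (workerCount, startCluster) and the default cap of 4 workers are hypothetical, not anything that exists in Gambit, and the right worker count depends on the dyno size.

```javascript
const cluster = require('cluster');
const os = require('os');

// Hypothetical helper: cap the worker count so a small dyno is not oversubscribed,
// but always run at least one worker.
function workerCount(cpuCount, maxWorkers) {
  return Math.max(1, Math.min(cpuCount, maxWorkers));
}

// Fork workers from the primary process; run the HTTP server in each worker.
// `startServer` is whatever function currently boots the app (an assumption here).
function startCluster(startServer, maxWorkers = 4) {
  // Newer Node versions expose isPrimary; older ones only have isMaster.
  const isPrimary = cluster.isPrimary !== undefined ? cluster.isPrimary : cluster.isMaster;
  if (isPrimary) {
    const count = workerCount(os.cpus().length, maxWorkers);
    for (let i = 0; i < count; i += 1) {
      cluster.fork();
    }
    // Replace a worker that dies so capacity stays constant.
    cluster.on('exit', () => cluster.fork());
  } else {
    startServer();
  }
}

// Usage (not invoked here):
// startCluster(() => require('http')
//   .createServer((req, res) => res.end('ok'))
//   .listen(process.env.PORT || 5000));
```

Clustering only helps if the process is CPU-bound; if the time is spent waiting on network calls, extra workers are unlikely to change response times much.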
loadtest -n 250 -c 25 -k -H "x-gambit-api-key:" -T "application/json" --data '{"phone": "5555555511", "keyword": "slothieboi"}' -m POST http://localhost:5000/v1/chatbot
Only just started but found this,
41 2.2% 2.4% RegExp: ^(\\/?|)([\\s\\S]*?)((?:\\.{1\,2}|[^\\/]+?|)(\\.[^.\\/]*|))(?:[\\/]*)$
(Every item recorded thereafter is less than 0.5% usage)
So it seems to be referring to this line https://github.com/DoSomething/gambit/blob/32aeaba8f1a306a70a5769ee432c58c11c76db8c/lib/helpers.js#L187
The problem is that even when I comment this code out, it consistently remains at around 1%:
44 1.2% 1.3% RegExp: ^(\\/?|)([\\s\\S]*?)((?:\\.{1\,2}|[^\\/]+?|)(\\.[^.\\/]*|))(?:[\\/]*)$
@aaronschachter so this is where we are a day after the fix we deployed yesterday. It's obviously better, but something still seems to be going wrong. Would you prefer I continue auditing or work on features?
I'm curious whether the Quicksilver Signup imports trigger these alerts. We've had StatHat on in production for 4 hours now. The POST /v1/signups route will query Northstar GET signups then GET users for each signup that Quicksilver sends -- not sure how often it queries the /v1/campaigns endpoints, but just a few times:
cc @sergii-tkachenko @deadlybutter
That could explain it. My only question would be why it happens sometimes and not others. (Events don't fire nearly often enough for it to be every time.)
@deadlybutter Just happened to have production logs opened while receiving another email alert, and the niche import was running. The niche import posts Signup IDs to the Gambit /signups endpoint -- Gambit queries Northstar for the Signup, then loads the User by the Signup's Northstar UID, in order to send the SMS to the Northstar User's mobile.
You mentioned running tests on the chatbot endpoint here https://github.com/DoSomething/gambit/issues/704#issuecomment-257994381 -- was there any insight? It might be worth running the same tests on the signups endpoint for any clues 🔍
niche import sounds like a bulk operation? In that case we might want to pull the logic you described into a worker process and queue the signups being imported. It would be fairly simple to do if you want me to work on that at some point.
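A rough sketch of the queueing idea, assuming an in-memory queue inside the same process. SignupQueue and processSignup are hypothetical names; a production version would more likely use a persistent queue and a separate worker process so that imported signups survive a restart.

```javascript
// Hypothetical in-memory queue: the /signups handler would only enqueue the
// Signup ID and return, while a limited number of jobs run at any one time.
class SignupQueue {
  constructor(processSignup, concurrency = 2) {
    this.processSignup = processSignup; // async (signupId) => void
    this.concurrency = concurrency;
    this.waiting = [];
    this.active = 0;
  }

  // Add a Signup ID and start work if a slot is free.
  enqueue(signupId) {
    this.waiting.push(signupId);
    this.drain();
  }

  // Start jobs until the concurrency limit is reached or the queue is empty.
  drain() {
    while (this.active < this.concurrency && this.waiting.length > 0) {
      const signupId = this.waiting.shift();
      this.active += 1;
      Promise.resolve()
        .then(() => this.processSignup(signupId))
        .catch((err) => console.error('signup job failed', signupId, err))
        .then(() => {
          // Free the slot and pick up the next waiting Signup, if any.
          this.active -= 1;
          this.drain();
        });
    }
  }
}
```

With a concurrency of 2, a bulk import of hundreds of Signup IDs would result in at most two Northstar lookups in flight at once, rather than one per incoming request.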
Yup, it's a bulk operation where Quicksilver is posting Signup events to the Gambit /signups endpoint to trigger Signup confirmation SMS messages. Just checked the logs again upon getting an email alert, and I'm beginning to think this could be the culprit.
Have you had any luck running tests on the chatbot and signups POST requests?
Niche import is currently not sending us Users, but we're still seeing alerts today, so it doesn't seem like it's the bulk import.
@aaronschachter Is it possible that other things are happening, and the Niche import is just one of them?
@deadlybutter since Niche import does not explain the long response times, there must be other unknown causes. We have no idea if these alerts are a non-issue or are seriously affecting the experience for a lot of users. It'd be super helpful to gain an understanding so that we can confidently close this issue and start on other gambit features. Could you resume auditing and continue to share the results of your tests on this issue? Thank you!
@sheyd looping you in here. We're constantly receiving Heroku alerts saying response times have exceeded 1250ms. Would greatly appreciate your help if you have any insight into what may be causing longer response times.
@jamjensen as I said above, I'm inclined to believe multiple things could be happening which are causing this. I'll take a deeper look at the logs and try to correlate them with the errors Heroku is reporting.
In addition to Aaron seeing the link between bulk import and bad response times (which makes sense), I've seen an uptick in failed request emails, which to me says something else is happening that's causing the app to freak out.
Digging into some of the transactions, some of the Northstar Signups queries (which proxy to the Phoenix API Signups query) are slow. The middleware waits on the query result to determine which message to send, which may account for the long middleware response times too.
We query the Signups index endpoint on every request to always get the current campaign status for the given user. Even if they've signed up already and we have a document, we make the request to get the latest quantity (which we present to the User when we say "we've got you down for [quantity] [nouns] [verbed]"). It might be time to reconsider making the query to get the latest Signup status for each Gambit request.
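One option, sketched under assumptions: cache the last Signup result per user and campaign for a short time and only query Northstar when the entry is missing or stale. SignupCache and its TTL are hypothetical and not part of Gambit; the trade-off is that the quantity shown to the User could be out of date by up to the TTL.

```javascript
// Hypothetical TTL cache keyed by user and campaign.
// `now` is injectable so expiry can be exercised without waiting.
class SignupCache {
  constructor(ttlMs, now = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map();
  }

  // Return the cached Signup, or undefined if it is missing or older than the TTL.
  get(userId, campaignId) {
    const key = `${userId}:${campaignId}`;
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt > this.ttlMs) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.signup;
  }

  // Store the Signup along with the time it was fetched.
  set(userId, campaignId, signup) {
    this.entries.set(`${userId}:${campaignId}`, { signup, storedAt: this.now() });
  }
}
```

A request handler would call get first and fall back to the Northstar query (followed by set) only on a miss.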
My suspicion is that it's the network requests to Northstar and Phoenix that cause our slow response times -- because we don't send a chatbot HTTP response back until those requests have finished (sending either the rendered chatbot message, or an error code).
@deadlybutter and I chatted today about always sending a 200 back once we receive the message. Slow Northstar and Phoenix transactions should still show up in New Relic... but this should help avoid sending the timeout error to Mobile Commons, and potentially stop the apdex alerts.
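A minimal sketch of the "always send a 200" idea, assuming the reply message can be delivered through a separate outbound call rather than in the HTTP response body. makeChatbotHandler and processMessage are hypothetical names, and req.body assumes a body parser has already run.

```javascript
// Hypothetical handler factory: acknowledge first, then do the slow work.
function makeChatbotHandler(processMessage, logError = console.error) {
  return function chatbotHandler(req, res) {
    // Acknowledge receipt immediately so the caller never waits on Northstar or Phoenix.
    res.statusCode = 200;
    res.end();

    // Continue in the background; failures are logged instead of being
    // returned to the caller, since the response has already been sent.
    Promise.resolve()
      .then(() => processMessage(req.body))
      .catch(logError);
  };
}

// Usage (not invoked here), e.g. with an Express-style router:
// app.post('/v1/chatbot', makeChatbotHandler(handleIncomingMessage));
```

Because errors no longer reach the caller as status codes, they would need to surface through logging or New Relic instead.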
Here's an example where it doesn't look like a network request is the issue... which doesn't exactly help troubleshoot.
A nice-to-have here would be adding custom attributes like User ID, Campaign ID, Signup ID, etc. to New Relic: https://docs.newrelic.com/docs/agents/manage-apm-agents/agent-data/collect-custom-attributes
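A sketch of how that tagging might look, assuming the New Relic Node agent's addCustomAttributes call is available in the installed agent version (older agents used a differently named method). tagTransaction and the attribute names are hypothetical.

```javascript
// Hypothetical helper: `agent` is expected to be the object returned by
// require('newrelic'); it is passed in so the helper can be exercised without it.
function tagTransaction(agent, ids) {
  const attributes = {};
  Object.keys(ids).forEach((key) => {
    // Skip missing values so partial context (e.g. no Signup yet) is still recorded.
    if (ids[key] !== undefined && ids[key] !== null) {
      attributes[key] = String(ids[key]);
    }
  });
  agent.addCustomAttributes(attributes);
  return attributes;
}

// Usage (not invoked here):
// tagTransaction(require('newrelic'), { userId, campaignId, signupId });
```

Calling this early in the request would let slow transactions be filtered by User, Campaign, or Signup when reviewing traces.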
Phoenix response times are being handled in https://github.com/DoSomething/devops/issues/261. #810 and work done in #849 close this issue on the Gambit side.
Heroku has been sending a number of response time alerts, which seem to have started the first day we pushed the CampaignBot code to production 😿