yaelberger-commits closed this issue 2 years ago
Complete batch saving and Lambda API before baselining
Blocked while awaiting metrics
Hey team! Please add your planning poker estimate with ZenHub @andrewleith @ikenna-cds @jimleroyer @jzbahrai @sastels
This is a rather large ticket. I propose breaking it up into a few smaller ones, such as:
Regarding "Test limits of Notify to know what # of API requests per minute Notify can handle (currently 6,000 emails/minute)": I took a look at our busiest email day in the dump Jimmy provided, and more than 6,000 email notifications a minute were created on that day, 38 different times.
For the minutes that exceeded 6,000 email requests, the count went as high as 12,226 email notifications created at 00:02.
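For reference, here's a minimal sketch of the kind of per-minute count behind those numbers, assuming the dump is restored into a local Postgres database with a `notifications` table that has `created_at` and `notification_type` columns (assumed names, not necessarily the real schema):

```python
# Count email notifications created per minute and flag the minutes that
# exceeded the 6,000/minute figure. Table and column names are assumptions.
import psycopg2

QUERY = """
    SELECT date_trunc('minute', created_at) AS minute,
           count(*) AS emails_created
    FROM notifications
    WHERE notification_type = 'email'
    GROUP BY minute
    HAVING count(*) > 6000
    ORDER BY emails_created DESC;
"""

with psycopg2.connect("dbname=notify_dump") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for minute, emails_created in cur.fetchall():
            print(f"{minute:%Y-%m-%d %H:%M}  {emails_created} emails")
```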
We could turn this card into the Epic for Baselining and create sub-cards for each of the three tasks above. Would you like me to do that, @sastels?
Baseline Notify performance
Description
As a Product Manager of Notify, I need to know how Notify is performing today for email and SMS sending, so that I can plan whether more reliability and scaling work needs to be done, and communicate our SLA and SLOs to stakeholders and users.
Miro board "Evaluating Notify with Metrics" for reference https://miro.com/app/board/o9J_llx_4BQ=/
WHY are we building? We need to know our current baseline performance metrics so we can validate whether our reliability efforts have improved Notify's performance, and decide whether where we are today is good enough or more improvement work is needed.
WHAT are we building? A set of baseline performance metrics for Notify email and SMS message sending.
VALUE created by our solution: We will have data to drive our decisions on what to work on next, and our users will know what to expect from Notify's performance.
**Acceptance Criteria** (Definition of done)
To be refined through discussion with the team
Critical: know how we're performing in production. Second goal: how should we model the future? (Break this off into another piece.)
Given some context, when (X) action occurs, then (Y) outcome is achieved
[ ] Build awareness among the team so we all know what our current baseline is
[ ] Measure everything in our SLO https://docs.google.com/spreadsheets/d/1fU-FJ7THfWEqNhbpQ4r22MQipFwC7SP851ZQnOf68cM/edit#gid=0
[ ] Use historic data to baseline what we deliver to clients in production
[ ] Historically, what is our max sending rate?
[ ] What's our current sending capacity?
[ ] Priority queue, bulk queue and normal queue breaking points
[ ] Know the max delays we have seen 90-99% of the time for the various kinds of sending we do (bulk, manual, email, SMS); see the sketch after this list
[ ] Compare performance against the number of support tickets and number of incidents
[X] See how much our volume of messages sent has grown
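For the delay percentiles above, a minimal sketch of the calculation, assuming each notification row carries `created_at` and `sent_at` timestamps (assumed column names) pulled from the same kind of dump query as the earlier example:

```python
# Compute p90/p99 sending delays from (created_at, sent_at) timestamp pairs.
# Column names and the shape of `rows` are assumptions for illustration.
from statistics import quantiles

def delay_percentiles(rows):
    """Return (p90, p99) sending delay in seconds."""
    delays = [
        (sent_at - created_at).total_seconds()
        for created_at, sent_at in rows
        if sent_at is not None  # skip notifications that were never sent
    ]
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = quantiles(delays, n=100)
    return cuts[89], cuts[98]
```

Run this separately per notification type and queue (email vs. SMS; priority vs. bulk vs. normal) to get the per-category baselines the criteria call for.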
Measuring success and metrics: both reliability and speed
*We might have to use production data rather than a performance test (flag that we can improve how we measure, but this will take more time).