yaelberger-commits closed this issue 2 years ago
Complete batch saving and Lambda API before baselining
Blocked while awaiting metrics
Hey team! Please add your planning poker estimate with ZenHub @andrewleith @ikenna-cds @jimleroyer @jzbahrai @sastels
This is a rather large ticket. I propose breaking it up into a few smaller ones, such as:
Regarding "Test limits of Notify to know what # of API requests per minute Notify can handle (currently 6,000 emails/minute)": I took a look at our busiest email day in the dump Jimmy provided, and more than 6,000 email notifications a minute were created on that day, 38 different times.
For the minutes that exceeded 6,000 email requests, the count went as high as 12,226 email notifications created at 00:02.
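For reference, here's a minimal sketch of the kind of per-minute count behind those numbers, assuming the dump is restored into a local Postgres database with a `notifications` table that has `created_at` and `notification_type` columns (assumed names, not necessarily the real schema):

```python
# Count email notifications created per minute and flag the minutes that
# exceeded the 6,000/minute figure. Table and column names are assumptions.
import psycopg2

QUERY = """
    SELECT date_trunc('minute', created_at) AS minute,
           count(*) AS emails_created
    FROM notifications
    WHERE notification_type = 'email'
    GROUP BY minute
    HAVING count(*) > 6000
    ORDER BY emails_created DESC;
"""

with psycopg2.connect("dbname=notify_dump") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for minute, emails_created in cur.fetchall():
            print(f"{minute:%Y-%m-%d %H:%M}  {emails_created} emails")
```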
We could turn this card into the Epic for Baselining and create sub-cards for each of the three tasks above. Would you like me to do that, @sastels?
Baseline Notify performance
Description
As a Product Manager of Notify, I need to know how Notify is performing today for email and SMS sending, so that I can plan whether more reliability and scaling work needs to be done, and communicate our SLA and SLOs to stakeholders and users.
Miro board "Evaluating Notify with Metrics" for reference https://miro.com/app/board/o9J_llx_4BQ=/
WHY are we building? We need to know our current baseline performance metrics so we can validate whether our reliability efforts have improved Notify's performance, and decide whether where we are today is good enough or more improvement work is needed.
WHAT are we building? A set of baseline performance metrics for Notify email and SMS message sending.
VALUE created by our solution: We will have data to drive our decisions on what to work on next, and our users will know what to expect from Notify's performance.
**Acceptance Criteria** (Definition of done)
To be refined through discussion with the team
Critical: know how we're performing in production. Second goal: how should we model the future? (Break this off into another piece.)
Given some context, when (X) action occurs, then (Y) outcome is achieved
[ ] Build awareness among the team so we all know what our current baseline is
[ ] Measure everything in our SLO https://docs.google.com/spreadsheets/d/1fU-FJ7THfWEqNhbpQ4r22MQipFwC7SP851ZQnOf68cM/edit#gid=0
[ ] Use historic data to baseline what we deliver to clients in production
[ ] Historically, what is our max sending rate?
[ ] What's our current sending capacity?
[ ] Priority queue, bulk queue and normal queue breaking points
[ ] Know the max delays we have seen 90-99% of the time for the various kinds of sending we do (bulk, manual, email, SMS); see the sketch after this list
[ ] Compare performance against the number of support tickets and number of incidents
[X] See how much our volume of messages sent has grown
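For the delay percentiles above, a minimal sketch of the calculation, assuming each notification row carries `created_at` and `sent_at` timestamps (assumed column names) pulled from the same kind of dump query as the earlier example:

```python
# Compute p90/p99 sending delays from (created_at, sent_at) timestamp pairs.
# Column names and the shape of `rows` are assumptions for illustration.
from statistics import quantiles

def delay_percentiles(rows):
    """Return (p90, p99) sending delay in seconds."""
    delays = [
        (sent_at - created_at).total_seconds()
        for created_at, sent_at in rows
        if sent_at is not None  # skip notifications that were never sent
    ]
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = quantiles(delays, n=100)
    return cuts[89], cuts[98]
```

Run this separately per notification type and queue (email vs. SMS; priority vs. bulk vs. normal) to get the per-category baselines the criteria call for.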
Measuring success and metrics: both reliability and speed
*We might have to use production data rather than a performance test (flag that we can improve how we measure, but this will take more time).