cds-snc / notification-planning-core

Project planning for GC Notify Core Team

Push the limits for the SMS sending rate using Pinpoint #402

Open jimleroyer opened 3 months ago

jimleroyer commented 3 months ago

Description

As an ops lead, I want to know the highest rate at which I can send SMS with GC Notify, so that I know how far we can push the system.

As a business owner, I need to know the current GC Notify SMS sending rate limit, so that I can adjust the daily and annual SMS limits.

WHY are we building?

We need to increase our SMS sending limit and, for that, we need to know our current capacity now that Pinpoint has been introduced as a sending mechanism and the short code has been acquired.

WHAT are we building?

We are testing how high an SMS sending rate we can reach via AWS Pinpoint. To get there, we might want to increase the number of Kubernetes pods and adjust our Karpenter/scale set configuration.

VALUE created by our solution

The ability for each service to send more SMS per day and per year.

Acceptance Criteria

QA Steps

sastels commented 3 months ago

Testing first with the internal test number, so going through Notify but NOT making the boto call to Pinpoint. This will give us an idea of how much we can send before we hit Notify bottlenecks (database / network / k8s / ?).

Tested by uploading large CSVs (20K-40K rows) of SMS to 6135550123.
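
A minimal sketch of how such a test file could be generated. The single "phone number" column header is an assumption about the bulk-upload format, and `build_test_csv` is a hypothetical helper, not something from the Notify codebase; adjust the header (and add any personalisation columns) to match the template being tested.

```python
import csv
import io

# Assumption: the bulk-upload CSV needs only a "phone number" column.
TEST_NUMBER = "6135550123"  # internal test number mentioned above


def build_test_csv(rows: int, number: str = TEST_NUMBER) -> str:
    """Return CSV text with one header line and `rows` identical recipients."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["phone number"])
    for _ in range(rows):
        writer.writerow([number])
    return buffer.getvalue()
```

Writing `build_test_csv(40_000)` to a file gives a 40K-row upload similar to the ones used in these tests.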

Data moved to this document

Summary: (note that the current state in production is 20 scalable pods)

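| primary pods | scalable pods | total pods | internal send rate (SMS / min) | rate / total pods (SMS / min) |
|---|---|---|---|---|
| 3 | 20 | 23 | 1250 | 54 |
| 3 | 30 | 33 | 1800 | 55 |
| 3 | 40 | 43 | 2320 | 54 |

sastels commented 3 months ago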

Staging remains set at a maximum of 40 scalable pods.

Future work:

We're still rate limiting SMS to about one per second. Since we're getting about 54 SMS / minute per pod, close to the 60 / minute that 1/s allows, this rate limit appears to be applied per pod rather than per worker process. We should increase this rate limit and see what results we get. For example, say we keep 30 pods and triple the rate limit to 3/s. This setting is in the .env files; currently, for both staging and production, we have:

CELERY_DELIVER_SMS_RATE_LIMIT=1/s
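
To make the per-pod reading concrete, here is a rough back-of-the-envelope check. It assumes the 1/s limit is enforced once per pod (which is what the numbers suggest, not something confirmed here) and uses the pod counts and send rates from the summary table; the function names are illustrative only.

```python
# Observed staging results from the summary table: (total pods, SMS per minute).
OBSERVED = [(23, 1250), (33, 1800), (43, 2320)]


def ceiling_per_minute(pods: int, per_pod_per_second: float) -> float:
    """Upper bound on sends per minute if the limit applies once per pod."""
    return pods * per_pod_per_second * 60


def report(per_pod_per_second: float = 1.0) -> list[tuple[int, int, float, float]]:
    """Return (pods, observed, ceiling, observed / ceiling) for each test run."""
    rows = []
    for pods, observed in OBSERVED:
        ceiling = ceiling_per_minute(pods, per_pod_per_second)
        rows.append((pods, observed, ceiling, round(observed / ceiling, 2)))
    return rows


for pods, observed, ceiling, ratio in report():
    print(f"{pods} pods: observed {observed}/min, ceiling {ceiling:.0f}/min, ratio {ratio}")
```

Under that assumption, every run lands at roughly 90% of the ceiling implied by the limit, which is consistent with the rate limit being the binding constraint. By the same arithmetic, 30 scalable plus 3 primary pods at 3/s would have a ceiling of 5,940 SMS / minute, provided nothing else becomes the bottleneck first.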

Also: I think we should throw sleep statements into celery 😱

sastels commented 2 months ago

These deliver_sms tasks are only taking about 0.1 seconds to run, so I'm not sure why the pods are each only running one per second.

sastels commented 2 months ago

Will continue to investigate and talk offline

sastels commented 2 months ago

Working on dev since it's back up and we can manually poke at it!

sastels commented 2 months ago

Changed the dev celery-sms-send-scalable and celery-sms-send-primary deployments to have CELERY_DELIVER_SMS_RATE_LIMIT set to "100/s".

sastels commented 2 months ago

Got dev "sending" around 8000 or so SMS per minute to our internal test number (i.e. not handing them to AWS to send at the end) by using 43 pods and setting the task rate limit to 100/s. 32 (out of 40K) are hung though, so we probably pushed the system a bit too hard. :this-is-fine-fire:
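
For a sense of scale, the figures in this comment can be put side by side. This is only arithmetic on the numbers quoted above, again assuming the limit applies per pod; `per_second` is a hypothetical helper for reading rate strings like the one in the env var, and 8000 is approximate.

```python
def per_second(limit: str) -> float:
    """Convert a rate string such as "100/s" or "60/m" to sends per second."""
    count, _, unit = limit.partition("/")
    seconds = {"s": 1, "m": 60, "h": 3600}[unit or "s"]
    return float(count) / seconds


PODS = 43
OBSERVED_PER_MINUTE = 8000  # "around 8000 or so", so treat as approximate

# Ceiling the configured limit would allow if it applies once per pod.
configured_ceiling = per_second("100/s") * PODS * 60

# What each pod actually achieved, in sends per second.
observed_per_pod = OBSERVED_PER_MINUTE / PODS / 60

# Share of the 40K upload that was left hung.
hung_fraction = 32 / 40_000

print(f"ceiling: {configured_ceiling:.0f}/min")
print(f"observed per pod: {observed_per_pod:.1f}/s")
print(f"hung: {hung_fraction:.2%}")
```

Each pod ends up at about 3 sends per second, far below the 100/s setting, so at this point the rate limit is no longer what caps throughput and the bottleneck is somewhere else in the system.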

P0NDER0SA commented 2 months ago

Steve is gonna get to this one today!

sastels commented 2 months ago

Going to switch to "Push the current SMS limits to trigger potential errors" for now. That other card focuses on getting to 6000 SMS / min, which is a good first step before pushing higher.

sastels commented 2 months ago

Bumped up pods in staging to do some tests.

sastels commented 2 months ago

Using 27 sms-send-scalable pods, running the roller coaster test while occasionally uploading 40K SMS. Getting up to a send rate of 10K notifications / min.

sastels commented 1 month ago

Some notes on testing strategies: https://docs.google.com/document/d/1Gr7r_1_6vIMCM2BDLJsplJWZM25si05fecgkxjGuPjA