cds-snc / notification-planning-core

Project planning for GC Notify Core Team

Push the current SMS limits to 6000 fragments / minute and see what happens #396

Open jimleroyer opened 1 month ago

jimleroyer commented 1 month ago

Description

As a system op of GCNotify, I need to identify the current limits of the system, so that I can get past them once they are mapped out.

As a business owner of GCNotify, I need to know what the current blockers are for scaling up SMS, so that I can actually scale up SMS.

WHY are we building?

To scale up SMS further, given that we now have a short code capable of sending 100 SMS/s.

WHAT are we building?

We want to push the limits to around 6,000 SMS/min (100 SMS/s) to match the short code speed. If there are no errors, then rejoice!

VALUE created by our solution

Identify which limits or technical issues are holding us back from matching the short code speed.

Acceptance Criteria

Given the SMS stress test, when an error occurs, then we identify it within a task card with potential follow-up actions.

QA Steps

Questions about AWS (figure out or ask them)

  1. What is the limit on how fast we can make boto calls to send SMS through Pinpoint? (See the sketch below for the kind of call in question.)
  2. Say we have 100 long codes. This means we should be able to send 100 SMS fragments per second. If we send at a higher rate for a few seconds, is there some slack? i.e. will AWS just buffer up the extras and send them later?
  3. Does sending to numbers that don't exist (e.g. fictitious 555-01** numbers) hurt our reputation?
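
For reference, a minimal sketch of the kind of boto call question 1 refers to, assuming the `pinpoint-sms-voice-v2` client; the region, origination identity, and exact wrapper used by notification-api are placeholders here, not confirmed details.

```python
# Minimal sketch only; assumes the pinpoint-sms-voice-v2 client and placeholder identifiers.
import boto3

client = boto3.client("pinpoint-sms-voice-v2", region_name="ca-central-1")  # region is a placeholder

response = client.send_text_message(
    DestinationPhoneNumber="+16135550123",        # internal test number used in this thread
    OriginationIdentity="pool-or-short-code-id",  # hypothetical pool / short code identifier
    MessageBody="stress test fragment",
    MessageType="TRANSACTIONAL",
)
print(response["MessageId"])
```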
sastels commented 1 month ago

With the current limit in staging of 20 celery-sms-send pods, we can get up to 1,250 SMS / min sent to the internal test number 6135550123. Note that these sends stop BEFORE making the boto call to Pinpoint.
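
For reference, the per-pod throughput those numbers imply (plain arithmetic on the figures above, no other assumptions):

```python
pods = 20
sends_per_min = 1250
per_pod_per_min = sends_per_min / pods   # 62.5 sends / min / pod
per_pod_per_sec = per_pod_per_min / 60   # ~1 send / s / pod
```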

sastels commented 1 week ago

Testing pushing up SMS sending on dev: https://docs.google.com/spreadsheets/d/1pQ9SFQZF9wFzX7I0z_Cjxt2S2URqpzedIHryO82Xl6g/edit?gid=959863917#gid=959863917

sastels commented 4 days ago

Didn't work on this Thursday, will do more testing on dev today.

sastels commented 4 days ago

Preliminary: changing the rate limit from 1/s to 5/s didn't make a difference, but raising it to 10/s sped up the throughput per pod :shrug:

Gathering more data, in particular to make sure everything else stays constant (number of pods, size of test)...
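
For anyone reproducing this: the rate limit being varied here is presumably Celery's per-task `rate_limit`, which is enforced per worker instance (i.e. per pod) rather than globally. A minimal sketch, with a hypothetical task name and broker URL:

```python
# Sketch of the knob presumably being varied; "deliver_sms" and the broker URL
# are placeholders, not necessarily what notification-api uses.
from celery import Celery

app = Celery("sms", broker="redis://localhost:6379/0")

@app.task(name="deliver_sms", rate_limit="10/s")  # previously "1/s"; "5/s" made no difference, "10/s" did
def deliver_sms(notification_id: str) -> None:
    # look up the notification and make the boto call to Pinpoint here
    ...
```

Celery also lets you change this on running workers without a redeploy via `app.control.rate_limit("deliver_sms", "10/s")`, which is handy for this kind of test.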

sastels commented 4 days ago

Things are scaling up now. Tested rate limits of 1/s, 5/s, 10/s, and 100/s.

sastels commented 3 days ago

Simulating the AWS network call latency for these tests: https://github.com/cds-snc/notification-api/pull/2290
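
The actual change is in the PR above; for readers, a sketch of what simulating that latency can look like (the function name and latency range here are illustrative placeholders, not the values in the PR):

```python
import random
import time

def send_sms_simulated(phone_number: str, body: str) -> None:
    # Stand-in for the real Pinpoint send: sleep for roughly one network round trip.
    time.sleep(random.uniform(0.05, 0.15))  # placeholder latency range, not a measured figure
    # the real code path would make the boto call here
```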

sastels commented 3 days ago

Going to rerun some of the 10/s tests on dev with the new network latency sleep().

sastels commented 2 days ago

So we can get around 6,000 sends / min using 15 scaling pods and a 10/s task rate limit.
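
For context, how that observed rate compares to the configured ceiling, assuming the rate limit is enforced per pod:

```python
pods = 15
rate_limit_per_pod = 10                      # tasks / s, the Celery rate limit
ceiling = pods * rate_limit_per_pod * 60     # 9000 fragments / min theoretical
observed = 6000                              # fragments / min measured in this test
effective_per_pod = observed / (pods * 60)   # ~6.7 sends / s / pod
```

Per-pod throughput sits below the 10/s limit, which suggests something other than the rate limit (e.g. the simulated network latency) is the binding constraint at this setting.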

sastels commented 2 days ago

Another thought: we could use 555-01* numbers to do real tests (i.e. do the complete send to boto).

sastels commented 2 days ago

Ran tests using the default pool (one number) to 555 numbers to see what happens when we send more than 1 / sec.

Going to get rid of these extra SNS numbers that aren't in the Pinpoint pools... Done!

So the SNS retries were just reporting "No quota left for account", so I think that was happening before any potential SNS throttling. Figured out where the SNS SMS monthly quota is set and put in a request to raise it from the default $1 / month to $100 / month.
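
For reference, the quota in question appears to be the account-level SNS SMS monthly spend limit (in USD); raising it beyond the $1 default goes through an AWS support request, as described above. A quick way to check the current value with boto3:

```python
import boto3

sns = boto3.client("sns", region_name="ca-central-1")  # region is a placeholder

# MonthlySpendLimit is the account-wide SMS spend cap, in USD.
attrs = sns.get_sms_attributes(attributes=["MonthlySpendLimit"])
print(attrs["attributes"].get("MonthlySpendLimit"))  # e.g. "1" before the raise
```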

sastels commented 1 day ago

I think we should:

Later work can look at figuring out a better way to sustain a 6,000 fragment / min send rate while sending SMS of different fragment sizes.