jimleroyer commented 3 months ago

Description

As a system op of GCNotify, I need to identify the current limits of the system so that I can get past them once they are on the map.

As a business owner of GCNotify, I need to know what currently blocks scaling up SMS so that I can actually scale up SMS.

WHY are we building?

To scale up SMS further, given that we now have a short code capable of sending 100 SMS/s.

WHAT are we building?

We want to push the limits to around 6,000 SMS/min (100 SMS/s) to match the short code speed. If there are no errors, then rejoice!

VALUE created by our solution

Identify which limits or technical issues are holding us back from matching the short code speed.

Acceptance Criteria

Given the SMS stress test, when an error occurs, then we identify it within a task card with potential follow up actions.

[x] Come up with a testing strategy to test the limits within the GCNotify system (no boto calls)
[x] Come up with a testing strategy to test the limits with AWS (with boto calls)
[x] Crank up the SMS speed to 100 fragments/s.
[x] Identify errors within task cards.

QA Steps

[ ] Review stress test results and potential follow up actions by dev lead.

Questions about AWS (figure out or ask them)

What is the rate limit on boto calls to send SMS through Pinpoint?
Say we have 100 long codes. This means we should be able to send 100 SMS fragments per second. If we send at a higher rate for a few seconds, is there some slack? i.e. will AWS just buffer up the extras and send them later?
Does sending to numbers that don't exist (e.g. fictitious 555-01** numbers) hurt our reputation?

sastels commented 3 months ago

With the current limit in staging of 20 celery-sms-send pods, we can get up to 1250 SMS sends / min to the internal test number 6135550123. Note that these sends stop BEFORE making the boto call to Pinpoint.

sastels commented 2 months ago

Testing pushing up SMS sending on dev: https://docs.google.com/spreadsheets/d/1pQ9SFQZF9wFzX7I0z_Cjxt2S2URqpzedIHryO82Xl6g/edit?gid=959863917#gid=959863917

sastels commented 2 months ago

Didn't work on this Thursday, will do more testing on dev today.

sastels commented 2 months ago

Preliminary: changing the task rate limit from 1/s to 5/s didn't make a difference, but raising it to 10/s sped up the throughput per pod :shrug:

Gathering more data, in particular to make sure everything else stays constant (number of pods, size of test)...

sastels commented 2 months ago

Things are scaling up now. Tested rate limits of 1/s, 5/s, 10/s, and 100/s.

sastels commented 2 months ago

Simulate the AWS network call latency for these tests https://github.com/cds-snc/notification-api/pull/2290

sastels commented 2 months ago

Going to rerun some of the 10/s tests on dev with the new network latency sleep().
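
A minimal sketch of the idea behind the latency sleep, not the code in the linked PR: a stand-in sender waits for a fixed delay instead of calling boto, so each task takes roughly as long as a real send would. The name `fake_send_sms`, the 0.1 s default, and the returned dictionary are all assumptions made for illustration.

```python
import time
import uuid


def fake_send_sms(to, body, latency_seconds=0.1, sleep=time.sleep):
    """Pretend to send an SMS: wait like a network call would, then return a fake id.

    Hypothetical stand-in for the boto call; no request leaves the process.
    `sleep` is injectable so tests can run without actually waiting.
    """
    sleep(latency_seconds)  # simulate the round trip to AWS
    return {"MessageId": str(uuid.uuid4()), "to": to, "length": len(body)}
```

Calling `fake_send_sms("6135550123", "stress test")` in place of the real send lets the rate limit and pod count be exercised without touching AWS.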

sastels commented 2 months ago

So we can get around 6000 sends / min using 15 scaling pods and 10/s task rate limit.

We need to figure out what to do around SMS vs fragments (we average 2 fragments / SMS in prod). If we send 6000 2-fragment SMS per minute, that is 12,000 fragments / minute, twice what the short code can send, and AWS will not be happy.
We currently only have 51 long codes in prod, so we could only do about 3000 fragments / minute with the long codes. We may want to get another 49 long codes so that we have the same rate as the short code.
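
The arithmetic behind these numbers can be sketched as below. It assumes each long code sends one fragment per second, which is the working rule of thumb in this card rather than a confirmed AWS figure; the function names are made up for the example.

```python
def pool_fragments_per_minute(long_codes, fragments_per_second_per_code=1):
    """Fragments per minute a pool can send, assuming 1 fragment/s per long code."""
    return long_codes * fragments_per_second_per_code * 60


def sms_per_minute(fragments_per_minute, avg_fragments_per_sms):
    """Whole SMS per minute that fit inside a fragment budget."""
    return int(fragments_per_minute // avg_fragments_per_sms)


print(pool_fragments_per_minute(51))   # 3060
print(pool_fragments_per_minute(100))  # 6000
print(sms_per_minute(6000, 2))         # 3000
```

So at 2 fragments / SMS, a 6000 fragment / minute pool carries about 3000 SMS / minute.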

sastels commented 2 months ago

Another thought: we could use 555-01* numbers to do real tests (i.e. do the complete send to boto)

could ask AWS if this will impact our reputation
can send to AWS simulator number

sastels commented 2 months ago

had to do a few things to get dev using Pinpoint
- set Pinpoint pool env vars in send-sms deployments
- change log arns in configuration set (there was a bug in the initial script that is fixed now)

Ran tests using default pool (one number) to 555 numbers to see what happens when we send more than 1 / sec

sending 20: all sent, one per second. So there's some buffering there.
sending 100: 39 sent, 60 ThrottlingException, 60 retried with SNS and sent (SNS has 9 numbers for some reason)

going to get rid of these extra SNS numbers that aren't in the pinpoint pools... Done!

The SNS retries were just reporting "No quota left for account", so I think that was happening before any potential SNS throttling. Figured out where the SNS SMS monthly quota is and put in a request to raise that from the default $1 / month to $100 / month.

UPDATE: Can do this ourselves https://github.com/cds-snc/notification-terraform/pull/1550
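
For readers unfamiliar with the retry path seen in the 100-message run (Pinpoint first, then SNS when throttled), here is a rough sketch of that pattern. It is not the notification-api implementation: `ThrottledError`, `send_with_fallback`, and `make_limited_sender` are hypothetical names, and the senders are plain callables rather than boto clients.

```python
class ThrottledError(Exception):
    """Hypothetical stand-in for the ThrottlingException raised by the boto call."""


def send_with_fallback(messages, primary_send, fallback_send):
    """Try each message on the primary sender; on throttling, retry once on the fallback.

    Returns a tally of what happened. Exceptions from the fallback propagate.
    """
    tally = {"primary_sent": 0, "throttled": 0, "fallback_sent": 0}
    for message in messages:
        try:
            primary_send(message)
            tally["primary_sent"] += 1
        except ThrottledError:
            tally["throttled"] += 1
            fallback_send(message)
            tally["fallback_sent"] += 1
    return tally


def make_limited_sender(capacity):
    """Fake primary sender that accepts `capacity` messages, then throttles."""
    state = {"accepted": 0}

    def send(message):
        if state["accepted"] >= capacity:
            raise ThrottledError(message)
        state["accepted"] += 1

    return send


result = send_with_fallback(range(10), make_limited_sender(4), lambda m: None)
print(result)  # {'primary_sent': 4, 'throttled': 6, 'fallback_sent': 6}
```

A tally like this makes it easy to see at a glance how much traffic the fallback absorbed during a test run.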

sastels commented 2 months ago

I think we should:

set pod count to support 3000 sends / minute (and 10/s task rate limit)
get 49 more long codes so our default pool will support 6000 fragments / minute (same as the short code pool)

Later work can look at figuring out a better way to have a 6000 fragment / min send rate while sending SMS of different fragment sizes

sastels commented 2 months ago

Wrote up scripts to create and delete phone numbers. Will use to test in staging with 100 long codes and roller coaster tests.

sastels commented 2 months ago

:/ We have to match the keywords before we can add a phone number to a pool. On dev everything has the default keywords but on staging and prod we've added new keywords and made everything bilingual. Will have to tweak the phone number adding script (https://github.com/cds-snc/notification-terraform/pull/1551)

ok that's all good, but now I see that there's a limit of max 25 phone numbers for the staging account.

An error occurred (ServiceQuotaExceededException) when calling the RequestPhoneNumber operation: Service Quota Exceeded - Reason="PHONE_NUMBERS_PER_ACCOUNT"

Requested a quota increase.

sastels commented 2 months ago

Will reply to AWS today and also start bumping up long codes in production.

sastels commented 2 months ago

Added a few more long codes in prod. Got back to AWS and waiting on them again.

sastels commented 2 months ago

Hit max of 75 numbers in prod, put in request to AWS for more so we can get to 100 in default-pool.

sastels commented 2 months ago

Possible config for maxing out our prod default pool of 74 numbers:

change task rate limit to 10/s
set the hpa max to 4 celery-sms-send-scalable pods

Don't have the phone numbers in staging to really test, but should:

test this with INTERNAL_TEST number and get an idea of an upper limit on send rate

After release we will monitor system for:

ThrottlingException warnings
max sms send rate

If we get our extra 26 numbers we can bump up the hpa.
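
The gap to the target can be worked out from the one-fragment-per-second-per-long-code rule of thumb used in this card (an assumption, not a confirmed AWS figure). The helper names below are hypothetical.

```python
import math


def long_codes_needed(target_fragments_per_minute, fragments_per_second_per_code=1):
    """Long codes required to sustain a target rate, assuming 1 fragment/s per code."""
    per_code_per_minute = fragments_per_second_per_code * 60
    return math.ceil(target_fragments_per_minute / per_code_per_minute)


def extra_long_codes(target_fragments_per_minute, current_long_codes,
                     fragments_per_second_per_code=1):
    """How many more long codes the pool needs to reach the target (never negative)."""
    needed = long_codes_needed(target_fragments_per_minute, fragments_per_second_per_code)
    return max(0, needed - current_long_codes)


print(long_codes_needed(6000))     # 100
print(extra_long_codes(6000, 74))  # 26
print(74 * 60)                     # 4440 fragments / minute with the current pool
```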

sastels commented 1 month ago

Looks like hpa of 3 is a bit better (seeing 2675 npm on dev with INTERNAL_TEST number).

jimleroyer commented 1 month ago

We have 74 long codes at the moment that we can test with. We don't have parity between staging and the production environment, so it is a bit more difficult to test.

We have 2 AWS support tickets open. One asks to increase the number of phone numbers in staging. The other asks for a Canadian emulation long code (which turned into an AWS feature request).

sastels commented 1 month ago

Will hopefully release the 4400 fpm config today.

jimleroyer commented 1 month ago

We will bring our tests up to 4,400 fragments / minute today in the staging env. This should trigger a few rate failures because of the limited number of long codes available in our staging env pool.

sastels commented 1 month ago

Released to prod. Next step is to wait for AWS to get back to us about phone numbers.

Some notes on testing strategies: https://docs.google.com/document/d/1Gr7r_1_6vIMCM2BDLJsplJWZM25si05fecgkxjGuPjA

sastels commented 1 month ago

We should also figure out what to do about 3-4 fragment SMS. Currently a flood of them will overwhelm our sending capacity and give a lot of ThrottlingExceptions - I'll make a new card.

Max out sms sending without ThrottlingExceptions for 1-4 fragment sms
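
As background for that card, here is a sketch of how fragment counts are commonly derived from message text: 160 characters for a single GSM-7 message (153 per part when split), and 70 UTF-16 code units for a single UCS-2 message (67 per part when split). This is an illustration only; it is not necessarily how GCNotify computes billable units, and it ignores the edge case of an escape pair straddling a part boundary.

```python
import math

# GSM 03.38 basic character set (one septet each).
GSM7_BASIC = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Extension table: each character costs two septets (escape + character).
GSM7_EXTENDED = set("^{}\\[~]|€\f")


def sms_fragment_count(text):
    """Estimate how many fragments a message body needs."""
    if not text:
        return 0
    if all(ch in GSM7_BASIC or ch in GSM7_EXTENDED for ch in text):
        units = sum(2 if ch in GSM7_EXTENDED else 1 for ch in text)
        single, multi = 160, 153
    else:
        # Any character outside GSM-7 forces UCS-2; count UTF-16 code units.
        units = len(text.encode("utf-16-be")) // 2
        single, multi = 70, 67
    return 1 if units <= single else math.ceil(units / multi)


print(sms_fragment_count("a" * 160))  # 1
print(sms_fragment_count("a" * 161))  # 2
print(sms_fragment_count("ê" * 71))   # 2
```

Note that characters such as `ê` are outside GSM-7, so a message containing them reaches 2 fragments after only 70 characters.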

sastels commented 1 month ago

Have 99 long codes in prod now. Will leave the 4400 fpm config in for a week to verify no problems come up and then bump it up a bit closer to 6000 fpm.

P0NDER0SA commented 1 month ago

We've tested this in Staging, things are in the ballpark and things are looking healthy! We will want to do a nice test in prod to make sure it's healthy as well (because it's already been released to prod).

sastels commented 1 month ago

PR to add billable units to Quicksight to allow us to monitor send rates there https://github.com/cds-snc/notification-terraform/pull/1572

sastels commented 1 month ago

Note that in the last 30 days we've averaged 1.2 fragments / sms, not 2

P0NDER0SA commented 1 month ago

almost done :p

sastels commented 1 month ago

Waiting until Tuesday Oct 16 to monitor new config. After that will bump up to 6000 fragments / minute (assuming 2 fragments / sms).

P0NDER0SA commented 2 weeks ago

OK, after reviewing this one we've determined that we've safely QA'ed this and it can be moved to Done.

cds-snc / notification-planning-core

Push the current SMS limits to 6000 fragments / minute and see what happens #396
