Closed jimleroyer closed 5 days ago
With the current limit in staging of 20 celery-sms-send pods, we can get up to 1250 SMS / min sent to the internal test number 6135550123. Note that these sends stop BEFORE making the boto call to Pinpoint.
Testing pushing up SMS sending on dev: https://docs.google.com/spreadsheets/d/1pQ9SFQZF9wFzX7I0z_Cjxt2S2URqpzedIHryO82Xl6g/edit?gid=959863917#gid=959863917
Didn't work on this Thursday, will do more testing on dev today.
Preliminary: changing the rate limit from 1/s to 5/s didn't make a difference. But raising it to 10/s sped up the throughput per pod :shrug:
Gathering more data, in particular to make sure everything else stays constant (number of pods, size of test)...
Things are scaling up now. Tested rate limits of 1/s, 5/s, 10/s, and 100/s.
Simulate the AWS network call latency for these tests https://github.com/cds-snc/notification-api/pull/2290
Going to rerun some of the 10/s tests on dev with the new network latency sleep().
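A rough sketch of the idea behind the latency sleep, not the code in the PR above: a stand-in sender sleeps for about as long as a real provider call would take, so each worker is held up the way it would be in a real send. The function name and the 0.1 s default are made up for illustration.

```python
import time


def fake_provider_send(to: str, body: str, latency_s: float = 0.1) -> dict:
    """Stand-in for the real provider call (hypothetical name).

    Sleeps to mimic network latency, then returns a fake response
    without contacting AWS.
    """
    time.sleep(latency_s)  # assumed latency; tune to match measured call times
    return {"to": to, "status": "simulated", "latency_s": latency_s}


if __name__ == "__main__":
    start = time.perf_counter()
    fake_provider_send("6135550123", "load test")
    print(f"elapsed: {time.perf_counter() - start:.3f}s")
```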
So we can get around 6000 sends / min using 15 scaling pods and 10/s task rate limit.
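For comparison, a back-of-the-envelope ceiling for that setup. This assumes the task rate limit applies per worker (Celery documents `rate_limit` as per worker instance, not global) and that each pod runs one worker; the second part is an assumption, not something measured here.

```python
def send_ceiling_per_min(pods: int, rate_limit_per_s: float, workers_per_pod: int = 1) -> float:
    """Upper bound on sends / min if every worker runs at its rate limit."""
    return pods * workers_per_pod * rate_limit_per_s * 60


if __name__ == "__main__":
    ceiling = send_ceiling_per_min(pods=15, rate_limit_per_s=10)
    observed = 6000  # sends / min reported above
    print(f"ceiling: {ceiling} / min, observed is {observed / ceiling:.0%} of that")
```

Under those assumptions the ceiling is 9000 sends / min, so the observed 6000 sits at roughly two thirds of it.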
Another thought: we could use 555-01* numbers to do real tests (i.e. do the complete send to boto)
Ran tests using default pool (one number) to 555 numbers to see what happens when we send more than 1 / sec
going to get rid of these extra SNS numbers that aren't in the pinpoint pools... Done!
So the SNS retries were just reporting "No quota left for account", so I think that was happening before any potential SNS throttling. Figured out where the SNS SMS monthly quota is and put in a request to raise that from the default $1 / month to $100 / month.
UPDATE: Can do this ourselves https://github.com/cds-snc/notification-terraform/pull/1550
I think we should:
Later work can look at figuring out a better way to have a 6000 fragment / min send rate while sending SMS of different fragment sizes
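One possible shape for that later work, purely as a sketch and not something we have built: a token bucket where each token is one fragment, so a 4-fragment SMS costs four times what a 1-fragment SMS does. The class name, the `burst` parameter and the in-process design are all assumptions; a real version would need shared state across pods.

```python
import time


class FragmentBucket:
    """Token bucket where each token is one SMS fragment (illustrative only)."""

    def __init__(self, fragments_per_min: float, burst: float, clock=time.monotonic):
        self.rate = fragments_per_min / 60.0  # tokens added per second
        self.capacity = burst                 # most tokens the bucket can hold
        self.tokens = burst
        self.clock = clock
        self.last = clock()

    def try_send(self, fragments: int) -> bool:
        """Spend `fragments` tokens if available; otherwise refuse the send."""
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if fragments <= self.tokens:
            self.tokens -= fragments
            return True
        return False


if __name__ == "__main__":
    bucket = FragmentBucket(fragments_per_min=6000, burst=100)
    print(bucket.try_send(4))  # a 4-fragment SMS spends 4 tokens
```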
Wrote up scripts to create and delete phone numbers. Will use to test in staging with 100 long codes and roller coaster tests.
:/ We have to match the keywords before we can add a phone number to a pool. On dev everything has the default keywords but on staging and prod we've added new keywords and made everything bilingual. Will have to tweak the phone number adding script (https://github.com/cds-snc/notification-terraform/pull/1551)
ok that's all good, but now I see that there's a limit of max 25 phone numbers for the staging account.
An error occurred (ServiceQuotaExceededException) when calling the RequestPhoneNumber operation: Service Quota Exceeded - Reason="PHONE_NUMBERS_PER_ACCOUNT"
Will reply to AWS today and also start bumping up long codes in production.
Added a few more long codes in prod. Got back to AWS and waiting on them again.
Hit max of 75 numbers in prod, put in request to AWS for more so we can get to 100 in default-pool.
Possible config for maxing out our prod default pool of 74 numbers:
Don't have the phone numbers in staging to really test, but should:
After release we will monitor system for:
If we get our extra 26 numbers we can bump up the hpa.
Looks like hpa of 3 is a bit better (seeing 2675 npm on dev with INTERNAL_TEST number).
We have 74 long codes at the moment that we can test with. We don't have parity with the production environment so it is a bit more difficult to test.
We have 2 AWS tickets open in support. One asks to raise the number of phone numbers in staging. The other is to get a Canadian emulation long code (which turned into an AWS feature request).
Will hopefully release the 4400 fpm config today.
We will bring our tests up to 4,400 fragments / minute today in the staging env. This should trigger a few rate failures because of the limited number of long codes available in our staging env pool.
Released to prod. Next step is to wait for AWS to get back to us about phone numbers.
Some notes on testing strategies: https://docs.google.com/document/d/1Gr7r_1_6vIMCM2BDLJsplJWZM25si05fecgkxjGuPjA
We should also figure out what to do about 3-4 fragment SMS. Currently a flood of them will overwhelm our sending capacity and give a lot of ThrottlingExceptions - I'll make a new card.
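For reference, a simplified way to estimate how many fragments a message takes, using the standard concatenated-SMS limits (160 characters alone or 153 per fragment for GSM-7, 70 or 67 for UCS-2). It ignores GSM-7 extension characters that count double, and the provider's own accounting may differ, so treat it as an estimate.

```python
import math


def sms_fragments(length: int, unicode: bool = False) -> int:
    """Estimate the fragment count for a message of `length` characters."""
    single, multi = (70, 67) if unicode else (160, 153)
    if length <= single:
        return 1
    # Longer messages are split, and each fragment loses room to a header.
    return math.ceil(length / multi)


if __name__ == "__main__":
    for n in (160, 161, 307, 460):
        print(n, "chars ->", sms_fragments(n), "fragment(s)")
```

A message rate tuned for 1-2 fragment SMS turns into 2-4 times the fragment rate when 4-fragment messages arrive at the same pace, which is where the ThrottlingExceptions come from.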
Max out sms sending without ThrottlingExceptions for 1-4 fragment sms
Have 99 long codes in prod now. Will leave in the 4400 fpm config for a week to verify no problems come up and then bump it up a bit closer to 6000 fpm.
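Where these fpm numbers come from, roughly, assuming each long code can sustain about 1 fragment per second (an assumption that lines up with the 74-number and 100-number figures in this thread, not a quoted AWS limit):

```python
def pool_capacity_fpm(long_codes: int, fragments_per_s_per_code: float = 1.0) -> float:
    """Fragments / min a pool can carry if every long code runs at its assumed rate."""
    return long_codes * fragments_per_s_per_code * 60


if __name__ == "__main__":
    for n in (74, 99, 100):
        print(f"{n} long codes -> {pool_capacity_fpm(n):.0f} fpm")
```

With 74 numbers that gives 4440 fpm, which is why the config sits at 4400, and 99 numbers gives 5940 fpm.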
We've tested this in Staging, things are in the ballpark and things are looking healthy! We will want to do a nice test in prod to make sure it's healthy as well (because it's already been released to prod).
PR to add billable units to Quicksight to allow us to monitor send rates there https://github.com/cds-snc/notification-terraform/pull/1572
Note that in the last 30 days we've averaged 1.2 fragments / sms, not 2
almost done :p
Waiting until Tuesday Oct 16 to monitor new config. After that will bump up to 6000 fragments / minute (assuming 2 fragments / sms).
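A quick conversion between the fragment budget and the number of SMS it covers, to show how much the fragments / sms assumption matters. The function name is just for illustration.

```python
def sms_per_min(fragment_budget_per_min: float, avg_fragments_per_sms: float) -> float:
    """SMS / min that fit within a fragments / min budget."""
    if avg_fragments_per_sms <= 0:
        raise ValueError("avg_fragments_per_sms must be positive")
    return fragment_budget_per_min / avg_fragments_per_sms


if __name__ == "__main__":
    print(sms_per_min(6000, 2))    # planning assumption: 2 fragments / sms
    print(sms_per_min(6000, 1.2))  # 30-day average noted above
```

At 2 fragments / sms, 6000 fpm covers 3000 SMS / min; at the observed 1.2 average the same budget covers about 5000.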
New related cards: "Monitor SMS send rate with new config" and "Max out sms sending without ThrottlingExceptions for 1-4 fragment sms"
OK, after reviewing this one we've determined that we've safely QA'ed this and it can be moved to Done.
Description
As a system op of GCNotify, I need to identify the current limits of the system, so that I can get past them once they are on the map.
As a business owner of GCNotify, I need to tell what is the current blocker for scaling up SMS so that I can actually scale up SMS.
WHY are we building?
To scale up SMS further, given that we now have a short code that is capable of sending 100 SMS/s.
WHAT are we building?
We want to push the limits to around 6,000 SMS/m (100 SMS/s) to fit the short code speed. If there are no errors, then rejoice!
VALUE created by our solution
Identify which limits or technical issues are holding us back from matching the short code speed.
Acceptance Criteria
Given the SMS stress test, when an error occurs, then we identify it within a task card with potential follow up actions.
QA Steps
Questions about AWS (figure out or ask them)