cds-snc / notification-planning-core

Project planning for GC Notify Core Team
0 stars 0 forks source link

API gateway timing out on spreadsheet upload #308

Open jimleroyer opened 3 months ago

jimleroyer commented 3 months ago

Describe the bug

The upload of a spreadsheet by IRCC triggered a 502 error on the API gateway, s it seems that the API it taking too long to process the request.

Bug Severity

See examples in the documentation

SEV-2 Major - This impedes user experience, indicates a potential failure upon spreadsheet upload and leads to confusion. The notifications to be sent are successful, hence this is not a critical issue.

To Reproduce

Steps to reproduce the behavior:

  1. Upload a CSV spreadsheet containing 50K notifications, using simulator addresses is sufficient to trigger this.
  2. Click the Send All button!
  3. Get the timeout error.

Expected behavior

No errors

Impact

We get repeated 502 errors on these spreadsheet uploads. This can involve the support team to answer on warning or critical alerts.

QA

Screenshots

No screenshot available for now. The whole page said to retry the operation later, hence the error is not even properly surfaced as the website is blanking.

Additional context

This error was seen repeatedly during the mass send of 1M emails over 2 days by one of our user. The service and template involved.

sastels commented 3 months ago

Looking at the files uploaded, they actually had an extra named column that wasn't in the template. I verified with a small send that this all works (ie the proper columns are used to construct the notification. In this case, they have a Date column that isn't used in the template:

email address,Date,Date_EN,Date_FR,Date_Spanish
success@simulator.amazonses.com,2021-12-10,"December 10, 2021",10 décembre 2021,10 de diciembre de 2021
success@simulator.amazonses.com,2021-12-11,"December 11, 2021",11 décembre 2021,11 de diciembre de 2021

Note that the original files were .xlsx files, Notify read and stored them as csvs in the above format.

sastels commented 3 months ago

recreated template on staging.

Send 50K row file. upload was fine

image.png

After clicking "Send all now" button, eventually got an error

image.png

When went back to dashboard, job was processing normally

sastels commented 3 months ago

Add some logging to help figure out in staging where the time is being spent: https://github.com/cds-snc/notification-api/pull/2130

sastels commented 3 months ago

In Athena with the query

select time, cast(elb_status_code as integer) as code, request_url, target_processing_time as api_lambda_time
from alb_logs
where request_url like '%start-job%'
order by target_processing_time desc
limit 100

We can see the IRCC requests timing out after 29 seconds, but all other recent jobs have taken less than 19 seconds. Don't know how big those jobs were though.

sastels commented 3 months ago

Logging is in staging! Running the big send again we see from this query):

image.png

So checking that the service will not go over its limit is taking about a minute!

Since we are sending emails, this is the code that is taking a minute to run:

   elif template.template_type == EMAIL_TYPE:
        check_email_daily_limit(service, len(list(recipient_csv.get_rows())))
        scheduled_for = datetime.fromisoformat(data.get("scheduled_for")) if data.get("scheduled_for") else None

        if scheduled_for is None or not scheduled_for.date() > datetime.today().date():
            increment_email_daily_count_send_warnings_if_needed(service, len(list(recipient_csv.get_rows())))

There are three things here that could be the issue:

sastels commented 3 months ago

With a bit more logging we get

image.png

So it's taking half the time now. I believe what is happening is that recipient_csv.get_rows() is taking 30 seconds to run. Before we were calling it twice, but in the recent logging PR we used properties of recipient_csv that cache the results, so now it's only being caused once.

sastels commented 3 months ago

Can get the notification count from the s3 object's metadata instead. preparing a PR...

sastels commented 3 months ago

This should fix it! https://github.com/cds-snc/notification-api/pull/2133

sastels commented 3 months ago

This should fix it! https://github.com/cds-snc/notification-api/pull/2135

sastels commented 3 months ago

testing on staging with the big IRCC file now takes about one second with no timeouts.

jimleroyer commented 3 months ago

I will QA this morning.

sastels commented 3 months ago

Jimmy will QA for sure this time!

jimleroyer commented 3 months ago

Tested and working with the same template that IRCC used and with a 50K entries spreadsheet. The upload took a few seconds (~5s) but there were no timing out. The period of time after clicking Send all was almost instantaneous. This can be considered fixed and working.