Open jimleroyer opened 3 months ago
Looking at the files uploaded, they actually had an extra named column that wasn't in the template. I verified with a small send that this all works (ie the proper columns are used to construct the notification. In this case, they have a Date
column that isn't used in the template:
email address,Date,Date_EN,Date_FR,Date_Spanish
success@simulator.amazonses.com,2021-12-10,"December 10, 2021",10 décembre 2021,10 de diciembre de 2021
success@simulator.amazonses.com,2021-12-11,"December 11, 2021",11 décembre 2021,11 de diciembre de 2021
Note that the original files were .xlsx
files, Notify read and stored them as csvs in the above format.
recreated template on staging.
Send 50K row file. upload was fine
After clicking "Send all now" button, eventually got an error
When went back to dashboard, job was processing normally
Add some logging to help figure out in staging where the time is being spent: https://github.com/cds-snc/notification-api/pull/2130
In Athena with the query
select time, cast(elb_status_code as integer) as code, request_url, target_processing_time as api_lambda_time
from alb_logs
where request_url like '%start-job%'
order by target_processing_time desc
limit 100
We can see the IRCC requests timing out after 29 seconds, but all other recent jobs have taken less than 19 seconds. Don't know how big those jobs were though.
Logging is in staging! Running the big send again we see from this query):
So checking that the service will not go over its limit is taking about a minute!
Since we are sending emails, this is the code that is taking a minute to run:
elif template.template_type == EMAIL_TYPE:
check_email_daily_limit(service, len(list(recipient_csv.get_rows())))
scheduled_for = datetime.fromisoformat(data.get("scheduled_for")) if data.get("scheduled_for") else None
if scheduled_for is None or not scheduled_for.date() > datetime.today().date():
increment_email_daily_count_send_warnings_if_needed(service, len(list(recipient_csv.get_rows())))
There are three things here that could be the issue:
recipient_csv.get_rows()
might take a while, and we are calling it twice to get the length of the file. There might be a better way. In particular, recipient_csv
has a __len__
defined, so we should be able to just do len(recipient_csv)
.check_email_daily_limit()
increment_email_daily_count_send_warnings_if_needed()
With a bit more logging we get
So it's taking half the time now. I believe what is happening is that recipient_csv.get_rows()
is taking 30 seconds to run. Before we were calling it twice, but in the recent logging PR we used properties of recipient_csv
that cache the results, so now it's only being caused once.
Can get the notification count from the s3 object's metadata instead. preparing a PR...
This should fix it! https://github.com/cds-snc/notification-api/pull/2133
This should fix it! https://github.com/cds-snc/notification-api/pull/2135
testing on staging with the big IRCC file now takes about one second with no timeouts.
I will QA this morning.
Jimmy will QA for sure this time!
Tested and working with the same template that IRCC used and with a 50K entries spreadsheet. The upload took a few seconds (~5s) but there were no timing out. The period of time after clicking Send all was almost instantaneous. This can be considered fixed and working.
Describe the bug
The upload of a spreadsheet by IRCC triggered a 502 error on the API gateway, s it seems that the API it taking too long to process the request.
Bug Severity
See examples in the documentation
SEV-2 Major - This impedes user experience, indicates a potential failure upon spreadsheet upload and leads to confusion. The notifications to be sent are successful, hence this is not a critical issue.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
No errors
Impact
We get repeated 502 errors on these spreadsheet uploads. This can involve the support team to answer on warning or critical alerts.
QA
Screenshots
No screenshot available for now. The whole page said to retry the operation later, hence the error is not even properly surfaced as the website is blanking.
Additional context
This error was seen repeatedly during the mass send of 1M emails over 2 days by one of our user. The service and template involved.