Open jimleroyer opened 2 months ago
PR with upstream changes for Reduce timeout for service callback attempt to 5 seconds
and Put service callback retries on a different queue
:
https://github.com/cds-snc/notification-api/pull/2153
This was merged into staging and tested. We could replicate the exact issue before the fix and afterward, the issue is mitigated and not fixed. It removes a lot of the pressure on the workers. Further work could be done to increase the mitigation but this is very good fix.
This is ready for review for the staging environment.
Pushed a related change but not a direct fix for this issue, which came from GDS upstream code repositories: https://github.com/cds-snc/notification-api/pull/2155
Steve will QA this
Steve QA'd!
Describe the bug
Because a callback URL that times out indefinitely block a thread and that Python works with one thread only via the GIL, GCNotify Celery workers that handle the callbacks and all tasks executed in the same space will slow down to a crawl. Workers for saving notification creations into the database also live into this space and hence the timing out of callbacks will prevent these, effectively blocking notifications pipeline processing. This has the potential to bring the system down to a point where even the status heartbeats go missing.
Bug Severity
SEV-1 Critical
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The system should still process the misconfigured callbacks without bringing it to a crawl.
Impact
Impact on Notify users: Notifications can block and not send.
Impact on Recipients: Notifications are delayed for as long as support team does not remove the timing out callback URL.
Impact on Notify team: Support time and violating our SLOs.
Additional context
There are a few changes that GDS did to amortize this risk.