Callback URLs timeout and slows down all Celery workers

jimleroyer commented 2 months ago

Describe the bug

Because a callback URL that times out indefinitely block a thread and that Python works with one thread only via the GIL, GCNotify Celery workers that handle the callbacks and all tasks executed in the same space will slow down to a crawl. Workers for saving notification creations into the database also live into this space and hence the timing out of callbacks will prevent these, effectively blocking notifications pipeline processing. This has the potential to bring the system down to a point where even the status heartbeats go missing.

Bug Severity

SEV-1 Critical

To Reproduce

Steps to reproduce the behavior:

Misconfigure a test service with an API callback URL that times out indefinitely (over 60 seconds at least).
Sends a batch of notifications for over 15 minutes, to bring the system to a crawl.

Expected behavior

The system should still process the misconfigured callbacks without bringing it to a crawl.

Impact

Impact on Notify users: Notifications can block and not send.

Impact on Recipients: Notifications are delayed for as long as support team does not remove the timing out callback URL.

Impact on Notify team: Support time and violating our SLOs.

Additional context

There are a few changes that GDS did to amortize this risk.

Reduce timeout for service callback attempt to 5 seconds
Put service callback retries on a different queue
retry service callbacks on 429
Isolating the Celery workers from the other ones as it has a higher risk of leading to threads/processes exhaustion

jimleroyer commented 2 months ago

PR with upstream changes for Reduce timeout for service callback attempt to 5 seconds and Put service callback retries on a different queue: https://github.com/cds-snc/notification-api/pull/2153

jimleroyer commented 2 months ago

This was merged into staging and tested. We could replicate the exact issue before the fix and afterward, the issue is mitigated and not fixed. It removes a lot of the pressure on the workers. Further work could be done to increase the mitigation but this is very good fix.

This is ready for review for the staging environment.

jimleroyer commented 2 months ago

Pushed a related change but not a direct fix for this issue, which came from GDS upstream code repositories: https://github.com/cds-snc/notification-api/pull/2155

sastels commented 2 months ago

Steve will QA this

sastels commented 2 months ago

Steve QA'd!

cds-snc / notification-planning-core