culturecreates / incident-reports

Reports on incidents in all products and services

2024-06-16 IR-6: Footlight CMS slow #14

Closed: saumier closed this issue 2 months ago

saumier commented 2 months ago

Incident Report

Summary

Response times for client sites calling event listings were very slow, averaging 5 to 10 seconds. The problem went away after 3 days. It was caused by Culture Mauricie, which made about 2,000 GET requests for event listings in 1 minute on June 13 at 4AM (ET?). This seems to have pushed our database into a CPU burst from which it only recovered after 3 days.

Timeline

See Slack https://culturecreates.slack.com/archives/C02B18SN3FU/p1718468746756419

See Datadog https://app.datadoghq.com/incidents/6

[Screenshot: 2024-06-16 at 12:30:20 PM]

Recovery showing CPU within sustainable range:

[Screenshot: 2024-06-16 at 11:55:58 AM]
saumier commented 2 months ago

Here is some data on the load that caused the incident:

Datadog

[Nest] 77 - 06/11/2024, 4:10:03 PM LOG [HTTP] Responded to GET /calendars/culture-mauricie/events?limit=15&page=15 with http 200 - ::ffff:172.17.0.1

Graph showing calls per second. Peak at 524 calls per second from IP 172.17.0.1
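
For anyone who wants to reproduce the calls-per-second figure from raw logs rather than from the Datadog graph, here is a minimal sketch. It assumes every request line has the same shape as the single sample above, and that the date is month/day/year; both are guesses from one line, not a documented log format. The function names `calls_per_second` and `peak` are made up for this example.

```python
import re
from collections import Counter
from datetime import datetime

# Strips ANSI colour codes such as ESC[32m that Nest adds to console output.
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")

# Assumed line shape, modelled on the one sample line in this report.
LINE_RE = re.compile(
    r"(?P<ts>\d{2}/\d{2}/\d{4}, \d{1,2}:\d{2}:\d{2} [AP]M)\s+LOG\s+\[HTTP\]\s+"
    r"Responded to GET (?P<path>\S+) with http (?P<status>\d{3}) - (?P<ip>\S+)"
)


def calls_per_second(lines):
    """Count GET responses per (second, client IP); skip lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(ANSI_RE.sub("", line))
        if match is None:
            continue
        # Assumption: month/day/year order with a 12-hour clock.
        ts = datetime.strptime(match.group("ts"), "%m/%d/%Y, %I:%M:%S %p")
        counts[(ts, match.group("ip"))] += 1
    return counts


def peak(counts):
    """Return ((second, ip), count) for the busiest second, or None if empty."""
    if not counts:
        return None
    return max(counts.items(), key=lambda item: item[1])
```

Feeding the exported log lines to `calls_per_second` and then calling `peak` on the result gives the busiest second and the IP behind it, which can be checked against the graph.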

[Screenshot: 2024-06-17 at 8:54:48 AM]