culturecreates / incident-reports

Reports on incidents in all products and services

2024-06-24 Database spike #15

Closed saumier closed 2 months ago

saumier commented 3 months ago

Incident Report

Summary

The service slowed to slightly above 1.5 seconds for Open API calls that list events. It gradually grew slower over the next 2 days, at times exceeding 10 seconds. The servers were restarted and service returned to normal.

Timeline

2024-06-24 16:00

The incident was opened in the Slack channel #footlight-cms after a couple of monitoring alerts were triggered because Open API calls took longer than 1.5 seconds to list events.

Screenshot 2024-06-26 at 9 08 23 AM
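For context, here is a minimal sketch (not the actual monitoring setup) of the kind of check behind these alerts: time the Open API events-listing call and flag it when it exceeds the 1.5 second threshold. The endpoint URL and the alerting hook are placeholders.

```typescript
// Hypothetical latency check: flag the events-listing call when it is slower
// than 1.5 seconds. The URL is a placeholder, not the real Open API endpoint.
const THRESHOLD_MS = 1500;

async function checkEventListLatency(url: string): Promise<void> {
  const start = Date.now();
  const res = await fetch(url);
  const elapsed = Date.now() - start;

  if (!res.ok || elapsed > THRESHOLD_MS) {
    // The real setup would post an alert to the #footlight-cms Slack channel.
    console.warn(`ALERT: list events took ${elapsed} ms (status ${res.status})`);
  } else {
    console.log(`OK: list events took ${elapsed} ms`);
  }
}

checkEventListLatency("https://openapi.example.com/events").catch(console.error);
```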

2024-06-26 9:00

The servers were restarted and systems returned to normal.

Analysis

There was a load spike on June 24th that caused production-server-7 to use burst CPU and never fully recover on its own.

The burst from Admin API (not Open API) triggered slow database queries logged at a frequency of 10,000 per second (see graph).

More investigation should be done to find the root cause. At first glance this does not appear to be an external Open API DoS attack.

Screenshot 2024-06-26 at 9 24 31 AM

Zooming in on the sub-section with the high burst: there are only 2 Open API requests but hundreds of thousands of MongoDB logs. The Admin API requests are also high (see screen grab above).

Screenshot 2024-06-26 at 9 25 04 AM Screenshot 2024-06-26 at 9 27 23 AM
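One way to continue this investigation is to summarize the slow queries MongoDB logged during the burst window. The sketch below is illustrative only: it assumes the database profiler is enabled (so `system.profile` is populated), and the connection URI, database name, time window, and 100 ms threshold are placeholders.

```typescript
// Hypothetical sketch: summarize profiled MongoDB operations slower than
// 100 ms during the burst window, grouped by namespace, to see which
// collections the Admin API burst was hitting.
import { MongoClient } from "mongodb";

async function summarizeSlowQueries(uri: string, dbName: string): Promise<void> {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const profile = client.db(dbName).collection("system.profile");

    const summary = await profile
      .aggregate([
        // Placeholder window: the afternoon of the spike.
        { $match: { millis: { $gt: 100 }, ts: { $gte: new Date("2024-06-24T16:00:00Z") } } },
        { $group: { _id: "$ns", count: { $sum: 1 }, avgMillis: { $avg: "$millis" } } },
        { $sort: { count: -1 } },
        { $limit: 20 },
      ])
      .toArray();

    console.table(summary);
  } finally {
    await client.close();
  }
}

summarizeSlowQueries("mongodb://localhost:27017", "footlight").catch(console.error);
```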

Lessons learnt

What went well

List of things that went well.

  1. We were alerted to the outage by automated bots before it affected users
  2. Gregory sounded the alarm in Slack

What went wrong

Things that could have gone better. Ideally these should result in concrete action items that have GitHub issues created for them and linked to under Action items.

  1. We did not fix the problem for 2 days.
  2. Gregory is not able to restart the db in production.

Where we got lucky

These are good things that happened to us but not because we had planned for them.

  1. Clients did not formally complain, so we think they didn't notice the slowdown.
sahalali commented 3 months ago

As part of my investigation, the following action items were created:

  1. Find the root cause of the issue after monitoring the stats.
  2. Increase the cores on the instance. This will require a new instance with 16 GB memory and 4 cores. Difference in billing: the 16 GB, 4-core instance costs USD 84/month and the 8 GB, 2-core instance costs USD 44/month.
  3. Monitor database stats (active connections and active queries) the next time CPU usage peaks. See the sketch after this list.
  4. Enhance the cache implementation to reduce MongoDB load; reducing db calls will improve latency (a sketch follows at the end of this comment).
  5. Monitor the service requests daily for potential DoS attacks over the next few days.
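Regarding item 3, a minimal sketch of how those stats could be snapshotted when CPU peaks: `serverStatus` for connection counts and the `currentOp` command for queries that are currently running. The connection URI and the 1 second threshold are placeholders.

```typescript
// Hypothetical sketch: snapshot active connections and long-running active
// operations. Run it (manually or from a cron) when CPU usage spikes.
import { MongoClient } from "mongodb";

async function snapshotDbLoad(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const admin = client.db("admin");

    // serverStatus reports current/available/active connection counts.
    const status = await admin.command({ serverStatus: 1 });
    console.log("connections:", status.connections);

    // currentOp lists in-progress operations; keep only active ones that
    // have been running for more than 1 second (placeholder threshold).
    const ops = await admin.command({
      currentOp: true,
      active: true,
      secs_running: { $gte: 1 },
    });
    console.log(`long-running active ops: ${ops.inprog.length}`);
    for (const op of ops.inprog) {
      console.log(op.ns, op.secs_running, JSON.stringify(op.command ?? {}).slice(0, 200));
    }
  } finally {
    await client.close();
  }
}

snapshotDbLoad("mongodb://localhost:27017").catch(console.error);
```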

cc @saumier
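Regarding item 4, a rough sketch of the kind of cache layer that would cut down MongoDB calls: a small in-memory TTL cache in front of the events query. The function names, cache key, and 60 second TTL are illustrative, not the actual Footlight implementation.

```typescript
// Hypothetical in-memory TTL cache in front of the events query.
// Repeated calls within the TTL are served from memory instead of MongoDB.
type CacheEntry<T> = { value: T; expiresAt: number };

const cache = new Map<string, CacheEntry<unknown>>();

async function cached<T>(key: string, ttlMs: number, load: () => Promise<T>): Promise<T> {
  const hit = cache.get(key) as CacheEntry<T> | undefined;
  if (hit && hit.expiresAt > Date.now()) {
    return hit.value; // cache hit: no db call
  }
  const value = await load(); // cache miss: query MongoDB once
  cache.set(key, { value, expiresAt: Date.now() + ttlMs });
  return value;
}

// Example usage: list events for a calendar, cached for 60 seconds.
// listEventsFromDb stands in for the real MongoDB query.
async function listEvents(calendarId: string): Promise<unknown[]> {
  return cached(`events:${calendarId}`, 60_000, () => listEventsFromDb(calendarId));
}

async function listEventsFromDb(calendarId: string): Promise<unknown[]> {
  return []; // placeholder for the actual query
}
```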

sahalali commented 3 months ago

As a quick response when the servers are not responding, restarting the instance will, from now on, automatically restart the services running on it. The doc explains how to restart the services and the instance.