culturecreates / incident-reports

Reports on incidents in all products and services

2023-09-03 Corpo Website Down #7

Closed saumier closed 10 months ago

saumier commented 12 months ago

Incident Report

Summary

The corporate website was effectively unavailable for about 3 days. Our monitoring tool, Datadog, checks the site every 30 minutes. It reported response times far above the 2-second (2000 ms) threshold, with pages taking around 30 seconds to load.

Screenshot 2023-09-05 at 12 02 27 PM
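For illustration, the kind of check described in the summary can be sketched as follows. This is a minimal standalone sketch, not how Datadog implements its checks; the function names `measure_response_ms` and `is_degraded` are hypothetical, and only the 2000 ms threshold comes from this report.

```python
import time
import urllib.request

# Alert threshold taken from the report: 2 seconds (2000 ms).
THRESHOLD_MS = 2000


def measure_response_ms(url, timeout=60):
    """Fetch a URL and return the elapsed time in milliseconds.

    Hypothetical helper: a real monitor would also handle errors and retries.
    """
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return (time.monotonic() - start) * 1000


def is_degraded(response_ms, threshold_ms=THRESHOLD_MS):
    """Return True when a measured response time exceeds the threshold."""
    return response_ms > threshold_ms
```

With this sketch, a 30-second page load (`is_degraded(30_000)`) is flagged, while a 500 ms load is not.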

Timeline

Start: Saturday, Sep 02, 2023, 2 AM ET
End: Tuesday, Sep 05, 2023, 6 AM ET

Lessons learnt

We learnt that Lightsail has a limited 'CPU Burst Capacity'. Once it is exhausted, the website's response time degrades to an unacceptable level (around 20 seconds).

Although Lightsail instances display 2, 4, 8 vCPUs included in the price, you can’t use 100% of these CPU cores. Lightsail uses something called the ‘CPU Burst Capacity’ model to allocate resources to all instances. https://nestify.io/blog/aws-lightsail-wordpress/

In this case, the Lightsail metrics point to a sudden increase in requests to the website on September 2nd, which pushed the CPU into the burstable zone and then drained all of the CPU burst capacity. Burst capacity is meant for short spikes. The sustained burst (see graph below) used up all of the accumulated capacity.

Screenshot 2023-09-04 at 10 29 18 PM
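The drain described above can be pictured as a simple bucket that refills while CPU use is below a baseline and empties while it is above. The sketch below is a simplified model for illustration only: the 20% baseline, the capacity units, and the function names are assumptions, not Lightsail's actual accounting or the figures for our instance.

```python
def simulate_burst(utilization, baseline=20.0, capacity=100.0, max_capacity=100.0):
    """Return the remaining burst capacity after each time step.

    utilization: requested CPU % for each step.
    baseline: assumed CPU % above which the instance is in the burstable zone.
    capacity / max_capacity: assumed starting and maximum bucket size
    (arbitrary units, hypothetical values).
    """
    history = []
    for demand in utilization:
        # Below the baseline the bucket refills; above it the bucket drains.
        capacity += baseline - demand
        # The bucket can neither go negative nor exceed its maximum.
        capacity = max(0.0, min(max_capacity, capacity))
        history.append(capacity)
    return history


def steps_until_drained(history):
    """Return the 1-based step at which capacity first hits zero, or None."""
    for step, remaining in enumerate(history, start=1):
        if remaining == 0.0:
            return step
    return None
```

For example, three quiet steps at 10% followed by sustained load at 60% keep the bucket full at first and then empty it within three steps of the load starting. After that point the instance can no longer burst, which matches the slowdown we observed.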

It is not clear why traffic to the website suddenly increased, stayed high for 3 days, and then subsided. Nothing was changed on our side. Our best guess is a bot crawling the site, an attack, or an automatic software update that raised external network traffic.

Here are the incoming network traffic metrics, which correspond to the dates of the downtime.

Screenshot 2023-09-04 at 10 33 10 PM

Action items

Caitlin and Tammy are planning to redo the corporate website and have mandated our graphic artist/UX designer to propose ideas. We should wait for the plan for the new website before working on the hosting improvements.

saumier commented 3 months ago

This issue happened again in May 2024, lasting over 24 hours. As a result, a CloudFront distribution was placed in front of the Lightsail instance. The SSL certificate and setup are documented here: https://github.com/culturecreates/culture-creates-wiki/wiki/How-to-setup-SSL-certificate-on-culturecreates.com-Wordpress

Screenshot 2024-05-16 at 8 39 53 AM

Corpo Website Hosting Diagram drawio

Screenshot 2024-05-17 at 11 42 42 AM