saumier commented 11 months ago

Incident Report

Summary

Customer websites showed no events because the Footlight API server (https://api.footlight.io/api/) was returning 502 Bad Gateway. The issue lasted less than a day. The root cause was in the deployment script: the port number for the Open-API service in production was not updated from 80 to 3030 during the release of version 1.11.4.

Timeline

2023-09-29 04:36

The Open-API service went down after the release of v1.11.4

[07:47]

@saumier alerts @everyone in the Slack channel about the Open-API being down and no events on the customer websites.

[07:48]

@sahalali starts working on the issue.

[08:25]

@sahalali updates the migration script to fix the issue and releases a hotfix version of the Open-API.

[08:30]

@saumier marks the incident as resolved.

The following timeline was shared with the client. It was generated with Datadog (our server logging and monitoring software) but the "declared" timestamp is later than when the incident was declared on Slack. Slack was the primary communication tool.

What went well

The team responded within a minute of @saumier declaring the incident (07:47 to 07:48), and the hotfix was released by 08:25.

What went wrong

The incident started at 04:36 but was not identified until 07:47, more than three hours later.

Where we got lucky

The incident started at 04:36 AM and was fixed before 08:30 AM, so the customer impact was minimal.

Action items

Process improvements

To ensure a seamless release process, we should create a dedicated issue for each release and break the process down into subtasks. These subtasks should cover verifying the staging environment, merging with production, and thoroughly testing client websites and related Footlight services after the release.
To ensure faster detection of a server outage, I would like Datadog to alert at least one team member in each region (Canada and India), since our work hours differ because of the time zones. This way, both teams are notified and can take immediate action if any of our services goes down.
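
The post-release testing subtask could be partly automated with a small smoke test. The following is only a minimal sketch, not something the team has in place: it assumes the URL from the Summary is a suitable probe target and that any 5xx response (such as the 502 seen in this incident) or a connection failure counts as an outage. The function names are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable

# Assumption: the endpoint named in the Summary is a reasonable probe target.
API_URL = "https://api.footlight.io/api/"


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx; the status code is still useful.
        return err.code


def is_healthy(status: int) -> bool:
    """Treat any 5xx status (for example 502 Bad Gateway) as an outage."""
    return not 500 <= status <= 599


def smoke_test(url: str, fetch: Callable[[str], int] = fetch_status) -> bool:
    """Return True if the endpoint answers without a server error."""
    try:
        return is_healthy(fetch(url))
    except OSError:
        # Connection refused, DNS failure, or timeout: also an outage.
        return False
```

Calling smoke_test(API_URL) right after a release, and failing the release step when it returns False, would have flagged this incident at deployment time rather than three hours later. The fetch parameter exists only so the check can be exercised without network access.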

dev-aravind commented 10 months ago

@saumier I've updated the incident report according to my understanding of what happened that day. Please review it and suggest any changes you would like.

saumier commented 10 months ago

@dev-aravind Thanks for writing up this incident report. Your timeline was perfect. I added the "image timeline" only because it was communicated to the client.

I have 2 suggestions to add:

The explanation of what went wrong. You mention that the incident was caused by the deployment script. Why? What prevented the events from being returned by the Footlight API?
The "process improvement" section. Why did this wait for me to detect it in the morning? It should have been detected automatically so the team could have reacted during the daytime in India. So what tasks are needed to set up automatic detection, and how do we prevent this from happening again?

dev-aravind commented 10 months ago

@saumier I've made changes to the report. Please review it and let me know what you think of my process improvement ideas.

saumier commented 10 months ago

@dev-aravind Good improvement. Thx. Let's discuss at the next standup how we can have all our team receive the emails from Datadog. We have a list called support@culturecreates.com that includes all our development team. The emails for P1 could be sent to that list instead of admin@culturecreats.com. Once this is set up, I will close this issue.

saumier commented 10 months ago

We decided to use support@culturecreates.com for all Datadog monitors and synthetic tests.

culturecreates / incident-reports

2023-09-29 Customer websites had no events #8
