culturecreates / incident-reports

Reports on incidents in all products and services
0 stars 0 forks source link

2023-09-29 Event details showing 404 on toutculture.ca #9

Closed saumier closed 10 months ago

saumier commented 11 months ago

Image

Incident Report

Summary

The client website toutculture.ca appeared to be running normally except when a user clicked to view an event details page resulting in a 404 error. This was caused by the server running in an infinite loop and finally timing out due to unintended data.(A place was created with containedInPlace field as the original place itself)

Timeline

2023-09-29 13:52

Tout culture website was showing 404 on events page.

[14:20]

@saumier alerts @everyone in the Slack channel about the tout-culture website not showing events.

[14:35]

@saumier restarts the production servers in attempt to resolve the issue, but the docker containers were not restarted. @saumier reports this to @sahalali.

[14:53]

@sahalali Manually restarts the containers but the event details were still not being fetched for tout-culture. (This was handled in this issue so that the services restart automatically when Lightsail restarts )

[15:42]

The issue as marked as resolved in Datadog as after removing some malformed place data from the production database temporarily fixed the issue.

[16:09]

@saumier opens issue to further look into this issue and also lists the places we need to restore after the issue was fixed.

2023-10-09 00:40

@dev-aravind worked on the issue found out the root cause being a place entity calling on itself as the containedInPlace causing an infinite loop. @dev-aravind also makes backend changes to prevent a user to set containedInPlace as the original place itself.

[04:58]

@AbhishekPAnil makes UI changes to not list the original place in the containedInPlace selection dropdown.

2023-10-10 04:51

@saumier Restores the places except the one place that contained itself.

Lessons learnt

What went well

  1. The incident was resolved in under 2 hours.
  2. The client Tout Culture was impressed by the team's quick response.

What went wrong

  1. The backend and frontend validations were missed by our team for this edge case.
  2. Other constraints slowed down the resolution of this issue such as the docker containers not restarting immediately.

Where we got lucky

  1. Removing a few places (place that contained a link to itself creating the infinite loop) temporarily fixed the issue.

Action items

Process improvements

  1. Adding more unit tests to the Open-API side.

  2. An automated test to check if all the possible GET APIs are returning data without errors as part of the deployment. The test should return the entities which might have inconsistencies so that the team can look into it. Deployment to production should only be done if this test passes.

saumier commented 10 months ago

@sahalali We should fill in the details of this incident report.

dev-aravind commented 10 months ago

@saumier I've updated the incident report too. Please review it and do give suggestions if you want to update this.

saumier commented 10 months ago

@dev-aravind Same suggestions. Please add ideas on how to improve the process.

dev-aravind commented 10 months ago

@saumier I've made changes to the report. Please review it and let me know about what you think about my process improvement ideas.

saumier commented 10 months ago

@dev-aravind Good job in adding the improvements ideas. Can you take it a step further by:

  1. adding the actual unit test that would have caught this error (a place with containedInPlace of itself)
  2. confirming that if the unit tests don't pass then the deployment to production fails.

For your second action item, I think this ties in nicely with the regression tests that Kim is setting up. His tests will check that the normal features continue to work with each build. Once Kim has them working (in a couple weeks) then we should discuss how to automatically run them using GitHub workflows.

There is one more thing I was thinking about. An incident like this that can appear over time and not always at the moment of deployment into production could be detected using a scheduled Datadog synthetic test. We already have one for Open API swagger page, but none for events and event details. Could you figure out how to call GET calendar/{calendar id}/events and GET /events/{event id} keeping in mind that the calendar and event ids are changing. Perhaps you can take a look at the Datadog sythetic tests and see how we could have a regular test of Open API for calendar events and event details.

dev-aravind commented 10 months ago

@saumier The unit test is up in staging right now and the datadog synthetics tests are also modified as per your suggestions. Please review them and let me know if you want any changes.

saumier commented 10 months ago

@dev-aravind Looks good for the Datadog synthetics. I added a tag product:footlight so we can filter to see all the parts of Footlight CMS in production.

Follow up questions:

  1. In which file did you add the unit test (a place with containedInPlace of itself)? I looked in event.service.spec.ts but couldn't find it.

  2. Any progress on the @sahalali idea of setting up Nest.js health endpoint?

dev-aravind commented 10 months ago

@saumier this is the test file

saumier commented 10 months ago

@dev-aravind When I look at the test, it seems that a new Place can still be created (async addPlace) with a containedInPlace that is the same id as the Place. As this can happen when Aggregator creates new places, it could shut down our servers like it did before. If I am right, then please create a test and then fix the code and check that the test passes. Let me know if this makes sense.

dev-aravind commented 10 months ago

@saumier A new test pair of tests are up in staging and also I also added some validation to further check the containedInPlace issue.

https://github.com/culturecreates/footlight-calendar-api/pull/715/files