diaspd / github-blog

A blog using git hub API
https://github-blog-diaspd.vercel.app
4 stars 0 forks source link

Node.js March 17th Infrastructure #4

Open diaspd opened 1 year ago

diaspd commented 1 year ago

Starting on March 15th and going through to March 17th (with much of the issue being mitigated on the 16th), users were receiving intermittent 404 responses when trying to download Node.js from nodejs.org, or even accessing parts of the website.

What surfaced as one issue, users seeing 404s, was actually the result of multiple compounding issues all happening at once:

The Cloudflare config for nodejs.org had multiple Page Rules configured with "Cache Everything" enabled alongside "Browser/Edge Cache TTLs" being set. This resulted in any 404 responses from the origin server being cached by Cloudflare.

The origin server for nodejs.org was being overloaded, and due to a recent configuration change, was now erroneously returning 404 responses for valid files when it was overloaded and couldn't read them from its disk.

A build script bug caused by the recent migration of the website to Next.js resulted in the site being rebuilt and the entire Cloudflare cache being purged every five minutes, sending an abnormally high number of requests to the origin server and putting it under excessive load.

Together, these resulted in the origin server receiving far more requests than normal, failing to read files from its disk that should've existed, returning incorrect 404 responses, and Cloudflare then caching those 404s.

A large team came together to investigate this issue, with the full fix being put in place at approximately 8 PM UTC on the 16th. Many of the Cloudflare Page Rules that had been removed as part of this fix (that were causing the 404s to be cached) had cache TTLs for four hours, and so the full impact of this incident was not resolved until midnight UTC on the 17th.

Related to the 404s users were receiving, it was also discovered that requests to localised pages on nodejs.org weren't resolving as expected when a localised version didn't exist. Instead of falling back to the English version of the pages, they were returning a 404. A secondary update was made to the NGINX config for nodejs.org on the 17th that fixed this, as well as a further set of updates to reduce the load on the origin server.