datamade / how-to

Something is causing dynamic apps on Hot Dog Princess to become inaccessible for 15 minutes each Sunday #230

Closed by hancush 1 year ago

hancush commented 2 years ago

Description

We get notifications of downtime through Uptime Robot (#125). Every Sunday at 11 p.m. Central Standard Time / 12 a.m. Central Daylight Time / 5 a.m. UTC, we get notifications that several apps are down. These almost always resolve in 15 minutes or less.
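For anyone double-checking the time zone arithmetic, here is a minimal sketch that converts a 05:00 UTC alert into both Central offsets. The date is an arbitrary placeholder rather than a real alert timestamp, and the fixed UTC-6 / UTC-5 offsets are assumed for illustration instead of being read from a time zone database.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, assumed for illustration (no tz database lookup).
CST = timezone(timedelta(hours=-6), "CST")  # Central Standard Time
CDT = timezone(timedelta(hours=-5), "CDT")  # Central Daylight Time

# Hypothetical alert timestamp: an arbitrary Monday at 05:00 UTC.
alert_utc = datetime(2022, 1, 10, 5, 0, tzinfo=timezone.utc)


def local_clock(moment, tz):
    """Render a timezone-aware datetime as weekday, wall-clock time, and zone name."""
    return moment.astimezone(tz).strftime("%A %H:%M %Z")


print(local_clock(alert_utc, CST))
print(local_clock(alert_utc, CDT))
```

One thing this makes visible: when the alert lands at 11 p.m. on a Sunday under the standard-time offset, the same instant is already Monday in UTC, which matters when comparing against schedules written in UTC.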

Looking more closely at the notifications, they represent every app deployed on Hot Dog Princess for which we've configured notifications – except for My Reps.

I don’t see evidence that the server is restarting:

ubuntu@ip-10-0-0-208:~$ uptime
 13:27:47 up 94 days, 20:22,  1 user,  load average: 0.02, 0.06, 0.06

Back to My Reps: it’s a static HTML/JavaScript site that we serve directly through Nginx, i.e., it does not use Supervisor. That might be a clue, but I don’t see anything in the Supervisor logs indicating that it ever restarts, let alone every week.

I've also looked at the deployed crons. Our weekly cron scripts are configured to run at 6:47 a.m. UTC / 1:47 a.m. CDT. This does not align with the downtime. Neither do any of the regular crons.
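As a rough sketch of that comparison, the snippet below checks whether a job's scheduled time falls inside the observed downtime window. The window (starting 05:00 UTC, lasting up to 15 minutes) and the 06:47 UTC cron time come from this issue; the helper names and the `slack_minutes` parameter are hypothetical.

```python
from datetime import time

# Values taken from this issue, all in UTC.
DOWNTIME_START = time(5, 0)   # alerts begin around 05:00 UTC
DOWNTIME_MINUTES = 15         # alerts usually clear within 15 minutes
WEEKLY_CRON = time(6, 47)     # weekly cron scripts run at 06:47 UTC


def minutes_since_midnight(t):
    """Convert a time of day to minutes after 00:00."""
    return t.hour * 60 + t.minute


def overlaps_downtime(job_time, slack_minutes=0):
    """Return True if job_time falls within the downtime window.

    slack_minutes (hypothetical) widens the window on both sides to
    allow for jobs that start slightly before the first alert fires.
    """
    start = minutes_since_midnight(DOWNTIME_START) - slack_minutes
    end = minutes_since_midnight(DOWNTIME_START) + DOWNTIME_MINUTES + slack_minutes
    return start <= minutes_since_midnight(job_time) <= end


# Minutes between the end of the downtime window and the weekly cron run.
gap = minutes_since_midnight(WEEKLY_CRON) - (
    minutes_since_midnight(DOWNTIME_START) + DOWNTIME_MINUTES
)

print(overlaps_downtime(WEEKLY_CRON))
print(gap)
```

The same check could be run against each entry in the other crontabs, assuming their times are first normalized to UTC.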

I've also checked CloudWatch and Lambda in the AWS Console and do not see any automated maintenance tasks.

I'm a bit stumped. I don't really want to turn off the downtime notifications because they are useful throughout the week. But I'm also not sure where to look next.

hancush commented 2 years ago

I'll find a few minutes to pair with @fgregg on this during this cycle.

If we can't figure it out (and even if we can), we might want to escalate the priority of migrating existing apps from legacy infrastructure to our contemporary hosting patterns. I imagine this may have some interaction with whether we have active hosting/maintenance contracts with the impacted clients, as Heroku is more expensive than AWS.

smcalilly commented 1 year ago

@hancush is this still relevant?

hancush commented 1 year ago

Nah, I think we can close it.