18F / pulse

How the federal .gov domain space is doing at best practices and policies.

Production and Staging have been experiencing downtime #740

Open gbinal opened 6 years ago

gbinal commented 6 years ago

In recent months, roughly 3-4 times a month, either production or staging has gone down. The site returns a 500 error with a message such as: 404 Not Found: Requested route ('pulse.app.cloud.gov') does not exist.

The New Relic alerts are catching it and a simple restart of the app on cloud.gov gets it back up, but obviously, this is not a good thing. Unfortunately, in the last week, it's been happening more often.
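For illustration only, here is a rough sketch of how an external check might recognize this failure mode and flag that a restart is needed. It is not what New Relic or cloud.gov actually runs. The names `ProbeResult`, `probe`, and `needs_restart` are hypothetical, and the marker strings are copied from the error message quoted above on the assumption that the message stays stable.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass

# Assumption: these fragments come from the error text quoted in this issue.
ROUTE_MARKER = "Requested route"
MISSING_MARKER = "does not exist"


@dataclass
class ProbeResult:
    status: int
    body: str


def probe(url: str, timeout: float = 10.0) -> ProbeResult:
    """Fetch the URL and return its status code and body, including error responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return ProbeResult(resp.status, resp.read().decode("utf-8", "replace"))
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the error object still carries status and body.
        return ProbeResult(err.code, err.read().decode("utf-8", "replace"))


def needs_restart(result: ProbeResult) -> bool:
    """Return True when the response looks like the outage described in this issue."""
    if result.status >= 500:
        return True
    # A missing route shows up as a 404 with a recognizable message.
    return (
        result.status == 404
        and ROUTE_MARKER in result.body
        and MISSING_MARKER in result.body
    )
```

A caller could run `needs_restart(probe(url))` on a schedule and page someone when it returns `True`. Connection failures such as DNS errors or timeouts are not caught here and would need separate handling.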

We're investigating the causes, but some initial ideas for solving it include:

micahsaul commented 6 years ago

Looking at New Relic, it seems like all of the errors I'm seeing are related to someone probing us for vulns. The question, though, is why that would cause the whole site to crash.

konklone commented 6 years ago

I would recommend we bump up the memory.