datahubio / datahub-v2-pm

Project management (issues only)

Investigate the reason behind website crash on 30-31 Aug #247

Closed anuveyatsu closed 5 years ago

anuveyatsu commented 5 years ago

The website went down on 31 August at around 5:30 AM GMT. Here is the memory usage of the frontend service:

[Screenshot: frontend service memory usage, 2018-08-31 at 11:34:45]

We've just restarted the frontend service to resolve the problem quickly.

Acceptance criteria

zelima commented 5 years ago

Unfortunately, we don't have much to analyze here, since we lost the logs when we redeployed the services. However, we got an email today from GCE about network vulnerabilities that might be related to this. They say:

US-CERT recently disclosed security vulnerabilities CVE-2018-5390 and CVE-2018-5391. These are networking vulnerabilities that increase the effectiveness of denial of service (DoS) attacks against vulnerable systems. All Google Kubernetes Engine (GKE) nodes are affected by these vulnerabilities, and we recommend that you upgrade to the latest patch version, as we detail below.

As an action, I've upgraded the clusters to version 1.10.5-gke.4, as recommended in the email.

Besides, we've built the datahub-health service (https://travis-ci.org/datahq/datahub-health), which runs daily on a schedule and notifies us via email when something goes wrong. It's scheduled for 12:10 PM GMT, so we will be aware that something is wrong within the working day.
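For anyone reading this later, here is a minimal sketch of the kind of check such a scheduled job might run. It is not the actual datahub-health code: the function names (`fetch_status`, `check_endpoints`, `build_alert`), the "anything other than HTTP 200 is a failure" rule, and the email subject line are all assumptions made for illustration.

```python
import urllib.error
import urllib.request
from email.message import EmailMessage


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrives."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, etc.
        return None


def check_endpoints(urls, fetch=fetch_status):
    """Return (url, status) pairs for every endpoint that did not return 200.

    `fetch` is injectable so the check can be exercised without network access.
    """
    failures = []
    for url in urls:
        status = fetch(url)
        if status != 200:
            failures.append((url, status))
    return failures


def build_alert(failures, sender, recipient):
    """Build (but do not send) an email summarising the failing endpoints."""
    msg = EmailMessage()
    # Hypothetical subject line, chosen for this sketch only.
    msg["Subject"] = f"datahub health check: {len(failures)} endpoint(s) failing"
    msg["From"] = sender
    msg["To"] = recipient
    lines = [
        f"{url}: {'no response' if status is None else status}"
        for url, status in failures
    ]
    msg.set_content("\n".join(lines))
    return msg
```

If `check_endpoints` returns a non-empty list, the resulting message could be handed to `smtplib.SMTP.send_message`. Scheduling (daily at 12:10 PM GMT in our case) is left to whatever runs the job.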

Closing as FIXED. Feel free to reopen if this comes up again.