apxltd / what-bugs

What bugs?

Major Site Outage #66

Closed dkfellows closed 8 years ago

dkfellows commented 8 years ago

The site is very down and has been so for a substantial amount of time. This would have been a non-reportable SNAFU under Discourse, but with NodeBB it is indicative of a serious problem.

[screenshot]

No idea what is wrong.

LB-- commented 8 years ago

[screenshot]

BenLubar commented 8 years ago

I disabled the notifications list and the cooties seem to have stopped.

And by "disabled" I mean I added this after this line:

                // TDWTF DEBUG 2016-03-29
                return callback(null, []);

/cc @julianlam
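
For readers following along, the short-circuit has roughly this shape (a minimal sketch; the function name and surrounding code are illustrative, not NodeBB's actual source):

    // Sketch of the hack: a callback-style notifications getter that reports
    // "no notifications" immediately, so the expensive lookup never runs.
    function getNotifications(uid, callback) {
        // TDWTF DEBUG 2016-03-29
        return callback(null, []);

        // ...the original implementation (database lookups, merging, sorting)
        // below this point is never reached while the debug line is in place...
    }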

julianlam commented 8 years ago

Git hash plus applicable error stack traces please and thank you :smile:

BenLubar commented 8 years ago

@julianlam No stack traces, but I added this:

https://github.com/BenLubar/NodeBB/commit/117b8d2cb27cfc8874e8b92adaf1d453b26ef026

I updated from NodeBB 1783a07 to e99d952 during the cooties but they didn't stop until I disabled that function.

Here are some snippets from IRC:

13:54 < BenLubar> 29/3 18:53 [39] - warn: [socket.io] slow callback - 1044732ms 
                  - uid: [redacted] - ip: [redacted] - event: notifications.get 
                  - params: null - err: null
...
14:57 < BenLubar> 1174710ms for notifications.get

BenLubar commented 8 years ago

[screenshot]

BenLubar commented 8 years ago

https://community.nodebb.org/topic/8396/issues-with-nodebb-perfomance

julianlam commented 8 years ago

Odd, if it was a crash from that bug, it would've been fixed in more recent commits.

BenLubar commented 8 years ago

Ah, I hadn't read the stack traces in that topic very closely. It is indeed a separate bug.

dkfellows commented 8 years ago

I do not know whether the notifications thing was the problem or the symptom, but the continued performance difficulties tend to indicate it was merely the symptom. For example, I'm currently seeing extremely long load times for small topics; eventually I either get a 504 Gateway Timeout or the topic loads, and it's apparently arbitrary which happens.

Hunting performance problems can be hard. The only way to do it is to keep on improving the instrumentation you're applying to service calls in the hope of catching the trouble red-handed.
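
(The commit linked earlier isn't reproduced here, but the general pattern for that kind of instrumentation is roughly the following sketch; the wrapper name, log format, and 1000 ms threshold are illustrative, not NodeBB's actual code.)

    // Sketch: wrap a callback-style handler and log a warning when the
    // callback takes longer than a threshold to fire.
    function instrument(eventName, handler) {
        return function (socket, params, callback) {
            var start = Date.now();
            handler(socket, params, function (err, result) {
                var elapsed = Date.now() - start;
                if (elapsed > 1000) {
                    console.warn('[socket.io] slow callback - ' + elapsed + 'ms - event: ' + eventName);
                }
                callback(err, result);
            });
        };
    }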

LB-- commented 8 years ago

[screenshot]

DoctaJonez commented 8 years ago

Do we know if the storm affected all 4 NodeBB instances, or just specific ones? If it affected all 4 instances simultaneously, that would tend to suggest a single point of failure, like MongoDB.

If the slowdown was in the Node backend, I'd expect individual instances to suffer while other instances are still OK.

Do we have any way of profiling the database and the node instances to see what's going on?

It'd be useful to profile the hosts to see whether CPU, memory, disk access, or network bandwidth is being saturated.
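
One cheap way to get a first signal from inside Node itself (a rough sketch, not something the forum actually ran; it won't show disk or network saturation, which needs host-level tools):

    // Sketch: log load average and memory usage once a minute.
    var os = require('os');

    setInterval(function () {
        var mem = process.memoryUsage();
        console.log(
            'loadavg=' + os.loadavg().map(function (n) { return n.toFixed(2); }).join('/') +
            ' rss=' + Math.round(mem.rss / 1048576) + 'MB' +
            ' heapUsed=' + Math.round(mem.heapUsed / 1048576) + 'MB' +
            ' freeMem=' + Math.round(os.freemem() / 1048576) + 'MB'
        );
    }, 60000);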

boomzillawtf commented 8 years ago

Search may also be a factor. I remember it being very slow yesterday and at least one user reported cootie storms when he started searching.

julianlam commented 8 years ago

Is everything on one t2.medium? It may be a better idea to put the database elsewhere if it isn't already.

llouviere commented 8 years ago

Timetable on enabling notification dropdown?

Is there an alternative?

BenLubar commented 8 years ago

[screenshots: 2016-03-30 at 11:23:53, 11:22:38, and 11:20:43]

It looks like we need more disk IO operations per second allocated.

AccaliaDeElementia commented 8 years ago

ouch, yeah that disk queue length is.... off the charts. ideally you want that sucker solidly under 1.0, and we seem to be north of 10 regularly.... no wonder we have issues!

AccaliaDeElementia commented 8 years ago

@llouviere alternative for notifications?

i think i can come up with something...

LB-- commented 8 years ago

Forums seem to be completely offline. SSL error over HTTPS and 404 over HTTP. Maintenance, I assume?

BenLubar commented 8 years ago

I stopped the AWS instance so I could snapshot the disk. Currently about 50% done. I hope the IP won't change when it comes back up.

llouviere commented 8 years ago

We could just use GitHub for our forums.

Pretty good reliability here.

pauljherring commented 8 years ago

=-o

LB-- commented 8 years ago

[screenshot]

julianlam commented 8 years ago

@BenLubar Any chance you can just use nginx to send Googlebot a 503 temporarily?

http://stackoverflow.com/questions/2786595/what-is-the-correct-http-status-code-to-send-when-a-site-is-down-for-maintenance

Edit: Maybe they'll respond to an HTTP 429

BenLubar commented 8 years ago

I've rate limited anyone with a user-agent matching /(bot|spider|slurp|crawler)/i to 1 request every 10 seconds (per IP) in nginx.
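
For reference, the nginx pattern for that kind of bot-only, per-IP limit is roughly the following (zone name and size are illustrative; nginx expresses "1 request every 10 seconds" as 6r/m):

    # Key the limit on the client IP only when the user agent looks like a bot;
    # an empty key means the request is not rate limited at all.
    map $http_user_agent $bot_limit_key {
        default                       "";
        ~*(bot|spider|slurp|crawler)  $binary_remote_addr;
    }

    # 6 requests per minute = 1 request every 10 seconds, per IP, bots only.
    limit_req_zone $bot_limit_key zone=bots:10m rate=6r/m;

    server {
        location / {
            limit_req zone=bots;
            # ...proxy_pass to NodeBB as usual...
        }
    }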

BenLubar commented 8 years ago

Looks like the rate limiting handled the bots and the better disk handled the large number of humans.

dkfellows commented 8 years ago

And another one; this one has been about an hour long so far (and started about 5 minutes before I attempted to use the site).

[screenshot]

DoctaJonez commented 8 years ago

Does this need opening as a new issue so it gets noticed? I think it was possibly a bit optimistic closing it in the first place.

The site seems to have been down for quite some time now.

(cc @BenLubar)

BenLubar commented 8 years ago

I just woke up and the site seems fine right now. Here's a graph:

[screenshot: 2016-04-04 at 08:09:43]

AccaliaDeElementia commented 8 years ago

so.... what happened between 0230 and 0440 then?

DoctaJonez commented 8 years ago

> so.... what happened between 0230 and 0440 then?

Could it be backups? Maintenance plans (log file purging, etc.)? Software updates?

AccaliaDeElementia commented 8 years ago

could be any of them. i asked because i don't know what it was and want to.

:-P

dkfellows commented 8 years ago

It hit again this morning (and is still ongoing as I write this), commencing at about 07:50 GMT. The nginx front-end is sometimes fast to respond and sometimes not, as if there's a resource exhaustion problem, but the back-end is solidly unresponsive.

dkfellows commented 8 years ago

Also, I assume you have read http://redis.io/topics/lru-cache about maxmemory tuning? If not, you really need to, as Redis defaults to being a memory hog.

BenLubar commented 8 years ago

maxmemory has been set to 100mb and maxmemory-policy to allkeys-lru since before this last set of cooties started.
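
In redis.conf terms that is just (restating the two settings above, nothing more):

    # Cap Redis at 100 MB and evict least-recently-used keys at the cap,
    # instead of the default behaviour of growing without bound.
    maxmemory 100mb
    maxmemory-policy allkeys-lru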

julianlam commented 8 years ago

@BenLubar output of ss -s? Run mongostat and see if there are items in the mongo write/read queue...