dkfellows closed this issue 8 years ago
I disabled the notifications list and the cooties seem to have stopped.
And by "disabled" I mean I added this after this line:
```js
// TDWTF DEBUG 2016-03-29
return callback(null, []);
```
/cc @julianlam
Git hash plus applicable error stack traces please and thank you :smile:
@julianlam No stack traces, but I added this:
https://github.com/BenLubar/NodeBB/commit/117b8d2cb27cfc8874e8b92adaf1d453b26ef026
I updated from NodeBB 1783a07 to e99d952 during the cooties but they didn't stop until I disabled that function.
Here are some snippets from IRC:
```
13:54 < BenLubar> 29/3 18:53 [39] - warn: [socket.io] slow callback - 1044732ms - uid: [redacted] - ip: [redacted] - event: notifications.get - params: null - err: null
...
14:57 < BenLubar> 1174710ms for notifications.get
```
Odd, if it was a crash from that bug, it would've been fixed in more recent commits.
Ah, I hadn't read the stack traces in that topic very closely. It is indeed a separate bug.
I do not know whether the notifications thing was the problem or merely a symptom, but the continued performance difficulties would tend to indicate that it was only a symptom. For example, I'm currently seeing extremely long load times for small topics. Eventually I either get a 504 Gateway Timeout or the topic loads; which one happens is apparently arbitrary.
Hunting performance problems can be hard. The only way to do it is to keep on improving the instrumentation you're applying to service calls in the hope of catching the trouble red-handed.
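As an illustration of that kind of instrumentation (a hedged sketch only; `wrapWithTiming` and its threshold are invented here for the example, not NodeBB's actual code), wrapping a callback-style socket.io handler with a timer is roughly what produces `slow callback` warnings like the ones quoted above:

```javascript
// Illustrative sketch, not NodeBB's actual instrumentation.
// Wrap a callback-style handler so we log a warning whenever its
// callback takes longer than a threshold to be invoked.
function wrapWithTiming(eventName, handler, thresholdMs) {
  return function (params, callback) {
    const start = Date.now();
    handler(params, function (err, result) {
      const elapsed = Date.now() - start;
      if (elapsed > thresholdMs) {
        console.warn('[socket.io] slow callback - ' + elapsed +
          'ms - event: ' + eventName);
      }
      callback(err, result);
    });
  };
}

// Example: a handler that simulates a slow notifications fetch.
const slowHandler = function (params, callback) {
  setTimeout(function () { callback(null, []); }, 50);
};

const wrapped = wrapWithTiming('notifications.get', slowHandler, 10);
wrapped(null, function (err, result) {
  console.log('done, err=' + err + ', results=' + result.length);
});
```

The win of wrapping at the dispatch layer is that every service call gets timed uniformly, so the slow one identifies itself in the logs instead of having to be guessed at.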
Do we know if the storm affected all 4 NodeBB instances, or just specific ones? If it affected all 4 instances simultaneously, that would tend to suggest a single point of failure, like MongoDB.
If the slowdown was in the Node backend, I'd expect individual instances to suffer while other instances are still OK.
Do we have any way of profiling the database and the node instances to see what's going on?
It'd be useful to profile the hosts to see if CPU, Memory, Disk access or Network bandwidth is being saturated.
Search may also be a factor. I remember it being very slow yesterday and at least one user reported cootie storms when he started searching.
Is everything on one t2.medium? It may be a better idea to put the database elsewhere if it isn't already.
Timetable on enabling notification dropdown?
Is there an alternative?
It looks like we need more disk IO operations per second allocated.
ouch, yeah that disk queue length is.... off the charts. ideally you want that sucker solidly under 1.0, we seem to be north of 10 regularly.... no wonder we have issues!
@llouviere alternative for notifications?
i think i can come up with something...
Forums seem to be completely offline. SSL error over HTTPS and 404 over HTTP. Maintenance, I assume?
I stopped the AWS instance so I could snapshot the disk. Currently about 50% done. I hope the IP won't change when it comes back up.
We could just use GitHub for our forums.
Pretty good reliability here.
=-o
@BenLubar Any chance you can just use nginx to send Googlebot a 503 temporarily?
Edit: Maybe they'll respond to an HTTP 429
I've rate limited anyone with a user-agent matching /(bot|spider|slurp|crawler)/i to 1 request every 10 seconds (per IP) in nginx.
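For reference, the sort of nginx config that implements that per-IP bot limit looks roughly like this (a sketch only; the zone name, zone size, and map variable are illustrative, not the exact config in use):

```nginx
# Classify bot-ish user agents. $bot_limit_key is empty for humans,
# and nginx skips rate limiting entirely when the key is empty.
map $http_user_agent $bot_limit_key {
    default                       "";
    ~*(bot|spider|slurp|crawler)  $binary_remote_addr;
}

# 1 request every 10 seconds per IP = 6 requests per minute.
limit_req_zone $bot_limit_key zone=bots:10m rate=6r/m;

server {
    location / {
        limit_req zone=bots burst=1 nodelay;
        # ... proxy_pass to the NodeBB backend ...
    }
}
```

With `limit_req`, over-limit requests get a 503 by default (configurable via `limit_req_status`), which also lines up with the earlier suggestion of sending Googlebot a 503.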
Looks like the rate limiting handled the bots and the better disk handled the large number of humans.
And another one; this one has been about an hour long so far (and started about 5 minutes before I attempted to use the site).
Does this need opening as a new issue so it gets noticed? I think it was possibly a bit optimistic closing it in the first place.
The site seems to have been down for quite some time now.
(cc @BenLubar)
I just woke up and the site seems fine right now. Here's a graph:
so.... what happened between 0230 and 0440 then?
Could it be backups? Maintenance plans (log file purging, etc.)? Software updates?
could be any of them. i asked because i don't know what it was and want to.
:-P
It hit again this morning (and is still ongoing as I write this) commencing at about 07:50 GMT. The nginx front-end seems to be sometimes fast to respond and sometimes not, like there's a resource exhaustion problem, but the back-end is solidly not responsive.
Also, I assume you have read http://redis.io/topics/lru-cache about `maxmemory` tuning? If not, you really need to, as redis defaults to being a memory hog.
`maxmemory` has been set to `100mb` and `maxmemory-policy` to `allkeys-lru` since before this last set of cooties started.
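For anyone else tuning this, those two settings live in redis.conf (or can be set at runtime with `CONFIG SET`); the values below are the ones described above:

```
maxmemory 100mb
maxmemory-policy allkeys-lru
```

With `allkeys-lru`, redis evicts the least-recently-used keys once it hits the memory cap instead of growing without bound.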
@BenLubar output of `ss -s`? Run `mongostat` and see if there are items in the mongo write/read queue...
The site is very down and has been so for a substantial amount of time. This would have been a non-reportable SNAFU under Discourse, but with NodeBB it is indicative of a serious problem.
No idea what is wrong.