Closed by sopel 11 years ago
While investigating, a New Relic alert triggered as well, immediately pointing out the underlying problem:
Fullest disk > 70% Today, 8:56 6 minutes In-progress
The alert triggered for logsearch-ppe-es-ebs-n0; however, logsearch-ppe-es-ephemeral-n0 is equally affected. I can still log into the systems, but haven't identified the presumably offending log yet.
The disk space exhaustion was caused by huge Java heap dumps, e.g. on logsearch-ppe-es-ebs-n0:
/app/app$ ll -h
total 5.1G
drwxr-xr-x 8 ubuntu ubuntu 4.0K Aug 14 06:53 ./
drwxr-xr-x 6 ubuntu ubuntu 4.0K Aug 6 13:16 ../
...
-rw------- 1 ubuntu ubuntu 1.1G Jul 26 17:45 java_pid1543.hprof
-rw------- 1 ubuntu ubuntu 4.1G Aug 14 06:55 java_pid868.hprof
...
Please note that there was a previous dump as well (java_pid1543.hprof from Jul 26), which is not the case for logsearch-ppe-es-ephemeral-n0:
/app/app$ ll -h
total 5.8G
drwxr-xr-x 8 ubuntu ubuntu 4.0K Aug 14 06:54 ./
drwxr-xr-x 6 ubuntu ubuntu 4.0K Aug 5 16:34 ../
...
-rw------- 1 ubuntu ubuntu 5.8G Aug 14 06:56 java_pid859.hprof
...
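Listings like the two above can be gathered quickly with find; a minimal sketch, assuming the dumps live under /app/app as shown (the root and the 1 GiB threshold are adjustable):

```shell
# Sketch: locate Java heap dumps over 1 GiB under a given root.
# DUMP_DIR defaults to /app/app, the path from the listings above.
DUMP_DIR="${DUMP_DIR:-/app/app}"
find "$DUMP_DIR" -name '*.hprof' -size +1G -exec ls -lh {} \; 2>/dev/null
# And check which filesystem is actually filling up:
df -h
```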
Not surprisingly, the CPUs have been pretty much maxed out as well by the respective Java processes.
I've moved these files to /mnt for eventual analysis and rebooted those instances, after which everything seems to be normal again.
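The cleanup step can be sketched roughly as follows; the paths mirror this incident, and the loop is only an assumption about how the move was done:

```shell
# Move heap dumps off the full filesystem onto /mnt for later analysis.
# SRC/DEST default to the paths from this incident; override as needed.
SRC="${SRC:-/app/app}"
DEST="${DEST:-/mnt}"
for f in "$SRC"/java_pid*.hprof; do
  # The literal glob is passed through when nothing matches, hence the test.
  [ -e "$f" ] && mv "$f" "$DEST"/
done
# Confirm the space is actually freed on the affected mounts.
df -h "$SRC" "$DEST" 2>/dev/null || true
```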
@dpb587 - over to you for root cause analysis ;)
To clarify: Kibana suddenly just crashed. The URL http://logsearch.cityindextest5.co.uk/#/dashboard/elasticsearch/ is still in my history and still exhibits said behavior (not sure how I ended up with that URL either, in any case).
Context for the heap dump:
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid868.hprof ...
Heap dump file created [4338445920 bytes in 113.665 secs]
[2013-08-14 06:56:20,119][WARN ][transport ] [Joe Cartelli] Received response for a request that has timed out, sent [71730ms] ago, timed out [41498ms] ago, action [discovery/zen/fd/ping], node [[Shadowcat][gQIa-d_qQ0iDvi1PIqDXkQ][inet[/10.34.184.97:9300]]], id [69566]
[2013-08-14 06:56:20,131][WARN ][transport ] [Joe Cartelli] Received response for a request that has timed out, sent [40025ms] ago, timed out [8238ms] ago, action [discovery/zen/fd/ping], node [[Shadowcat][gQIa-d_qQ0iDvi1PIqDXkQ][inet[/10.34.184.97:9300]]], id [69575]
[2013-08-14 07:00:01,310][WARN ][cluster.action.shard ] [Joe Cartelli] received shard failed for [logstash-2013.08.14][1], node[gQIa-d_qQ0iDvi1PIqDXkQ], [P], s[STARTED], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
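One hedge against a repeat (my suggestion, not something done in this incident): HotSpot's standard heap-dump flags can point the automatic OOM dumps at /mnt, so a future dump lands on the volume with headroom instead of filling /app. A sketch, assuming the flags reach the Elasticsearch JVM via an env var like ES_JAVA_OPTS (the exact wiring depends on how these instances launch it):

```shell
# Standard HotSpot flags; -XX:HeapDumpPath redirects the .hprof files
# written on OutOfMemoryError to the given directory.
ES_JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/mnt"
export ES_JAVA_OPTS
```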
Closed as incomplete: the root cause was not identified, but the potential impact has been reduced via #141.
After looking into #133 and basically just flipping the time filter back and forth on the search @tags:"_grokparsefailure", i.e. zooming in and out of the result set (over a maximum window of 7 days), Kibana suddenly crashed and hasn't come back yet - more details shortly.