Closed by sopel 11 years ago
While investigating, a New Relic alert triggered as well, immediately pointing out the underlying problem:
Fullest disk > 70% Today, 8:56 6 minutes In-progress
The alert triggered for logsearch-ppe-es-ebs-n0; however, logsearch-ppe-es-ephemeral-n0 is equally affected. I can still log into the systems, but haven't identified the presumably offending log yet.
The disk space exhaustion was caused by huge Java heap dumps, e.g. on logsearch-ppe-es-ebs-n0:
/app/app$ ll -h
total 5.1G
drwxr-xr-x 8 ubuntu ubuntu 4.0K Aug 14 06:53 ./
drwxr-xr-x 6 ubuntu ubuntu 4.0K Aug 6 13:16 ../
...
-rw------- 1 ubuntu ubuntu 1.1G Jul 26 17:45 java_pid1543.hprof
-rw------- 1 ubuntu ubuntu 4.1G Aug 14 06:55 java_pid868.hprof
...
Please note that there was a previous dump as well (java_pid1543.hprof from Jul 26), which is not the case for logsearch-ppe-es-ephemeral-n0:
/app/app$ ll -h
total 5.8G
drwxr-xr-x 8 ubuntu ubuntu 4.0K Aug 14 06:54 ./
drwxr-xr-x 6 ubuntu ubuntu 4.0K Aug 5 16:34 ../
...
-rw------- 1 ubuntu ubuntu 5.8G Aug 14 06:56 java_pid859.hprof
...
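Listings like the two above can be gathered quickly with find; a minimal sketch, assuming the dumps live under /app/app as shown (the root and the 1 GiB threshold are adjustable):

```shell
# Sketch: locate Java heap dumps over 1 GiB under a given root.
# DUMP_DIR defaults to /app/app, the path from the listings above.
DUMP_DIR="${DUMP_DIR:-/app/app}"
find "$DUMP_DIR" -name '*.hprof' -size +1G -exec ls -lh {} \; 2>/dev/null
# And check which filesystem is actually filling up:
df -h
```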
Not surprisingly, the CPUs have been pretty much maxed out as well by the respective Java processes.
I've moved these files to /mnt for eventual analysis and rebooted those instances, after which everything seems to be normal again.
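The cleanup step can be sketched roughly as follows; the paths mirror this incident, and the loop is only an assumption about how the move was done:

```shell
# Move heap dumps off the full filesystem onto /mnt for later analysis.
# SRC/DEST default to the paths from this incident; override as needed.
SRC="${SRC:-/app/app}"
DEST="${DEST:-/mnt}"
for f in "$SRC"/java_pid*.hprof; do
  # The literal glob is passed through when nothing matches, hence the test.
  [ -e "$f" ] && mv "$f" "$DEST"/
done
# Confirm the space is actually freed on the affected mounts.
df -h "$SRC" "$DEST" 2>/dev/null || true
```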
@dpb587 - over to you for root cause analysis ;)
To clarify: Kibana suddenly just crashed. The URL http://logsearch.cityindextest5.co.uk/#/dashboard/elasticsearch/ is still in my history and still exhibits said behavior (not sure how I ended up with that URL either, in any case).
Context for the heap dump:
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid868.hprof ...
Heap dump file created [4338445920 bytes in 113.665 secs]
[2013-08-14 06:56:20,119][WARN ][transport ] [Joe Cartelli] Received response for a request that has timed out, sent [71730ms] ago, timed out [41498ms] ago, action [discovery/zen/fd/ping], node [[Shadowcat][gQIa-d_qQ0iDvi1PIqDXkQ][inet[/10.34.184.97:9300]]], id [69566]
[2013-08-14 06:56:20,131][WARN ][transport ] [Joe Cartelli] Received response for a request that has timed out, sent [40025ms] ago, timed out [8238ms] ago, action [discovery/zen/fd/ping], node [[Shadowcat][gQIa-d_qQ0iDvi1PIqDXkQ][inet[/10.34.184.97:9300]]], id [69575]
[2013-08-14 07:00:01,310][WARN ][cluster.action.shard ] [Joe Cartelli] received shard failed for [logstash-2013.08.14][1], node[gQIa-d_qQ0iDvi1PIqDXkQ], [P], s[STARTED], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
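One hedge against a repeat (my suggestion, not something done in this incident): HotSpot's standard heap-dump flags can point the automatic OOM dumps at /mnt, so a future dump lands on the volume with headroom instead of filling /app. A sketch, assuming the flags reach the Elasticsearch JVM via an env var like ES_JAVA_OPTS (the exact wiring depends on how these instances launch it):

```shell
# Standard HotSpot flags; -XX:HeapDumpPath redirects the .hprof files
# written on OutOfMemoryError to the given directory.
ES_JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/mnt"
export ES_JAVA_OPTS
```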
Closed as incomplete: the root cause was not identified, but the potential impact has been reduced via #141.
After looking into #133 and basically just flipping the time filter back and forth on the search @tags:"_grokparsefailure", i.e. zooming in and out of the result set (over a maximum window of 7 days), Kibana suddenly crashed and hasn't come back yet - more details shortly.