cityindex-attic / logsearch

[unmaintained] A development environment for ELK
Apache License 2.0

Fix partial service disruption #173

Closed - sopel closed this issue 11 years ago

sopel commented 11 years ago

There's an ongoing incident with quite some details to analyze/summarize - this issue only serves as an anchor for the communication going on elsewhere and will be updated with post mortem information.

dpb587 commented 11 years ago

I think the main issue we were having was insufficient RAM and trying to search too much data - @sopel noticed high Garbage Collection CPU usage via New Relic during the problematic times.
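
A quick way to double-check that from the box itself (rather than New Relic) is the JDK's jstat, and roughly the same numbers are exposed over HTTP via the node stats API. This is only a sketch - replace <es-pid> with the elasticsearch java process id, and note that the node stats path/flags vary between elasticsearch versions:

$ # S0/S1/E/O are survivor/eden/old gen occupancy %, FGC/FGCT are full GC count/time; sample every 5s, 10 times
$ jstat -gcutil <es-pid> 5000 10
$ # similar JVM/GC numbers over HTTP (assuming the 0.90-era flag style)
$ curl 'localhost:9200/_nodes/stats?jvm=true&pretty'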

The first issues started happening at 2013-09-09 03:53:03:

java.lang.OutOfMemoryError: Java heap space
Dumping heap to /app/tmp/heap-dump/java_pid814.hprof ...
Heap dump file created [3739749765 bytes in 76.798 secs]
[2013-09-09 03:53:03,029][WARN ][transport                ] [Midnight Man] Received response for a request that has timed out, sent [51556ms] ago, timed out [21487ms] ago, action [discovery/zen/fd/masterPing], node [[Threnody][0vpntxT4QJu9qQ3j6GEqkA][inet[/10.239.91.215:9300]]], id [47333]
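
For reference, that "Dumping heap to ..." line is standard HotSpot behaviour when the heap-dump-on-OOM flags are set; something along these lines in the JVM options produces the java_pid814.hprof file (the ES_JAVA_OPTS wiring is an assumption about how our startup scripts pass extra flags through):

$ # picked up by bin/elasticsearch.in.sh and appended to JAVA_OPTS
$ export ES_JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/app/tmp/heap-dump"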

The heap dump is quite large and is still sitting on the ephemeral /mnt in case we're curious:

$ ls -al /app/tmp/heap-dump/
total 3655680
drwxr-xr-x 2 ubuntu ubuntu       4096 Sep  9 03:50 .
drwxr-xr-x 5 root   root         4096 Aug 27 17:34 ..
-rw------- 1 ubuntu ubuntu 3739749765 Sep  9 03:52 java_pid814.hprof
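
If we want to dig into what was actually filling the heap before the ephemeral disk gets recycled, the dump can be copied off and opened in Eclipse MAT, or browsed with the JDK's own jhat on a box with enough memory. A sketch - jhat needs a heap larger than the ~3.5 GB dump itself, and serves its results on port 7000 by default:

$ # index the dump, then browse http://localhost:7000/ for histograms / instance counts
$ jhat -J-Xmx6g /app/tmp/heap-dump/java_pid814.hprof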

The EBS node also produced a heap dump at approximately the same time, although it mostly recovered, had no further errors, and continued to receive new log events. The ephemeral node, however, continued to complain and throw errors, leaving the kibana frontend mostly unusable.

Eventually I restarted the elasticsearch services and they came back up successfully. I wanted to push the nodes, so I used kibana to slowly zoom out, showing more data. The ephemeral node started showing java.lang.OutOfMemoryError after getting past ~100m events. After that confirmed failure I just ended up restarting the instances completely and they came back up successfully.

I think the most appropriate solution is to use instances with a bit more memory. I have little professional experience to back this up, but it seems fairly logical that we need more than ~3.5 GB RAM for ~100m documents. Whatever the case, tomorrow we should re-create the cluster with elasticsearch instances that have more memory. The other option is to scale horizontally and add more elasticsearch nodes to spread the search load.
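
If we do go with bigger instances, the elasticsearch heap itself also has to be raised or we'd still sit at the current limit. A sketch assuming the stock packaging (ES_HEAP_SIZE read by bin/elasticsearch.in.sh, configured in /etc/default/elasticsearch) and an instance with ~7.5 GB RAM, giving roughly half of it to the heap:

# in /etc/default/elasticsearch (picked up by the init script / bin/elasticsearch.in.sh)
ES_HEAP_SIZE=4g

$ sudo service elasticsearch restart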

In the meantime, while researching, I learned about the hot threads functionality. It's supposed to work against the entire cluster via /_nodes/hot_threads, but I think the transport layer was still in failure mode and the call just hung. Here's a snippet of the sample from a single node:

$ curl 10.228.175.48:9200/_nodes/-TyfpmNNQQ6FqnAnlNLxPQ/hot_threads?pretty
::: [Vivisector][-TyfpmNNQQ6FqnAnlNLxPQ][inet[ip-10-228-175-48.eu-west-1.compute.internal/10.228.175.48:9300]]

    8.3% (41.6ms out of 500ms) cpu usage by thread 'elasticsearch[Vivisector][search][T#29]'
     5/10 snapshots sharing following 21 elements
       org.elasticsearch.search.facet.terms.strings.TermsStringOrdinalsFacetExecutor$Collector.collect(TermsStringOrdinalsFacetExecutor.java:220)
       org.elasticsearch.common.lucene.search.FilteredCollector.collect(FilteredCollector.java:62)
       org.elasticsearch.common.lucene.MultiCollector.collect(MultiCollector.java:60)
       org.apache.lucene.search.Scorer.score(Scorer.java:65)
       org.apache.lucene.search.ConstantScoreQuery$ConstantScorer.score(ConstantScoreQuery.java:245)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:612)
       org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:161)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:482)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:438)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:281)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:269)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:134)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:239)
       org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:141)
       org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.sendExecuteFirstPhase(TransportSearchQueryThenFetchAction.java:80)
       org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:206)
       org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:193)
       org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$2.run(TransportSearchTypeAction.java:179)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:724)
     5/10 snapshots sharing following 15 elements
       org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:161)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:482)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:438)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:281)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:269)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:134)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:239)
       org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:141)
       org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.sendExecuteFirstPhase(TransportSearchQueryThenFetchAction.java:80)
       org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:206)
       org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:193)
       org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$2.run(TransportSearchTypeAction.java:179)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:724)

...snip...
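
For next time, the hot threads endpoint also takes a few parameters to widen the sample - threads, interval and type (cpu, wait or block) - which should make it easier to compare nodes; I'm assuming all of these are supported on the version we're running:

$ # sample the top 5 busiest threads over a 1s interval, ranked by cpu time
$ curl '10.228.175.48:9200/_nodes/hot_threads?threads=5&interval=1s&type=cpu'
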
sopel commented 11 years ago

Thanks for the detailed analysis - the discussion in https://github.com/cityindex/logsearch-on-aws/issues/14 indeed yielded the decision to scale up to the next instance type for now (effectively doubling the amount of available memory) and to look into scaling out by splitting the shards further down the road.
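
For the scale-out option: shard count can't be changed on existing indices, but for new logstash-style daily indices it can be bumped up front with an index template. A minimal sketch - the template name, index pattern and shard count below are illustrative, not what we currently deploy:

$ curl -XPUT '10.228.175.48:9200/_template/logstash_shards' -d '{
  "template": "logstash-*",
  "settings": {
    "index.number_of_shards": 4
  }
}'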