csirtgadgets / massive-octo-spice

DEPRECATED - USE v3 (bearded-avenger)
https://github.com/csirtgadgets/bearded-avenger-deploymentkit/wiki

CIF-Router stopping unexpectedly #427

Closed coonsmatthew closed 8 years ago

coonsmatthew commented 8 years ago

Hello,

I'm having issues with what appears to be my elasticsearch database. After several hours of CIF running successfully, when I try to run cif -p I get:

$ cif -p
[2016-05-31T07:47:35,458Z][WARN]: unauthorized
[2016-05-31T07:47:35,459Z][FATAL]: unauthorized

So I tailed the cif-router.log and noticed these lines (I just pasted a few of the many):

[2016-05-31T08:29:40,408Z][2531][ERROR]: IndexFailedEngineException[[cif.observables-2016.05.31][2] Index failed for [observables#6253bb3c59c2fed14b0e253855996b8d50a090ff6399cc36b877bdd5d2db5ff5]]; nested: AlreadyClosedException[this IndexWriter is closed]; nested: OutOfMemoryError[Java heap space];
[2016-05-31T08:29:40,408Z][2531][ERROR]: UnavailableShardsException[[cif.observables-2016.05.31][3] Primary shard is not active or isn't assigned is a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1489755d]
[2016-05-31T08:29:40,408Z][2531][ERROR]: UnavailableShardsException[[cif.observables-2016.05.31][3] Primary shard is not active or isn't assigned is a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1489755d]
[2016-05-31T08:29:40,409Z][2531][ERROR]: UnavailableShardsException[[cif.observables-2016.05.31][3] Primary shard is not active or isn't assigned is a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1489755d]
[2016-05-31T08:29:40,409Z][2531][ERROR]: UnavailableShardsException[[cif.observables-2016.05.31][3] Primary shard is not active or isn't assigned is a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@1489755d]
[2016-05-31T08:29:40,463Z][2531][ERROR]: IndexFailedEngineException[[cif.observables-2016.05.31][2] Index failed for [observables#d148f17ff76d7b61a8065907a05f77ffbe0a648d53b93031360259720a82f08e]]; nested: AlreadyClosedException[this IndexWriter is closed]; nested: OutOfMemoryError[Java heap space];
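For reference, a generic way to confirm what those UnavailableShardsException lines imply is to query the Elasticsearch REST API directly (assuming ES is listening on its default port 9200; this is standard ES usage, not anything CIF-specific):

# overall cluster state; "red" means one or more primary shards are unassigned
curl -s 'http://localhost:9200/_cluster/health?pretty'

# per-shard view of the day's observables index
curl -s 'http://localhost:9200/_cat/shards/cif.observables-2016.05.31?v'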

coonsmatthew commented 8 years ago

Also, stopping cif-services and then elasticsearch services and then re-starting them appears to fix the issue for a few hours.
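For reference, that restart sequence as shell commands, assuming the service names mentioned above from the stock CIF v2 install (adjust for your init system):

sudo service cif-services stop
sudo service elasticsearch restart
sudo service cif-services start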

coonsmatthew commented 8 years ago

I'm noticing this error in CIF-SMRT as well:

[2016-05-31T14:02:29,776Z][30347][ERROR]: [Timeout] ** [http://localhost:9200]-[599] Timed out while waiting for socket to become ready for reading, called from sub Search::Elasticsearch::Transport::try {...} at /usr/share/perl5/Try/Tiny.pm [observable-data]

However, when I check the elasticsearch service, it says it is running.
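A running process is not the same as a responsive one: after an OutOfMemoryError the JVM can sit in garbage collection while the init script still reports the service as running. A quick sanity check, assuming the default ES HTTP port:

# is the process there?
sudo service elasticsearch status

# does it actually answer within a few seconds?
curl -m 5 -s 'http://localhost:9200/' && echo OK || echo 'no response within 5s'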

coonsmatthew commented 8 years ago

I'm going to try restarting the server and see if that helps at all.

elasticsearch log says:

[2016-05-31 12:57:38,855][DEBUG][action.bulk ] [Angela Del Toro] [cif.observables-2016.05.31][3] failed to execute bulk item (index) index {[cif.observables-2016.05.31][observables]}
org.elasticsearch.index.engine.IndexFailedEngineException: [cif.observables-2016.05.31][3] Index failed for [observables#number]
    at org.elasticsearch.index.engine.internal.InternalEngine.index(InternalEngine.java:530)
    at org.elasticsearch.index.shard.service.InternalIndexShard.index(InternalIndexShard.java:458)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:427)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:149)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:512)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:419)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:700)
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:714)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1522)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1252)
    at org.elasticsearch.index.engine.internal.InternalEngine.innerIndex(InternalEngine.java:594)
    at org.elasticsearch.index.engine.internal.InternalEngine.index(InternalEngine.java:522)
    ... 8 more
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.index.ParallelPostingsArray.<init>(ParallelPostingsArray.java:35)
    at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.<init>(FreqProxTermsWriterPerField.java:210)
    at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.newInstance(FreqProxTermsWriterPerField.java:235)
    at org.apache.lucene.index.ParallelPostingsArray.grow(ParallelPostingsArray.java:48)
    at org.apache.lucene.index.TermsHashPerField$PostingsBytesStartArray.grow(TermsHashPerField.java:252)
    at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:292)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:151)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:663)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:359)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:318)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:241)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:465)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1526)
    ... 11 more

giovino commented 8 years ago

Matt,

Are you putting so much data into this that you are hitting some natural memory limits, or do you believe there is a memory leak? We are going to need to collect more data. I believe the logs show an OutOfMemoryError; the question is why.

coonsmatthew commented 8 years ago

Gabe,

We're manually submitting probably 1-2 observables a day ourselves. I just modified our CIF configuration so it is only pulling in the default feeds it ships with; any additional feeds I added have been disabled.

We will see if I notice any issues over the evening.

Also, when I check the free memory on the server, I have plenty and I am not swapping at all, so I'm not sure why Elasticsearch thinks it is out of memory. I just checked the server performance and I'm only using 6 GB of RAM (out of 16), which is super weird, because usually Elasticsearch seems to use almost all of the available memory.
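Worth noting: the OutOfMemoryError is against the JVM heap, which is capped by ES_HEAP_SIZE (1g by default with this packaging), so the OS can show plenty of free RAM while the ES heap is completely full. A generic way to look at the heap itself, assuming the default port:

# current and max JVM heap per ES node
curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty' | grep -E 'heap_used_in_bytes|heap_used_percent|heap_max_in_bytes'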

Thank You.

giovino commented 8 years ago
  1. I just stood up a brand new instance on aws with 16GB of ram with all the default feeds enabled in rc15 to try to duplicate this issue or other potential memory leaks. I'll let this run for a week or so to see if I can duplicate any issues.
  2. We purposefully do not do much Elasticsearch configuration as it is hard without knowing what the machine resources look like. If you've done zero Elasticsearch tuning then ES should be configured to use the default 1G of memory (ES_HEAP_SIZE). You can see that with:
head /etc/default/elasticsearch 
# Run Elasticsearch as this user ID and group ID
#ES_USER=elasticsearch
#ES_GROUP=elasticsearch

# Heap Size (defaults to 256m min, 1g max)
#ES_HEAP_SIZE=2g
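If the heap does need to grow, the knob is ES_HEAP_SIZE in that same file; a minimal sketch, assuming the Debian/Ubuntu packaging shown above and the usual guidance of giving ES no more than about half the machine's RAM:

# /etc/default/elasticsearch -- uncomment and size the heap, e.g. half of a 16GB box
ES_HEAP_SIZE=8g

# restart so the new heap size takes effect
sudo service elasticsearch restart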

If/when you start to get into ES tuning, I found these resources helpful:

http://edgeofsanity.net/article/2012/12/26/elasticsearch-for-logging.html
http://www.wilfred.me.uk/blog/2015/01/31/taming-a-wild-elasticsearch-cluster

wesyoung commented 8 years ago

additionally, my guess is the additional feeds are what's chewing up the mem; along with tuning, you might have to also boost the ram to 32g (just a hunch) and/or start thinking about breaking things up (ie: sep es instances into a 3-node env, etc). we have some helper doc in the wings we can try and guide you through, but i'd start with doing some tuning as Gabe suggested, then adding more memory for those additional feeds, then thinking through what a 3-node setup might look like...
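For the 3-node idea, the Elasticsearch side of it is mostly clustering config; a rough sketch of the relevant elasticsearch.yml lines per node, assuming ES 1.x-style zen discovery (node names and addresses here are made-up placeholders):

# elasticsearch.yml on each of the three nodes
cluster.name: cif
node.name: "cif-es-1"                       # unique per node
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
discovery.zen.minimum_master_nodes: 2       # quorum for 3 master-eligible nodes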

coonsmatthew commented 8 years ago

Ok, thank you for both of your comments.

The ES errors seem to have self-healed after I turned off the additional feeds I was pulling in. I will look into modifying the memory on our server.

I checked the cif-smrt log and don't see any 500 errors for the last data pull. Additionally, I checked our ES config, and it does appear to be set to the defaults, so that is good.
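A generic way to double-check the heap the running process actually got, rather than just the config file, is to look at the JVM flags it was started with (assuming the stock packaging where ES runs as a java process):

# show the -Xms/-Xmx flags on the running Elasticsearch JVM
ps -o args= -C java | tr ' ' '\n' | grep -E '^-Xm[sx]'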

I will watch for any additional errors going forward.

wesyoung commented 8 years ago

yea, out of the box you can do 16g, but with additional "heavy" feeds, you can very quickly get to 32g. past that, you're looking to split into 3-node (4-8G each?) clusters.. they can be smaller because of the way ES works, but you're also splitting out cif-router which also uses some memory (right now, our production is 3 es clusters and 2 cif-routers which are 8G each, if that gives you some idea).