TheHive-Project / Cortex

Cortex: a Powerful Observable Analysis and Active Response Engine
https://strangebee.com/cortex/
GNU Affero General Public License v3.0

Cortex - Excessive resource usage leading to crash #214

Open cameronkinsel opened 5 years ago

cameronkinsel commented 5 years ago

Hello all!

We have an Ubuntu 16.04 headless server which has been running TheHive/Cortex for over a year now with the following specs: 4 CPU cores, 16 GB RAM, 400 GB storage

Software:

ckinsel@hiveapp1:~$ sudo dpkg -l | grep thehive
[sudo] password for ckinsel: 
ii  thehive                               3.3.1-1                                    all          Scalable, Open Source and Free Security Incident Response Solutions
ckinsel@hiveapp1:~$ sudo dpkg -l | grep cortex
ii  cortex                                2.1.3-1                                    all          Powerful Observable Analysis Engine
ckinsel@hiveapp1:~$ sudo dpkg -l | grep elasticsearch
hi  elasticsearch                         5.6.16                                     all          Elasticsearch is a distributed RESTful search engine built for the cloud. Reference documentation can be found at https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html and the 'Elasticsearch: The Definitive Guide' book can be found at https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html

Problem: Roughly once every 48 hours, cortex.service will suddenly consume all resources on the server (see screenshot). This causes thehive.service and cortex.service to report failure in systemctl, and the web app goes down. Additionally, 'sudo systemctl restart cortex.service' hangs, and the VM has to be reset to get it out of the stuck state.

Evidence: TheHive and Cortex logs (/var/log/thehive/application.log and /var/log/cortex/application.log) both contain many Java OutOfMemoryError entries.

TheHive:

2019-08-10 12:20:41,983 [ERROR] from akka.actor.ActorSystemImpl in application-akka.actor.default-dispatcher-5 - Uncaught error from thread [application-akka.actor.default-dispatcher-20]: GC overhead limit exceeded, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[application]

java.lang.OutOfMemoryError: GC overhead limit exceeded

Cortex:

2019-08-10 08:11:15,922 [WARN] from org.elasticsearch.transport.TransportService in elasticsearch[_client_][transport_client_boss][T#2] - Received response for a request that has timed out, sent [47381ms] ago, timed out [26291ms] ago, action [cluster:monitor/nodes/liveness], node [{#transport#-1}{OP-6tmJnT--3isBtUjB4NA}{127.0.0.1}{127.0.0.1:9300}], id [39064]

2019-08-10 08:13:30,562 [INFO] from play.core.server.AkkaHttpServer in Thread-6 - Stopping server...

2019-08-10 08:14:16,555 [ERROR] from akka.actor.OneForOneStrategy in application-akka.actor.default-dispatcher-9 - head of empty list

java.util.NoSuchElementException: head of empty list

2019-08-10 08:51:53,099 [ERROR] from akka.actor.ActorSystemImpl in application-akka.actor.default-dispatcher-8 - Uncaught error from thread [application-akka.actor.default-dispatcher-9]: Java heap space, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[application]

java.lang.OutOfMemoryError: Java heap space

Systemctl:

ckinsel@hiveapp1:~$ sudo systemctl status thehive.service
[sudo] password for ckinsel: 
   thehive.service - TheHive
   Loaded: loaded (/usr/lib/systemd/system/thehive.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sat 2019-08-10 12:20:49 EDT; 1 day 20h ago
     Docs: https://thehive-project.org
  Process: 1303 ExecStart=/opt/thehive/bin/thehive -Dconfig.file=/etc/thehive/application.conf -Dlogger.file=/etc/thehive/logback.xml -Dpidfile.path=/dev/null (code=exited, status=255)
 Main PID: 1303 (code=exited, status=255)

Aug 09 08:36:20 hiveapp1 systemd[1]: Started TheHive.
Aug 10 12:20:49 hiveapp1 systemd[1]: thehive.service: Main process exited, code=exited, status=255/n/a
Aug 10 12:20:49 hiveapp1 systemd[1]: thehive.service: Unit entered failed state.
Aug 10 12:20:49 hiveapp1 systemd[1]: thehive.service: Failed with result 'exit-code'.

ckinsel@hiveapp1:~$ sudo systemctl status cortex.service
  cortex.service - cortex
   Loaded: loaded (/etc/systemd/system/cortex.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2019-08-09 08:36:20 EDT; 3 days ago
     Docs: https://thehive-project.org
 Main PID: 1311 (java)
    Tasks: 30
   Memory: 3.4G
      CPU: 22h 54min 58.281s
   CGroup: /system.slice/cortex.service
           └─1311 java -Duser.dir=/opt/cortex -Dconfig.file=/etc/cortex/application.conf -Dlogger.file=/etc/cortex/logback.xml -Dpidfile.path=/dev/null -cp /opt/cortex/lib/../conf/:/opt/cortex/lib/org.thehive-project.cortex-2.1.3-1-sans-externalized.jar:/opt/cortex/lib/

Aug 09 08:36:20 hiveapp1 systemd[1]: Started cortex.

Journalctl:

Aug 10 12:20:49 hiveapp1 systemd[1]: thehive.service: Main process exited, code=exited, status=255/n/a
Aug 10 12:20:49 hiveapp1 systemd[1]: thehive.service: Unit entered failed state.
Aug 10 12:20:49 hiveapp1 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=thehive comm="systemd" exe="/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Aug 10 12:20:49 hiveapp1 systemd[1]: thehive.service: Failed with result 'exit-code'.
Aug 10 12:20:49 hiveapp1 kernel: audit_printk_skb: 57 callbacks suppressed
Aug 10 12:20:49 hiveapp1 kernel: audit: type=1131 audit(1565454049.464:6787): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=thehive comm="systemd" exe="/lib/s

Recovery/Support: After rebooting the VM, all services start with no issues, CPU rests below 3%, and RAM stays below 4 GB. This sounds like a potential memory leak, but I have not noticed memory steadily climbing, and I don't think that explains the 100% CPU utilization. I'd consider our implementation pretty standard, but I haven't found others reporting the same issue here or on GitHub. I cannot identify when exactly the issue started, as we have not (knowingly) applied changes to the server that would cause this behavior, and it has been ongoing for roughly a week.

At this point I'm not exactly sure where to look next for more logs, or for a source for the issue. If anyone has any ideas, please let me know!

cameronkinsel commented 5 years ago

[Screenshot: Screen Shot 2019-08-06 at 1 16 48 PM]

ITServ-DE commented 5 years ago

On my Hive/Cortex/MISP server, I only have three running Java processes: one for TheHive, one for Cortex, one for Elasticsearch.

If I read your screenshot right, you have multiple Cortex/Hive instances running simultaneously. For example, I can quickly identify several Cortex processes: PIDs 1270, 1805, 1806, 1808 and 1817.

That does not seem right, and that would be the first thing I would look at.

I would run systemctl disable thehive; systemctl disable cortex to disable automatic startup, and reboot the server. When the host comes up again, I'd make sure no thehive/cortex process is running.

Then, I would manually start cortex and thehive with /etc/init.d/thehive start; /etc/init.d/cortex start and check that only one instance (= java process) is running for each of them.

That would be my starting point.

From that point, I would start tracking the number of Java processes on the system. Maybe some cron job tries to restart thehive/cortex but accidentally only starts new processes, which over time eat up your memory.
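A minimal sketch of such a tracker (the log path and interval are arbitrary examples; assumes procps' pgrep is available):

#!/bin/bash
# Record a timestamp and the number of running java processes once a minute.
LOG=/var/tmp/java-count.log
while true; do
    printf '%s %s\n' "$(date +%FT%T)" "$(pgrep -c java)" >> "$LOG"
    sleep 60
done

If the count ever climbs above three (TheHive, Cortex, Elasticsearch), something is spawning extra instances.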

ITServ-DE commented 5 years ago

For completeness, here is what runs on my host:

root@hive:~# ps -eo "pid user args" | grep java | grep -v grep| cut -c1-120            
 9514 thehive  java -Duser.dir=/opt/thehive -Dconfig.file=/etc/thehive/application.conf -Dlogger.file=/etc/thehive/logba                                                                                           
31875 cortex   java -Duser.dir=/opt/cortex -Dconfig.file=/etc/cortex/application.conf -Dlogger.file=/etc/cortex/logback.                                                                                           
32418 elastic+ /usr/bin/java -Xms2g -Xmx2g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInit                                                                                           

Nice and clean, one java instance for each of them.

cameronkinsel commented 5 years ago

@github-pba Thanks for the reply!

Yesterday we ran the server with cortex.service stopped, and came in this morning to no failures. If it is something like what you mentioned, where multiple instances are being executed, I believe cortex.service is the culprit.

The output I had generated before was using htop. When I run your 'ps -eo' command I get the same result as you:

1348 elastic+ /usr/bin/java -Xms2g -Xmx2g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInit
25670 cortex   java -Duser.dir=/opt/cortex -Dconfig.file=/etc/cortex/application.conf -Dlogger.file=/etc/cortex/logback.
28670 thehive  java -Duser.dir=/opt/thehive -Dconfig.file=/etc/thehive/application.conf -Dlogger.file=/etc/thehive/logba

However, at the time I executed the above, I was not seeing the resource issue. I will check throughout today, and if we notice the excessive resource usage, I'll run it again and compare here.

ITServ-DE commented 5 years ago

You could monitor your resource consumption by periodically calling ps -eo "vsz rss args" and sorting the output. vsz is the virtual process size; rss is the resident set size, i.e. how much memory is physically allocated.

By tracking these, you should be able to determine if a process grows.
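For example (a sketch; the sort key and row count are arbitrary choices, and --sort needs GNU procps ps):

# Show the ten largest processes by resident set size.
ps -eo vsz,rss,args --sort=-rss | head -n 10

Run it from a watch loop or cron to get values over time.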

cameronkinsel commented 5 years ago

With more digging, it seems my understanding of htop is flawed, because even now (when I'm not having an issue) I'm seeing many unique PIDs for Java, but when I use 'top' or 'ps -ef' I'm only seeing one each for cortex/elastic/thehive.

Interestingly, the 'command' for cortex/thehive reported by 'ps' is insanely long. The below is one PID:

ckinsel@hiveapp1:~$ ps -ef | grep java

cortex   25670     1  3 07:56 ?        00:00:39 java -Duser.dir=/opt/cortex -Dconfig.file=/etc/cortex/application.conf -Dlogger.file=/etc/cortex/logback.xml -Dpidfile.path=/dev/null -cp /opt/cortex/lib/../conf/:/opt/cortex/lib/org.thehive-project.cortex-2.1.3-1-sans-externalized.jar:/opt/cortex/lib/org.scala-lang.scala-library-2.12.7.jar:/opt/cortex/lib/com.typesafe.play.twirl-api_2.12-1.3.15.jar:/opt/cortex/lib/org.scala-lang.modules.scala-xml_2.12-1.0.6.jar:/opt/cortex/lib/com.typesafe.play.play-server_2.12-2.6.20.jar:/opt/cortex/lib/com.typesafe.play.play_2.12-2.6.20.jar:/opt/cortex/lib/com.typesafe.play.build-link-2.6.20.jar:/opt/cortex/lib/com.typesafe.play.play-exceptions-2.6.20.jar:/opt/cortex/lib/com.typesafe.play.play-netty-utils-2.6.20.jar:/opt/cortex/lib/org.slf4j.slf4j-api-1.7.25.jar:/opt/cortex/lib/org.slf4j.jul-to-slf4j-1.7.25.jar:/opt/cortex/lib/org.slf4j.jcl-over-slf4j-1.7.25.jar:/opt/cortex/lib/com.typesafe.play.play-streams_2.12-2.6.20.jar:/opt/cortex/lib/org.reactivestreams.reactive-streams-1.0.2.jar:/opt/cortex/lib/com.typesafe.akka.akka-stream_2.12-2.5.17.jar:/opt/cortex/lib/com.typesafe.akka.akka-actor_2.12-2.5.17.jar:/opt/cortex/lib/com.typesafe.config-1.3.3.jar:/opt/cortex/lib/org.scala-lang.modules.scala-java8-compat_2.12-0.8.0.jar:/opt/cortex/lib/com.typesafe.akka.akka-protobuf_2.12-2.5.17.jar:/opt/cortex/lib/com.typesafe.ssl-config-core_2.12-0.2.4.jar:/opt/cortex/lib/org.scala-lang.modules.scala-parser-combinators_2.12-1.0.6.jar:/opt/cortex/lib/com.typesafe.akka.akka-slf4j_2.12-2.5.17.jar:/opt/cortex/lib/com.fasterxml.jackson.core.jackson-core-2.8.11.jar:/opt/cortex/lib/com.fasterxml.jackson.core.jackson-annotations-2.8.11.jar:/opt/cortex/lib/com.fasterxml.jackson.datatype.jackson-datatype-jdk8-2.8.11.jar:/opt/cortex/lib/com.fasterxml.jackson.core.jackson-databind-2.8.11.1.jar:/opt/cortex/lib/com.fasterxml.jackson.datatype.jackson-datatype-jsr310-2.8.11.jar:/opt/cortex/lib/commons-codec.commons-codec-1.10.jar:/opt/cortex/lib/com.typesafe.play.play-json_2.12-2.6.10.jar:/opt/cortex/lib/com.typesafe.play.play-functional_2.12-2.6.10.jar:/opt/cortex/lib/org.scala-lang.scala-reflect-2.12.7.jar:/opt/cortex/lib/org.typelevel.macro-compat_2.12-1.1.1.jar:/opt/cortex/lib/joda-time.joda-time-2.9.9.jar:/opt/cortex/lib/com.google.guava.guava-22.0.jar:/opt/cortex/lib/com.google.errorprone.error_prone_annotations-2.0.18.jar:/opt/cortex/lib/com.google.j2objc.j2objc-annotations-1.1.jar:/opt/cortex/lib/org.codehaus.mojo.animal-sniffer-annotations-1.14.jar:/opt/cortex/lib/io.jsonwebtoken.jjwt-0.7.0.jar:/opt/cortex/lib/javax.xml.bind.jaxb-api-2.3.0.jar:/opt/cortex/lib/org.apache.commons.commons-lang3-3.6.jar:/opt/cortex/lib/javax.transaction.jta-1.1.jar:/opt/cortex/lib/javax.inject.javax.inject-1.jar:/opt/cortex/lib/com.typesafe.play.filters-helpers_2.12-2.6.20.jar:/opt/cortex/lib/com.typesafe.play.play-logback_2.12-2.6.20.jar:/opt/cortex/lib/ch.qos.logback.logback-classic-1.2.3.jar:/opt/cortex/lib/ch.qos.logback.logback-core-1.2.3.jar:/opt/cortex/lib/com.typesafe.play.play-akka-http-server_2.12-2.6.20.jar:/opt/cortex/lib/com.typesafe.akka.akka-http-core_2.12-10.0.14.jar:/opt/cortex/lib/com.typesafe.akka.akka-parsing_2.12-10.0.14.jar:/opt/cortex/lib/org.apache.logging.log4j.log4j-to-slf4j-2.9.1.jar:/opt/cortex/lib/com.typesafe.play.play-ehcache_2.12-2.6.20.jar:/opt/cortex/lib/com.typesafe.play.play-cache_2.12-2.6.20.jar:/opt/cortex/lib/net.sf.ehcache.ehcache-2.10.4.jar:/opt/cortex/lib/org.ehcache.jcache-1.0.1.jar:/opt/cortex/lib/javax.cache.cache-api-1.0.0.jar:/opt/cortex/lib/com.type
safe.play.play-ws_2.12-2.6.20.jar:/opt/cortex/lib/com.typesafe.play.play-ws-standalone_2.12-1.1.10.jar:/opt/cortex/lib/com.typesafe.play.play-ws-standalone-xml_2.12-1.1.10.jar:/opt/cortex/lib/com.typesafe.play.play-ws-standalone-json_2.12-1.1.10.jar:/opt/cortex/lib/com.typesafe.play.play-guice_2.12-2.6.20.jar:/opt/cortex/lib/com.google.inject.guice-4.1.0.jar:/opt/cortex/lib/aopalliance.aopalliance-1.0.jar:/opt/cortex/lib/com.google.inject.extensions.guice-assistedinject-4.1.0.jar:/opt/cortex/lib/net.codingwell.scala-guice_2.12-4.1.0.jar:/opt/cortex/lib/com.google.inject.extensions.guice-multibindings-4.1.0.jar:/opt/cortex/lib/com.google.code.findbugs.jsr305-3.0.1.jar:/opt/cortex/lib/org.thehive-project.elastic4play_2.12-1.7.2.jar:/opt/cortex/lib/com.typesafe.play.play-akka-http2-support_2.12-2.6.20.jar:/opt/cortex/lib/com.typesafe.akka.akka-http2-support_2.12-10.0.14.jar:/opt/cortex/lib/com.twitter.hpack-1.0.2.jar:/opt/cortex/lib/org.eclipse.jetty.alpn.alpn-api-1.1.3.v20160715.jar:/opt/cortex/lib/com.sksamuel.elastic4s.elastic4s-core_2.12-5.6.6.jar:/opt/cortex/lib/com.sksamuel.exts.exts_2.12-1.44.0.jar:/opt/cortex/lib/org.typelevel.cats_2.12-0.9.0.jar:/opt/cortex/lib/org.typelevel.cats-macros_2.12-0.9.0.jar:/opt/cortex/lib/com.github.mpilquist.simulacrum_2.12-0.10.0.jar:/opt/cortex/lib/org.typelevel.machinist_2.12-0.6.1.jar:/opt/cortex/lib/org.typelevel.cats-kernel_2.12-0.9.0.jar:/opt/cortex/lib/org.typelevel.cats-kernel-laws_2.12-0.9.0.jar:/opt/cortex/lib/org.scalacheck.scalacheck_2.12-1.13.4.jar:/opt/cortex/lib/org.scala-sbt.test-interface-1.0.jar:/opt/cortex/lib/org.typelevel.discipline_2.12-0.7.2.jar:/opt/cortex/lib/org.typelevel.catalysts-platform_2.12-0.0.5.jar:/opt/cortex/lib/org.typelevel.catalysts-macros_2.12-0.0.5.jar:/opt/cortex/lib/org.typelevel.cats-core_2.12-0.9.0.jar:/opt/cortex/lib/org.typelevel.cats-laws_2.12-0.9.0.jar:/opt/cortex/lib/org.typelevel.cats-free_2.12-0.9.0.jar:/opt/cortex/lib/org.typelevel.cats-jvm_2.12-0.9.0.jar:/opt/cortex/lib/org.apache.lucene.lucene-core-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-analyzers-common-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-backward-codecs-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-grouping-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-highlighter-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-join-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-memory-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-misc-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-queries-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-queryparser-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-sandbox-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-spatial-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-spatial-extras-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-spatial3d-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-suggest-6.6.1.jar:/opt/cortex/lib/net.sf.jopt-simple.jopt-simple-5.0.2.jar:/opt/cortex/lib/com.carrotsearch.hppc-0.7.1.jar:/opt/cortex/lib/org.yaml.snakeyaml-1.15.jar:/opt/cortex/lib/com.fasterxml.jackson.dataformat.jackson-dataformat-smile-2.8.6.jar:/opt/cortex/lib/com.fasterxml.jackson.dataformat.jackson-dataformat-yaml-2.8.6.jar:/opt/cortex/lib/com.fasterxml.jackson.dataformat.jackson-dataformat-cbor-2.8.6.jar:/opt/cortex/lib/org.hdrhistogram.HdrHistogram-2.1.9.jar:/opt/cortex/lib/org.apache.logging.log4j.log4j-api-2.9.1.jar:/opt/cortex/lib/org.elasticsearch.jna-4.4.0-1.jar:/opt/cortex/lib/org.locationtech.spatial4j.spatial4j-0.6.jar:/opt/cortex/lib/com.vividsolutions.jts-1.
13.jar:/opt/cortex/lib/com.sksamuel.elastic4s.elastic4s-streams_2.12-5.6.6.jar:/opt/cortex/lib/com.sksamuel.elastic4s.elastic4s-tcp_2.12-5.6.6.jar:/opt/cortex/lib/io.netty.netty-all-4.1.10.Final.jar:/opt/cortex/lib/org.elasticsearch.client.transport-5.6.2.jar:/opt/cortex/lib/org.elasticsearch.plugin.transport-netty3-client-5.6.2.jar:/opt/cortex/lib/io.netty.netty-3.10.6.Final.jar:/opt/cortex/lib/io.netty.netty-buffer-4.1.13.Final.jar:/opt/cortex/lib/io.netty.netty-codec-4.1.13.Final.jar:/opt/cortex/lib/io.netty.netty-codec-http-4.1.13.Final.jar:/opt/cortex/lib/io.netty.netty-common-4.1.13.Final.jar:/opt/cortex/lib/io.netty.netty-handler-4.1.13.Final.jar:/opt/cortex/lib/io.netty.netty-resolver-4.1.13.Final.jar:/opt/cortex/lib/io.netty.netty-transport-4.1.13.Final.jar:/opt/cortex/lib/org.elasticsearch.plugin.reindex-client-5.6.2.jar:/opt/cortex/lib/org.elasticsearch.client.elasticsearch-rest-client-5.6.2.jar:/opt/cortex/lib/org.apache.httpcomponents.httpclient-4.5.2.jar:/opt/cortex/lib/org.apache.httpcomponents.httpcore-4.4.5.jar:/opt/cortex/lib/org.apache.httpcomponents.httpasyncclient-4.1.2.jar:/opt/cortex/lib/org.apache.httpcomponents.httpcore-nio-4.4.5.jar:/opt/cortex/lib/commons-logging.commons-logging-1.1.3.jar:/opt/cortex/lib/org.elasticsearch.plugin.lang-mustache-client-5.6.2.jar:/opt/cortex/lib/com.github.spullara.mustache.java.compiler-0.9.3.jar:/opt/cortex/lib/org.elasticsearch.plugin.percolator-client-5.6.2.jar:/opt/cortex/lib/org.elasticsearch.plugin.parent-join-client-5.6.2.jar:/opt/cortex/lib/org.apache.logging.log4j.log4j-1.2-api-2.6.2.jar:/opt/cortex/lib/com.tdunning.t-digest-3.1.jar:/opt/cortex/lib/com.sksamuel.elastic4s.elastic4s-xpack-security_2.12-5.6.6.jar:/opt/cortex/lib/org.elasticsearch.client.x-pack-transport-5.6.2.jar:/opt/cortex/lib/org.elasticsearch.plugin.x-pack-api-5.6.2.jar:/opt/cortex/lib/com.unboundid.unboundid-ldapsdk-3.2.0.jar:/opt/cortex/lib/org.bouncycastle.bcprov-jdk15on-1.58.jar:/opt/cortex/lib/org.bouncycastle.bcpkix-jdk15on-1.55.jar:/opt/cortex/lib/com.googlecode.owasp-java-html-sanitizer.owaspjava-html-sanitizer-r239.jar:/opt/cortex/lib/com.sun.mail.javax.mail-1.5.3.jar:/opt/cortex/lib/javax.activation.activation-1.1.jar:/opt/cortex/lib/org.elasticsearch.client.elasticsearch-rest-client-sniffer-5.6.2.jar:/opt/cortex/lib/net.sf.supercsv.super-csv-2.4.0.jar:/opt/cortex/lib/org.scalactic.scalactic_2.12-3.0.5.jar:/opt/cortex/lib/com.floragunn.search-guard-ssl-5.6.9-23.jar:/opt/cortex/lib/org.elasticsearch.plugin.transport-netty4-client-5.6.9.jar:/opt/cortex/lib/org.elasticsearch.elasticsearch-5.6.9.jar:/opt/cortex/lib/org.elasticsearch.securesm-1.2.jar:/opt/cortex/lib/org.reflections.reflections-0.9.11.jar:/opt/cortex/lib/org.javassist.javassist-3.21.0-GA.jar:/opt/cortex/lib/net.lingala.zip4j.zip4j-1.3.2.jar:/opt/cortex/lib/org.thehive-project.cortex-2.1.3-1-assets.jar play.core.server.ProdServerStart

I will run the 'continuous monitoring' command you've outlined above and post the details here after a few hours.

ITServ-DE commented 5 years ago

htop seems to mix up threads and processes: htop shows you each thread, ps only shows processes.

I was just reading http://ask.xmodulo.com/view-threads-process-linux.html on that issue.

ITServ-DE commented 5 years ago

To hide a process's threads in htop, press H.

But what the hell does htop mean by unique PIDs per thread ... stupid thing.

cameronkinsel commented 5 years ago

Yeah, but at this point I'd rather exclude evidence from htop, as it seems to be manufacturing what we think are issues. That way the thread doesn't turn into 'how to use htop' haha

For now I'll continue to monitor resource usage over time using ps

ITServ-DE commented 5 years ago

Just an idea: using ps -T -o "vsz rss comm" -p {PID}, you can monitor the number of threads as well as the memory consumption. You'd potentially have to run it twice, once with the PID of TheHive and once with Cortex (unless you're sure it's a Cortex problem).
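For example (a sketch; assumes the services run under systemd as shown above, so the main PID can be read from systemctl):

# List the threads of the cortex service together with its memory figures.
CORTEX_PID=$(systemctl show -p MainPID cortex.service | cut -d= -f2)
ps -T -o vsz,rss,comm -p "$CORTEX_PID"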

Good luck tracing.

You could also attach jconsole to the JVM and run analysis there, but I don't know more about how to do this. I only know of the existence of that possibility.

ITServ-DE commented 5 years ago

Oh, and of course: track the output of free. This way you get proof that you are running out of memory.
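A sketch of that kind of tracking (the interval and log path are arbitrary):

# Append a timestamped memory snapshot every five minutes.
while true; do
    { date +%FT%T; free -m; echo; } >> /var/tmp/free.log
    sleep 300
done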

cameronkinsel commented 5 years ago

No failures, but we do have a steady increase in RSS usage by Cortex, and a clear decrease in available memory reported by 'free':

Time   VSZ      RSS      Available Mem  Note
9:00   8019672  1158980  7989028        After running for hours
10:00  7942628  913624   11327040       post-reboot
11:00  7949512  1224654  10123448
12:00  7959724  1569408  9132816
13:00  missed   missed   missed
14:00  7959724  1802904  8798216
15:00
16:00
17:00
I'm going to let it keep going, and my expectation is that it will fail once the server runs out of memory, causing the services to crash

cameronkinsel commented 5 years ago

[Screenshots: Last5dDy, Last4Hr]

See attached images. I have two graphs of 'free' data: 5 days (top) and 4 hours (bottom).

The 4-hour image confirms a steady increase in memory usage, as well as cache.

The 5-day image shows a few service restarts, and a flatline where Cortex was disabled but Elastic + TheHive were both enabled, followed by a crash lasting multiple hours, then today's data.

I think this definitely confirms some sort of memory leak in Cortex.service. Now we just need to know how to prevent it.

ITServ-DE commented 5 years ago

Sorry, I disagree with you.

As long as your machine has free memory, it's totally normal that (a) your cache grows and (b) rss tends to approach vsz. This is normal and expected.

From your graphs, especially the lower one, you can clearly see that used is stable, which means the memory allocated by processes does not grow. Good. And you see that cache replaces free, which is normal for filesystem caching.

Free memory is used as buffer cache (filesystem blocks read or written are held in memory as long as possible to avoid re-reading them from disk, which fills the buffer cache at the cost of free over the time the system is running).

And vsz is the virtual process size, whereas rss is the amount of vsz that is actually kept in RAM. Ideally vsz = rss, which means the entire virtual address space of the process is in memory.

Keep on measuring. You should expect free to shrink to a certain value and not beyond it. If your used is stable, you do not have a memory leak. If you had a leak, used would fill up the entire memory of the host, and then by tracking vsz you would also see which process grows.

I hope that helps. Please provide memory charts (those are great!) from the time your system is in trouble.

BTW, I have some trouble with the upper chart, because it does not match the data from the lower one. Both cover the time frame from ~10:00 to 14:00, but they do not match. Do you have an explanation for that?

ITServ-DE commented 5 years ago

Thinking ahead to mitigate your problems.

The JVMs of TheHive and Cortex run with the default settings for the heap size. These are set using the -Xms and -Xmx options of the JVM (source).

To determine the default settings, would you please run java -XX:+PrintFlagsFinal -version | grep -iE 'HeapSize|PermSize|ThreadStackSize' ?

And would you please also provide the settings configured for Elasticsearch? If you didn't tweak them, you will probably have 2 GB for ES:

root@hive:~# ps -ef|grep ^elastic                                                                                                                                                                                  
elastic+  1234     1  0 Aug13 ?        00:10:24 /usr/bin/java -Xms2g -Xmx2g [...]       

With this as a starting point, and assuming you have a dedicated TheHive/Cortex server, we could think about increasing the heap size for Cortex by setting -Xmx to a higher value than the default.

I'm thinking of [Host Memory] - 2 GB (for the OS) - [ES Memory] - [JVM default max heap size for TheHive], and giving the rest to Cortex.

But let me be clear on this right away: I'm no Java specialist, much more an OS guy. Tuning the JVM's memory is more of an educated guess.
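For illustration only, one possible way to raise the Cortex heap, under the assumptions that the packaged /opt/cortex/bin/cortex launcher is a Play-generated start script that forwards -J options to the JVM and that a systemd drop-in is acceptable (the drop-in path and heap values are hypothetical):

# Hypothetical systemd drop-in giving Cortex a 6 GB max heap.
sudo mkdir -p /etc/systemd/system/cortex.service.d
sudo tee /etc/systemd/system/cortex.service.d/heap.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/opt/cortex/bin/cortex -J-Xms1g -J-Xmx6g -Dconfig.file=/etc/cortex/application.conf -Dlogger.file=/etc/cortex/logback.xml -Dpidfile.path=/dev/null
EOF
sudo systemctl daemon-reload && sudo systemctl restart cortex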

ITServ-DE commented 5 years ago

More questions: do you heavily use Cortex? How many analyzer runs do you do each day, and what is the setting of cache.job in /etc/cortex/application.conf?

I think you might analyze heavily, with a long caching time (global or per analyzer). That would fill up Cortex's memory. Decreasing the caching time could then help, as could increasing the JVM's heap size as described before.

cameronkinsel commented 5 years ago

TIL my working knowledge of RAM was pretty far off. Thanks for the detailed explanation!! I'll answer your questions in order:

Graphs don't match:

The top graph covers Aug 9 00:00 - Aug 13 14:30. The bottom graph covers Aug 13 10:00 - Aug 13 14:30.

The small sliver at the end of the top graph is the same data as the bottom graph. The bottom graph is just expanded to be easier to view.

FYI, the charts were created using Elasticsearch, Kibana and Beats agents (all free!). We happen to be doing a PoC and were using our Hive server as one of the test devices, hence accidentally having days of data on hand.

The results of Java config:

ckinsel@hiveapp1:~$ java -XX:+PrintFlagsFinal -version | grep -iE 'HeapSize|PermSize|ThreadStackSize'
intx CompilerThreadStackSize                   = 0                                   {pd product}
uintx ErgoHeapSizeLimit                         = 0                                   {product}
uintx HeapSizePerGCThread                       = 87241520                            {product}
uintx InitialHeapSize                          := 264241152                           {product}
uintx LargePageHeapSizeThreshold                = 134217728                           {product}
uintx MaxHeapSize                              := 4206886912                          {product}
intx ThreadStackSize                           = 1024                                {pd product}
intx VMThreadStackSize                         = 1024                                {pd product}
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)

The results of the Elastic config (default 2 GB):

elastic+  1326     1  1 Aug13 ?        00:14:55 /usr/bin/java -Xms2g -Xmx2g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Djdk.io.permissionsUseCanonicalPath=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -XX:+HeapDumpOnOutOfMemoryError -Des.path.home=/usr/share/elasticsearch -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet -Edefault.path.logs=/var/log/elasticsearch -Edefault.path.data=/var/lib/elasticsearch -Edefault.path.conf=/etc/elasticsearch

Cortex usage (low)

We're still relatively green with Cortex, and run fewer than 50 analyzers per day, probably closer to 20. Those mostly consist of VirusTotal/Urlscan lookups.

Cortex cache.job value (default)

## Cache
#
# If an analyzer is executed against the same observable, the previous report can be returned without re-executing the
# analyzer. The cache is used only if the second job occurs within cache.job (the default is 10 minutes).
cache.job = 10 minutes

Alright, I'll continue to monitor the free data, the same manual chart for cortex.service, and more of those Elasticsearch charts, specifically looking to see whether used memory grows or stays steady.

Also, we haven't had a Cortex/TheHive crash in about 48 hours. So it's either due, or has somehow solved itself.

cameronkinsel commented 5 years ago

Graph 3:

[Screenshot: Screen Shot 2019-08-14 at 9 18 59 AM]

This data is a continuation of graph 2, and covers Aug 13 10:00 - 20:00. The cutoff happens because we're changing some config settings on the logging agents which make the graphs possible; it doesn't correspond to any issue on the server related to Cortex.

It does show that cache eventually used up all of free memory, but it also shows used memory slowly increasing. Unfortunately we needed to reboot the Hive server to apply patches, which has essentially reset our clock. I'll continue to monitor and paste results here when we eventually get another Cortex failure.

ITServ-DE commented 5 years ago

I agree with your interpretation of the graph. You see free runs down to a minimal value which is then kept, and buffer cache is primarily responsible for that. And yes, obviously used mem grows.

A few words on the Java memory stuff. The default value for MaxHeapSize is 4 GB; I understand this as meaning the JVM will not go beyond this point. 4 GB for Hive + 4 GB for Cortex + 2 GB for Elastic leaves 6 GB for all the rest, which is plenty of RAM.

The default setting for caching the analyzer results is fine in combination with 20 or 50 analyses per day, so I don't think excessive caching is the reason for Cortex to run into OutOfMemoryError: Java heap space.

I went back to the beginning of the thread and re-read what your problem is. To me it looks as if Cortex's and TheHive's JVMs run out of memory. There are two possibilities for that: either the JVMs simply need more heap than the default for your workload, or something keeps allocating memory until the heap is exhausted (a leak).

I think the latter is the case, so you should continue recording the per-process memory allocation.

I would recommend running a cron job which records the vsz and rss of every process ID on the host. The goal is to find the process whose vsz and/or rss grows beyond any limit.
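A minimal sketch of such a recorder (the log path is an arbitrary example; call it from cron, e.g. every 10 minutes):

#!/bin/bash
# Append a timestamped per-process snapshot of vsz/rss, largest first.
{
    date +%FT%T
    ps -e -o pid,vsz,rss,comm --sort=-rss
    echo
} >> /var/tmp/proc-mem.log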

cameronkinsel commented 5 years ago

Thank you for sticking with me to this point. I'll create a cron job to measure this and attempt to find the rogue process, but I'm still thinking it's Cortex.

In the very first graph I attached, the memory flatline was at a time when the server was operating normally, except with the Cortex service disabled. Once I re-enabled Cortex, the memory started a constant downward slope.

In theory, if I were to disable Cortex and not see 'used' continuously grow over time, then re-enable it and see constant growth, I've identified the process, correct?

The below image covers Aug 14 8:00 - Aug 15 8:00: [Screenshot: Screen Shot 2019-08-15 at 8 20 41 AM]

I am going to write the cron job, and disable Cortex again for 2 hours, then re-enable it. If we see a flatline, then a steady decrease again after it is enabled, we have the culprit.

ITServ-DE commented 5 years ago

I believe your assumption is right. If you shut down Cortex and you have a steady used, you have actually identified a growing process. The question is: is this growth normal?

cameronkinsel commented 5 years ago

I believe it's abnormal, as the out-of-memory issue has not been observed by us in the past, and we've not made any configuration changes to TheHive/Cortex for months. Unfortunately, my system resource usage statistics only started being logged after we first noticed the issue, so I'm unable to look back at my earlier performance metrics.

Actually, you would be a good candidate to benchmark against, since you have a healthy Cortex instance. How much memory does your cortex.service utilize while running normally? Does it grow over hours and then settle around 2-4 GB? Or stay fairly constant after starting?

cameronkinsel commented 5 years ago

[Screenshot: Screen Shot 2019-08-16 at 8 18 56 AM]

Notable timestamps:
00:00 - 08:00: Steady increase in memory usage
08:00: Stop cortex.service
08:00 - 10:30: Steady memory usage
10:30: Start cortex.service
10:30 - 14:00: Steady increase in memory usage
14:00: Restart thehive.service and elasticsearch.service
14:00 - 03:00: Steady increase in memory usage

This pegs the memory growth to cortex.service. The other Java processes were executing normally while Cortex was down, with no growth. Then immediately after enabling it again, the growth began again.

As you've stated, it's possible that cortex.service memory usage growing toward the Java heap size is normal, so I'm still interested in a comparison with a healthy Cortex environment.

ITServ-DE commented 5 years ago

I only have Cortex/Hive on an unmonitored test server, and only very little is done with it. As said, it's a test server.

I will set up memory reporting like yours, but I'm not too optimistic we'll see anything of interest.

ITServ-DE commented 5 years ago

OK, here are a few hours of memory allocation on my test machine, starting at 8:15 am and ending at 1:15 pm.

[Screenshot: ss004]

The machine hosts Elasticsearch 5.6, Hive 3.4.0-RC2, Cortex 3.0.0-RC4 and MISP (with its database). The first little spike was a restart of Cortex, then the machine was idle for quite some time. At the end, I ran about 350 analyses on observables and looked them up in MISP.

I cannot tell if the growth of used is due to caching in Cortex or database caching from MISP. So this was probably not a well-thought-out analysis to run :-(

You can't see much, because I did not track vsz and rss for each of Hive/Cortex/ES/MySQL, but you can see that when the machine sits idle, without any user doing anything, it does not leak memory. There are spikes, but they return to their original value.

ITServ-DE commented 5 years ago

I now additionally run the following data collector for the memory consumption of Cortex, TheHive and ES:

#!/bin/bash
# Collect vsz/rss for the elasticsearch, thehive and cortex processes and
# print them as a single tab-separated line (sorted by user name).

TEMP=$(mktemp)
ps -e -o "user vsz rss" | egrep 'elastic|thehive|cortex' | sort >"$TEMP"

# Print columns 2 (vsz) and 3 (rss) for each matching process.
awk '{printf("%d\t%d\t",$2,$3)}' < "$TEMP"
echo

rm "$TEMP"

running it gives the following output:

root@hive:~/bin# ./ppmemcollect                                                                                                                                                                                    
5905540 633468  6071136 2678580 5905792 641388  

The first two columns are vsz and rss for Cortex, the next two for ES, and the last two for TheHive. I'll let this run for 24h and see how the memory consumption of the JVMs develops.
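One way to run such a collector periodically and keep a history (the schedule and log path are arbitrary examples; note that % must be escaped inside a crontab) would be a root crontab entry like:

# Every 10 minutes: prepend a timestamp and append the collector output to a log.
*/10 * * * * (date +\%FT\%T; /root/bin/ppmemcollect) >> /var/log/ppmem.log 2>&1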

cameronkinsel commented 5 years ago

Very interesting. Your chart is totally what I would expect from the system after being online for a while.

You are also running a newer version of TheHive, and a much newer version of Cortex. I'm going to clone and test these updates to see if maybe this is addressed by one of them, and I would also like to test increasing the heap size to 6 or 8 GB.

ITServ-DE commented 5 years ago

I would really stick with the default values. These fit very well on my 8 GB machine, so they would certainly suit your host too. You could increase memory to tune your installation once your problems are gone, but better not now.

And from the lower chart you can see that more memory will likely do nothing, as RSS is significantly lower than VSZ. The only beneficial tuning I see would be increasing memory for Elasticsearch if you have a huge database.

However ...

Here are the newest charts.

14:00 - Cortex restart
15:00 - DNS lookups for 390 IP addresses
23:00 - Cortex restart
00:00 up to now - every minute, one random IP address is looked up

And the memory consumption looks beautiful:

[Screenshot: Snag_42febd74]

Now the virtual process sizes (vsz) and the actual memory usage (rss) for Cortex, ES and TheHive:

[Screenshot: Snag_4316cf9e]

Everything is smooth and flat, except for Cortex, which has a very low RSS at the beginning, growing to a somewhat fixed value.

cameronkinsel commented 5 years ago

Haha those charts are very nice.

Last 24 hours: [Screenshot: Screen Shot 2019-08-20 at 7 55 56 AM]

Summary: memory usage increases without fail. TheHive/Cortex see almost zero usage after hours, yet we still see the increase.

This morning, none of the services have failed:

ckinsel@hiveapp1:~$ sudo systemctl status thehive.service 
[sudo] password for ckinsel: 
● thehive.service - TheHive
   Loaded: loaded (/usr/lib/systemd/system/thehive.service; enabled; vendor pres
   Active: active (running) since Mon 2019-08-19 08:37:52 EDT; 23h ago
     Docs: https://thehive-project.org
 Main PID: 13657 (java)
    Tasks: 73
   Memory: 4.4G
      CPU: 19min 11.439s
   CGroup: /system.slice/thehive.service
           └─13657 java -Duser.dir=/opt/thehive -Dconfig.file=/etc/thehive/appli

Warning: Journal has been rotated since unit was started. Log output is incomple
lines 1-12/12 (END)
ckinsel@hiveapp1:~$ sudo systemctl status cortex.service 
● cortex.service - cortex
   Loaded: loaded (/etc/systemd/system/cortex.service; enabled; vendor preset: e
   Active: active (running) since Mon 2019-08-19 08:51:40 EDT; 23h ago
     Docs: https://thehive-project.org
 Main PID: 17618 (java)
    Tasks: 47
   Memory: 3.4G
      CPU: 2h 43min 31.997s
   CGroup: /system.slice/cortex.service
           └─17618 java -Duser.dir=/opt/cortex -Dconfig.file=/etc/cortex/applica

Warning: Journal has been rotated since unit was started. Log output is incomple
lines 1-12/12 (END)
ckinsel@hiveapp1:~$ sudo systemctl status elasticsearch.service 
● elasticsearch.service - Elasticsearch
   Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; enabled; vendo
   Active: active (running) since Wed 2019-08-14 09:08:45 EDT; 5 days ago
     Docs: http://www.elastic.co
 Main PID: 1331 (java)
    Tasks: 78
   Memory: 2.5G
      CPU: 1h 27min 18.434s
   CGroup: /system.slice/elasticsearch.service
           └─1331 /usr/bin/java -Xms2g -Xmx2g -XX:+UseConcMarkSweepGC -XX:CMSIni

Warning: Journal has been rotated since unit was started. Log output is incomple

ckinsel@hiveapp1:~$ ps -e -o "user vsz rss" | egrep 'elastic|thehive|cortex'
elastic+ 6932032 2676004
thehive  8151512 4634072
cortex   7969652 3682164

At this point I'm really not sure what direction to go. I'll track vsz and rss for TheHive + Cortex after the next crash (from fresh) so that I can compare against the behavior you described above.

I know for sure that by stopping cortex.service I immediately stop the growth in memory usage, so whatever growth is happening, it is definitely related to Cortex running somehow.

ITServ-DE commented 5 years ago

If you monitor vsz and rss, I'd expect to see rss growing towards vsz until it reaches its limit and then, let me guess, Cortex crashes. Based on the assumption that Cortex is the candidate, as you said before.

I believe your assumption, but as an IT pro I can only be sure once I've seen vsz and rss logs. :-)

From my point of view, you have various options, with varying levels of effort.

1) Change your JVM. If you run OpenJDK, switch to the Oracle JVM and vice versa, to prove the JVM guilty or not guilty. Maybe the latest update to your JVM introduced a memory leak. This would be my favorite.
2) Install jconsole and use the JVM-internal tools for monitoring memory consumption (there, I'm totally out of my depth; I have no clue how this works).
3) Upgrade your Cortex. The 3.0.0-RC4 I run did not cause any problems for me, but it's only a test environment. Your mileage may vary in production.
4) Spill fuel over your server and light up the whole thing. This will probably fix the issue for quite a long time - at least for you -, depending on your local laws and the period of imprisonment you will get for fire-starting. No, wait ... don't do it!

cameronkinsel commented 5 years ago

Working on the vsz/rss stats now. For reference, our java -version output (spoiler: it's OpenJDK):

ckinsel@hiveapp1:~$ java -version
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)

As for 4. , I do have my red stapler with me...

ITServ-DE commented 5 years ago

Well. openjdk version "1.8.0_222"? Version one-dot-eight???

I run

openjdk version "11.0.3" 2019-04-16
OpenJDK Runtime Environment (build 11.0.3+7-Ubuntu-1ubuntu218.04.1)
OpenJDK 64-Bit Server VM (build 11.0.3+7-Ubuntu-1ubuntu218.04.1, mixed mode, sharing)
cameronkinsel commented 5 years ago

Running sudo update-alternatives --config java reports that OpenJDK 8 headless is actually installed.

We've updated to OpenJDK 11 to match yours, and started everything. A very early look does not seem to show measurable RAM usage growth over time so far. I'll be out of the office until Monday. If we are able to run with no issues until then, I'll consider this issue solved, with updating Java being the solution.

cameronkinsel commented 5 years ago

So, installing the newer OpenJDK 11 definitely changed the situation. TheHive + Cortex have been running for 5 days with no crashes, and only a very minor increase in memory usage over time (somewhat expected, as there is still quite a large pool of free memory).

The below graph represents 5 days of data: [Screenshot: Screen Shot 2019-08-26 at 9 35 19 AM]

I'm going to let things run, and if I get no issues, I will resolve this case by the end of the week.

ITServ-DE commented 5 years ago

@cameronkinsel , can this issue be closed?

vi-or-die commented 5 years ago

@github-pba I have been working with Cameron on this. After updating Java, the server no longer crashes or sees slow/rapid loss of memory. Yet now it has started consuming so much CPU time that everything just stops running. As you can see below, we are maxing out the CPU with TheHive. We ensured that all patches were installed yesterday and even upgraded TheHive to version 3.4.0-1. Any thoughts?

Service Information

james.cribbs@hiveapp1:~$ sudo systemctl status thehive.service
[sudo] password for james.cribbs: 
Sorry, try again.
[sudo] password for james.cribbs: 
● thehive.service - TheHive
   Loaded: loaded (/usr/lib/systemd/system/thehive.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2019-09-12 18:06:06 UTC; 1h 15min ago
     Docs: https://thehive-project.org
 Main PID: 26705 (java)
    Tasks: 56
   Memory: 797.6M
      CPU: 4h 7min 39.294s
   CGroup: /system.slice/thehive.service
           └─26705 java -Duser.dir=/opt/thehive -Dconfig.file=/etc/thehive/application.conf -Dlogger.file=/etc/thehive/logback.xml -Dpidfile.path=/dev/null -cp /opt/thehive/lib/../conf/:/opt/thehive/lib/org.thehive-project.thehive-3.4.0-1

Sep 12 18:06:06 hiveapp1 systemd[1]: Started TheHive.

Screenshot of memory & CPU

[Screenshot: Screen Shot 2019-09-12 at 3 25 48 PM]

Screenshot of HTOP

[Screenshot: Screen Shot 2019-09-12 at 3 36 15 PM]

ps -eo pcpu,pid,user,args | sort -k1 -r -n | head -10
 323 26705 thehive  java -Duser.dir=/opt/thehive -Dconfig.file=/etc/thehive/application.conf -Dlogger.file=/etc/thehive/logback.xml -Dpidfile.path=/dev/null -cp /opt/thehive/lib/../conf/:/opt/thehive/lib/org.thehive-project.thehive-3.4.0-1-sans-externalized.jar:/opt/thehive/lib/org.thehive-project.thehivebackend-3.4.0-1.jar:/opt/thehive/lib/org.thehive-project.thehivemisp-3.4.0-1.jar:/opt/thehive/lib/org.thehive-project.thehivecortex-3.4.0-1.jar:/opt/thehive/lib/org.scala-lang.scala-library-2.12.6.jar:/opt/thehive/lib/com.typesafe.play.twirl-api_2.12-1.3.15.jar:/opt/thehive/lib/org.scala-lang.modules.scala-xml_2.12-1.0.6.jar:/opt/thehive/lib/com.typesafe.play.play-server_2.12-2.6.23.jar:/opt/thehive/lib/com.typesafe.play.play_2.12-2.6.23.jar:/opt/thehive/lib/com.typesafe.play.build-link-2.6.23.jar:/opt/thehive/lib/com.typesafe.play.play-exceptions-2.6.23.jar:/opt/thehive/lib/com.typesafe.play.play-netty-utils-2.6.23.jar:/opt/thehive/lib/org.slf4j.slf4j-api-1.7.25.jar:/opt/thehive/lib/org.slf4j.jul-to-slf4j-1.7.25.jar:/opt/thehive/lib/org.slf4j.jcl-over-slf4j-1.7.25.jar:/opt/thehive/lib/com.typesafe.play.play-streams_2.12-2.6.23.jar:/opt/thehive/lib/org.reactivestreams.reactive-streams-1.0.2.jar:/opt/thehive/lib/com.typesafe.akka.akka-stream_2.12-2.5.21.jar:/opt/thehive/lib/com.typesafe.akka.akka-actor_2.12-2.5.21.jar:/opt/thehive/lib/com.typesafe.config-1.3.3.jar:/opt/thehive/lib/org.scala-lang.modules.scala-java8-compat_2.12-0.8.0.jar:/opt/thehive/lib/com.typesafe.akka.akka-protobuf_2.12-2.5.21.jar:/opt/thehive/lib/com.typesafe.ssl-config-core_2.12-0.3.7.jar:/opt/thehive/lib/com.typesafe.akka.akka-slf4j_2.12-2.5.21.jar:/opt/thehive/lib/com.fasterxml.jackson.datatype.jackson-datatype-jdk8-2.8.11.jar:/opt/thehive/lib/com.fasterxml.jackson.datatype.jackson-datatype-jsr310-2.8.11.jar:/opt/thehive/lib/commons-codec.commons-codec-1.11.jar:/opt/thehive/lib/com.typesafe.play.play-json_2.12-2.6.12.jar:/opt/thehive/lib/com.typesafe.play.play-functional_2.12-2.6.12.jar:/opt/thehive/lib/org.scala-lang.scala-reflect-2.12.6.jar:/opt/thehive/lib/org.typelevel.macro-compat_2.12-1.1.1.jar:/opt/thehive/lib/joda-time.joda-time-2.9.9.jar:/opt/thehive/lib/org.checkerframework.checker-compat-qual-2.0.0.jar:/opt/thehive/lib/com.google.errorprone.error_prone_annotations-2.1.3.jar:/opt/thehive/lib/com.google.j2objc.j2objc-annotations-1.1.jar:/opt/thehive/lib/org.codehaus.mojo.animal-sniffer-annotations-1.14.jar:/opt/thehive/lib/io.jsonwebtoken.jjwt-0.7.0.jar:/opt/thehive/lib/javax.xml.bind.jaxb-api-2.3.1.jar:/opt/thehive/lib/javax.activation.javax.activation-api-1.2.0.jar:/opt/thehive/lib/org.apache.commons.commons-lang3-3.6.jar:/opt/thehive/lib/javax.transaction.jta-1.1.jar:/opt/thehive/lib/javax.inject.javax.inject-1.jar:/opt/thehive/lib/com.typesafe.play.filters-helpers_2.12-2.6.23.jar:/opt/thehive/lib/com.typesafe.play.play-logback_2.12-2.6.23.jar:/opt/thehive/lib/ch.qos.logback.logback-classic-1.2.3.jar:/opt/thehive/lib/ch.qos.logback.logback-core-1.2.3.jar:/opt/thehive/lib/com.typesafe.play.play-akka-http-server_2.12-2.6.23.jar:/opt/thehive/lib/com.typesafe.akka.akka-http-core_2.12-10.0.15.jar:/opt/thehive/lib/com.typesafe.akka.akka-parsing_2.12-10.0.15.jar:/opt/thehive/lib/org.apache.logging.log4j.log4j-to-slf4j-2.9.1.jar:/opt/thehive/lib/com.typesafe.play.play-ehcache_2.12-2.6.23.jar:/opt/thehive/lib/com.typesafe.play.play-cache_2.12-2.6.23.jar:/opt/thehive/lib/net.sf.ehcache.ehcache-2.10.6.jar:/opt/thehive/lib/org.ehcache.jcache-1.0.1.jar:/opt/thehive/lib/javax.cache.cache-api-1.0.0.jar:/opt/thehive/lib/c
om.typesafe.play.play-ws_2.12-2.6.23.jar:/opt/thehive/lib/com.typesafe.play.play-ws-standalone_2.12-1.1.13.jar:/opt/thehive/lib/com.typesafe.play.play-ws-standalone-xml_2.12-1.1.13.jar:/opt/thehive/lib/com.typesafe.play.play-ws-standalone-json_2.12-1.1.13.jar:/opt/thehive/lib/com.typesafe.play.play-ahc-ws_2.12-2.6.23.jar:/opt/thehive/lib/com.typesafe.play.play-ahc-ws-standalone_2.12-1.1.13.jar:/opt/thehive/lib/com.typesafe.play.cachecontrol_2.12-1.1.4.jar:/opt/thehive/lib/org.scala-lang.modules.scala-parser-combinators_2.12-1.1.0.jar:/opt/thehive/lib/org.joda.joda-convert-1.9.2.jar:/opt/thehive/lib/com.typesafe.play.shaded-asynchttpclient-1.1.13.jar:/opt/thehive/lib/com.typesafe.play.shaded-oauth-1.1.13.jar:/opt/thehive/lib/com.typesafe.play.play-guice_2.12-2.6.23.jar:/opt/thehive/lib/aopalliance.aopalliance-1.0.jar:/opt/thehive/lib/com.google.inject.extensions.guice-assistedinject-4.1.0.jar:/opt/thehive/lib/net.codingwell.scala-guice_2.12-4.2.3.jar:/opt/thehive/lib/com.google.inject.guice-4.2.2.jar:/opt/thehive/lib/com.google.guava.guava-25.1-android.jar:/opt/thehive/lib/com.google.code.findbugs.jsr305-3.0.2.jar:/opt/thehive/lib/org.thehive-project.elastic4play_2.12-1.11.5.jar:/opt/thehive/lib/com.typesafe.play.play-akka-http2-support_2.12-2.6.23.jar:/opt/thehive/lib/com.typesafe.akka.akka-http2-support_2.12-10.0.15.jar:/opt/thehive/lib/com.twitter.hpack-1.0.2.jar:/opt/thehive/lib/org.eclipse.jetty.alpn.alpn-api-1.1.3.v20160715.jar:/opt/thehive/lib/com.sksamuel.elastic4s.elastic4s-core_2.12-6.5.1.jar:/opt/thehive/lib/com.sksamuel.exts.exts_2.12-1.60.0.jar:/opt/thehive/lib/com.fasterxml.jackson.core.jackson-core-2.9.6.jar:/opt/thehive/lib/com.fasterxml.jackson.core.jackson-databind-2.9.6.jar:/opt/thehive/lib/com.fasterxml.jackson.module.jackson-module-scala_2.12-2.9.6.jar:/opt/thehive/lib/com.fasterxml.jackson.core.jackson-annotations-2.9.6.jar:/opt/thehive/lib/com.fasterxml.jackson.module.jackson-module-paranamer-2.9.6.jar:/opt/thehive/lib/com.thoughtworks.paranamer.paranamer-2.8.jar:/opt/thehive/lib/com.sksamuel.elastic4s.elastic4s-http-streams_2.12-6.5.1.jar:/opt/thehive/lib/com.sksamuel.elastic4s.elastic4s-http_2.12-6.5.1.jar:/opt/thehive/lib/org.elasticsearch.client.elasticsearch-rest-client-6.5.2.jar:/opt/thehive/lib/org.apache.httpcomponents.httpclient-4.5.2.jar:/opt/thehive/lib/org.apache.httpcomponents.httpcore-4.4.5.jar:/opt/thehive/lib/org.apache.httpcomponents.httpasyncclient-4.1.2.jar:/opt/thehive/lib/org.apache.httpcomponents.httpcore-nio-4.4.5.jar:/opt/thehive/lib/commons-logging.commons-logging-1.1.3.jar:/opt/thehive/lib/org.scalactic.scalactic_2.12-3.0.5.jar:/opt/thehive/lib/org.bouncycastle.bcprov-jdk15on-1.58.jar:/opt/thehive/lib/net.lingala.zip4j.zip4j-1.3.2.jar:/opt/thehive/lib/org.reflections.reflections-0.9.11.jar:/opt/thehive/lib/org.javassist.javassist-3.21.0-GA.jar:/opt/thehive/lib/com.typesafe.akka.akka-cluster_2.12-2.5.21.jar:/opt/thehive/lib/com.typesafe.akka.akka-remote_2.12-2.5.21.jar:/opt/thehive/lib/io.netty.netty-3.10.6.Final.jar:/opt/thehive/lib/io.aeron.aeron-driver-1.15.1.jar:/opt/thehive/lib/org.agrona.agrona-0.9.31.jar:/opt/thehive/lib/io.aeron.aeron-client-1.15.1.jar:/opt/thehive/lib/com.typesafe.akka.akka-cluster-tools_2.12-2.5.21.jar:/opt/thehive/lib/org.thehive-project.thehive-3.4.0-1-assets.jar play.core.server.ProdServerStart
 1.6  1322 elastic+ /usr/bin/java -Xms2g -Xmx2g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Djdk.io.permissionsUseCanonicalPath=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -XX:+HeapDumpOnOutOfMemoryError -Des.path.home=/usr/share/elasticsearch -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet -Edefault.path.logs=/var/log/elasticsearch -Edefault.path.data=/var/lib/elasticsearch -Edefault.path.conf=/etc/elasticsearch
vi-or-die commented 5 years ago

Decided I wanted to dig deeper into this issue myself. I found this article that describes how to identify the thread causing problems in a Java process. Below is the result of this investigation.

  1. Determine the process ID (PID) of the affected server process using the following command:

    james.cribbs@hiveapp1:~$ top
    [...]
    29519 thehive   20   0 8054624 0.996g  28720 S 395.0  6.4 575:03.64 java
    [...]
  2. Determine which thread in the PID identified in step 1 is consuming the CPU

    
    james.cribbs@hiveapp1:~$ top -n 1 -H -p 29519

top - 17:10:59 up 2 days, 23:55,  2 users,  load average: 8.60, 8.09, 8.02
Threads:  62 total,   9 running,  53 sleeping,   0 stopped,   0 zombie
%Cpu(s): 70.6 us,  0.5 sy,  0.0 ni, 28.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16432288 total,  6473660 free,  5272176 used,  4686452 buff/cache
KiB Swap:   999420 total,   999420 free,        0 used. 10698876 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29607 thehive 20 0 8054624 0.996g 28776 R 68.8 6.4 47:26.80 application-akk
29605 thehive 20 0 8054624 0.996g 28776 R 50.0 6.4 101:52.21 application-akk
29606 thehive 20 0 8054624 0.996g 28776 R 43.8 6.4 47:54.49 application-akk
29616 thehive 20 0 8054624 0.996g 28776 R 43.8 6.4 103:59.84 application-akk
29625 thehive 20 0 8054624 0.996g 28776 R 43.8 6.4 101:35.80 application-akk
29627 thehive 20 0 8054624 0.996g 28776 R 37.5 6.4 47:28.03 application-akk
29628 thehive 20 0 8054624 0.996g 28776 R 37.5 6.4 250:11.96 application-akk
29626 thehive 20 0 8054624 0.996g 28776 R 31.2 6.4 100:51.50 application-akk
29519 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.02 java
29581 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:10.00 java
29582 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:51.41 GC Thread#0
29583 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.00 G1 Main Marker
29584 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:01.28 G1 Conc#0
29585 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.00 G1 Refine#0
29586 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:26.43 G1 Young RemSet
29587 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 1:37.96 VM Thread
29588 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.04 Reference Handl
29589 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.31 Finalizer
29590 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.00 Signal Dispatch
29591 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 3:22.20 C2 CompilerThre
29592 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:28.79 C1 CompilerThre
29593 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 1:05.23 Sweeper thread
29594 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.00 Service Thread
29595 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:46.83 VM Periodic Tas
29596 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.07 Common-Cleaner
29597 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:51.59 GC Thread#1
29598 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:51.33 GC Thread#2
29599 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:51.37 GC Thread#3
29601 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.02 AsyncAppender-W
29602 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.02 AsyncAppender-W
29604 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 3:52.67 application-sch
29610 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:05.33 New I/O worker
29611 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:05.33 New I/O worker
29612 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:05.95 New I/O boss #3
29613 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:05.41 New I/O worker
29614 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:05.51 New I/O worker
29615 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.00 New I/O server
29617 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.00 DEFAULT
29619 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:03.55 pool-1-thread-1
29620 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:04.18 I/O dispatcher
29621 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:04.71 I/O dispatcher
29622 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:26.99 I/O dispatcher
29623 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:21.43 I/O dispatcher
29624 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.08 Statistics Thre
29629 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:25.06 pool-2-thread-1
29630 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:24.85 pool-3-thread-1
29631 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.00 com.google.comm
29632 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:25.85 pool-4-thread-1
29633 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:25.12 pool-5-thread-1
29634 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.55 ObjectCleanerTh
29635 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:04.02 AsyncHttpClient
29636 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:07.75 AsyncHttpClient
29639 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:09.21 application-akk
27420 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.48 AsyncHttpClient
27459 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.43 AsyncHttpClient
27478 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.43 AsyncHttpClient


3. Produce a stack trace for the PID identified in step 1 using the following command:

james.cribbs@hiveapp1:~$ sudo jstack 29519 > jstack-output-29519.txt

4. Convert the thread ID identified as problematic in step 2 to a hex value: `29607` becomes `73a7` (see the one-liner after step 5).
5. Search the stack trace output for this hex value using grep. You are looking for a thread nid that matches it:

james.cribbs@hiveapp1:~$ cat jstack-output-29519.txt | grep -i 73A7
"application-akka.actor.default-dispatcher-4" #17 prio=5 os_prio=0 cpu=2467719.98ms elapsed=76490.62s tid=0x00007fe9d1c5b800 nid=0x73a7 runnable [0x00007fe9715de000]
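For reference, the decimal-to-hex conversion in step 4 can be done with a printf one-liner:

# Convert the CPU-hungry thread ID from top (decimal) to the hex nid used by jstack.
printf '%x\n' 29607    # prints 73a7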



According to the end of that post, it looks like the issue is related to whatever `application-akka.actor.default-dispatcher-4` is doing.

ITServ-DE commented 5 years ago

Cameron, James,

Thank you for sharing your thoughts and your investigation results here. I'm sad you/Cameron traded one problem for another.

I'm on vacation for the next two weeks, so I can provide no support. However, you dug so deep, I don't think I can help any further. But one more idea:

You have a process running as wild as it can, and we know that's uncommon (as there are probably hundreds of installations working well). I would now use strace to trace the system calls the JVM executes while TheHive sits idle. With some luck, you can track down what the JVM tries to do, fails at, and immediately retries. Maybe you can track it to network activity and relate it to some special network hardening you do for your servers, or something like this.

There are statistics and profiling options for strace (see man strace); these will quickly show you what OS calls are being made on your machine. If you're unlucky, the JVM spins in a circle outside of system calls.
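For example (a sketch; -c prints a per-syscall summary, -f follows all of the JVM's threads):

# Attach strace to the running TheHive JVM; let it run for a minute or two
# while the CPU is pegged, then press Ctrl-C to print the summary table.
THEHIVE_PID=$(systemctl show -p MainPID thehive.service | cut -d= -f2)
sudo strace -c -f -p "$THEHIVE_PID"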

I hope you'll have success, and I will probably follow the thread when I'm back from my holidays.

cameronkinsel commented 5 years ago

Travel safe and thanks for sticking with us for so long! We'll post data here if we are able to find the root cause.

ITServ-DE commented 5 years ago

When your server no longer works, can you still ping and login, or is the system so unresponsive you have to reset it?

I have trouble with a mail server which occasionally saturates one of four cores and no longer reacts to anything. No ping, no login, no fun. This has been happening since I upgraded the OS. Maybe there is something with the kernel...?

vi-or-die commented 5 years ago

When it bottoms out we can't use things like systemctl stop thehive.service to shut it down. It requires kill -9 [pid]. This means that if the service is hung, restarting the box hangs during shutdown and requires a hard reset. We haven't seen it lock up to the point where ping doesn't work. We will take that strace command for a spin and keep working. Once again, thanks for your help!


BrijJhala commented 4 years ago

The memory leak issue is caused by the Akka/Play Framework HTTP/2 ALPN support module: https://github.com/akka/akka-http/issues/2462#issuecomment-703934326. I have provided a detailed explanation of the heap dump there.

BrijJhala commented 4 years ago

Right now the quick fix is to disable HTTP/2 with http2.enabled=no in application.conf, @nadouani. We have run JMeter tests and done profiling: there are no memory leaks after this setting. Unfortunately HTTP/1.1 is slower than HTTP/2, but that is better than having memory leaks. The memory leaks usually occurred after 48k Cortex jobs with a 1.5 GB heap size. More detail about the memory leaks and the profiling results is in the comment linked above, on akka-http.
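For illustration, in Cortex's HOCON application.conf that setting would presumably look like the line below (the key name is taken from the comment above; its exact position in the config tree is an assumption):

# Fall back to HTTP/1.1 to avoid the Akka HTTP/2 ALPN memory leak described above.
http2.enabled = no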

nadouani commented 4 years ago

Quick question: do you use SSL/HTTPS on Cortex directly, without a reverse proxy?

BrijJhala commented 4 years ago

Hi @nadouani, we do use SSL/HTTPS but not a reverse proxy. The Cortex binary is bundled in our custom Dockerfile and deployed on an OpenShift pod, so Cortex is running in headless mode.

nadouani commented 4 years ago

That's the reason for the memory leak then. This is a bug in Akka, and that's why we don't recommend enabling SSL directly in Cortex: https://github.com/TheHive-Project/CortexDocs/blob/master/admin/admin-guide.md#https

We try to upgrade the version of Play Framework whenever possible, so if that issue is solved in Akka, we will fix it by upgrading Play Framework to the version that includes the Akka fix.

BrijJhala commented 4 years ago

Thank you @nadouani, I appreciate your response and guidance. Initially I followed the DigitalOcean SSL configuration guide that was mentioned in the Cortex docs. Now we will either explore nginx or disable HTTP/2 and proceed for now.