Open cameronkinsel opened 5 years ago
on my Hive/Cortex/MISP server, I only have three running Java processes: one for TheHive, one for Cortex, one for Elasticsearch.
If I read your screenshot right, you have multiple Cortex/TheHive instances running simultaneously. For example, I can quickly identify several Cortex PIDs: 1270, 1805, 1806, 1808 and 1817.
That does not seem right, and that would be the first thing I would look at.
I would run systemctl disable thehive; systemctl disable cortex
to disable their automatic start, then reboot the server. When the host comes up again, I'd make sure no TheHive/Cortex process is running.
Then I would manually start TheHive and Cortex with /etc/init.d/thehive start; /etc/init.d/cortex start
and check that only one instance (= one java process) is running for each of them.
That would be my starting point.
From that point, I would start tracking the number of java processes on the system. Maybe some cron job tries to restart TheHive/Cortex but accidentally only starts new processes, which over time eat up your memory.
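For instance, a quick periodic sanity check could be something as simple as the following (a minimal sketch; run it from cron or a shell loop):
# count the running java processes - on a healthy host this should stay at 3
# (TheHive, Cortex, Elasticsearch)
pgrep -c java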
For completeness, here is what runs on my host:
root@hive:~# ps -eo "pid user args" | grep java | grep -v grep| cut -c1-120
9514 thehive java -Duser.dir=/opt/thehive -Dconfig.file=/etc/thehive/application.conf -Dlogger.file=/etc/thehive/logba
31875 cortex java -Duser.dir=/opt/cortex -Dconfig.file=/etc/cortex/application.conf -Dlogger.file=/etc/cortex/logback.
32418 elastic+ /usr/bin/java -Xms2g -Xmx2g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInit
Nice and clean, one java instance for each of them.
@github-pba Thanks for the reply!
Yesterday we ran the server with cortex.service stopped, and came in this morning with no failures. If it is something like you mentioned, where multiple instances are being executed, I believe cortex.service is the culprit.
The output I had generated before was using htop. When I run your 'ps -eo' command I get the same result as you:
1348 elastic+ /usr/bin/java -Xms2g -Xmx2g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInit
25670 cortex java -Duser.dir=/opt/cortex -Dconfig.file=/etc/cortex/application.conf -Dlogger.file=/etc/cortex/logback.
28670 thehive java -Duser.dir=/opt/thehive -Dconfig.file=/etc/thehive/application.conf -Dlogger.file=/etc/thehive/logba
However, at the time I executed the above, I was not seeing the resource issue. I will check throughout today, and if we notice the excessive resource usage, I'll run it again and compare here.
You could monitor your resource consumption by periodically calling ps -eo "vsz rss args"
and doing some magic with the output (i.e. sorting it). vsz is the virtual process size, rss the resident set size, meaning how much memory is physically allocated.
By tracking these, you should be able to determine whether a process grows.
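For example, a quick snapshot sorted by resident memory (a sketch using standard procps options):
# top 10 memory consumers by rss, with vsz alongside for comparison
ps -eo "vsz rss args" --sort=-rss | head -n 10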
With more digging, it seems like my understanding of htop is flawed, because even now (when I'm not having an issue) I'm seeing many unique PIDs for Java, but when I use 'top' or 'ps -ef' I'm only seeing one each for cortex\elastic\thehive.
Interestingly, the 'command' for cortex\thehive through 'ps' is insanely long. The below is one PID:
ckinsel@hiveapp1:~$ ps -ef | grep java
cortex 25670 1 3 07:56 ? 00:00:39 java -Duser.dir=/opt/cortex -Dconfig.file=/etc/cortex/application.conf -Dlogger.file=/etc/cortex/logback.xml -Dpidfile.path=/dev/null -cp /opt/cortex/lib/../conf/:/opt/cortex/lib/org.thehive-project.cortex-2.1.3-1-sans-externalized.jar:/opt/cortex/lib/org.scala-lang.scala-library-2.12.7.jar:/opt/cortex/lib/com.typesafe.play.twirl-api_2.12-1.3.15.jar:/opt/cortex/lib/org.scala-lang.modules.scala-xml_2.12-1.0.6.jar:/opt/cortex/lib/com.typesafe.play.play-server_2.12-2.6.20.jar:/opt/cortex/lib/com.typesafe.play.play_2.12-2.6.20.jar:/opt/cortex/lib/com.typesafe.play.build-link-2.6.20.jar:/opt/cortex/lib/com.typesafe.play.play-exceptions-2.6.20.jar:/opt/cortex/lib/com.typesafe.play.play-netty-utils-2.6.20.jar:/opt/cortex/lib/org.slf4j.slf4j-api-1.7.25.jar:/opt/cortex/lib/org.slf4j.jul-to-slf4j-1.7.25.jar:/opt/cortex/lib/org.slf4j.jcl-over-slf4j-1.7.25.jar:/opt/cortex/lib/com.typesafe.play.play-streams_2.12-2.6.20.jar:/opt/cortex/lib/org.reactivestreams.reactive-streams-1.0.2.jar:/opt/cortex/lib/com.typesafe.akka.akka-stream_2.12-2.5.17.jar:/opt/cortex/lib/com.typesafe.akka.akka-actor_2.12-2.5.17.jar:/opt/cortex/lib/com.typesafe.config-1.3.3.jar:/opt/cortex/lib/org.scala-lang.modules.scala-java8-compat_2.12-0.8.0.jar:/opt/cortex/lib/com.typesafe.akka.akka-protobuf_2.12-2.5.17.jar:/opt/cortex/lib/com.typesafe.ssl-config-core_2.12-0.2.4.jar:/opt/cortex/lib/org.scala-lang.modules.scala-parser-combinators_2.12-1.0.6.jar:/opt/cortex/lib/com.typesafe.akka.akka-slf4j_2.12-2.5.17.jar:/opt/cortex/lib/com.fasterxml.jackson.core.jackson-core-2.8.11.jar:/opt/cortex/lib/com.fasterxml.jackson.core.jackson-annotations-2.8.11.jar:/opt/cortex/lib/com.fasterxml.jackson.datatype.jackson-datatype-jdk8-2.8.11.jar:/opt/cortex/lib/com.fasterxml.jackson.core.jackson-databind-2.8.11.1.jar:/opt/cortex/lib/com.fasterxml.jackson.datatype.jackson-datatype-jsr310-2.8.11.jar:/opt/cortex/lib/commons-codec.commons-codec-1.10.jar:/opt/cortex/lib/com.typesafe.play.play-json_2.12-2.6.10.jar:/opt/cortex/lib/com.typesafe.play.play-functional_2.12-2.6.10.jar:/opt/cortex/lib/org.scala-lang.scala-reflect-2.12.7.jar:/opt/cortex/lib/org.typelevel.macro-compat_2.12-1.1.1.jar:/opt/cortex/lib/joda-time.joda-time-2.9.9.jar:/opt/cortex/lib/com.google.guava.guava-22.0.jar:/opt/cortex/lib/com.google.errorprone.error_prone_annotations-2.0.18.jar:/opt/cortex/lib/com.google.j2objc.j2objc-annotations-1.1.jar:/opt/cortex/lib/org.codehaus.mojo.animal-sniffer-annotations-1.14.jar:/opt/cortex/lib/io.jsonwebtoken.jjwt-0.7.0.jar:/opt/cortex/lib/javax.xml.bind.jaxb-api-2.3.0.jar:/opt/cortex/lib/org.apache.commons.commons-lang3-3.6.jar:/opt/cortex/lib/javax.transaction.jta-1.1.jar:/opt/cortex/lib/javax.inject.javax.inject-1.jar:/opt/cortex/lib/com.typesafe.play.filters-helpers_2.12-2.6.20.jar:/opt/cortex/lib/com.typesafe.play.play-logback_2.12-2.6.20.jar:/opt/cortex/lib/ch.qos.logback.logback-classic-1.2.3.jar:/opt/cortex/lib/ch.qos.logback.logback-core-1.2.3.jar:/opt/cortex/lib/com.typesafe.play.play-akka-http-server_2.12-2.6.20.jar:/opt/cortex/lib/com.typesafe.akka.akka-http-core_2.12-10.0.14.jar:/opt/cortex/lib/com.typesafe.akka.akka-parsing_2.12-10.0.14.jar:/opt/cortex/lib/org.apache.logging.log4j.log4j-to-slf4j-2.9.1.jar:/opt/cortex/lib/com.typesafe.play.play-ehcache_2.12-2.6.20.jar:/opt/cortex/lib/com.typesafe.play.play-cache_2.12-2.6.20.jar:/opt/cortex/lib/net.sf.ehcache.ehcache-2.10.4.jar:/opt/cortex/lib/org.ehcache.jcache-1.0.1.jar:/opt/cortex/lib/javax.cache.cache-api-1.0.0.jar:/opt/cortex/lib/com.typesafe.play.play
-ws_2.12-2.6.20.jar:/opt/cortex/lib/com.typesafe.play.play-ws-standalone_2.12-1.1.10.jar:/opt/cortex/lib/com.typesafe.play.play-ws-standalone-xml_2.12-1.1.10.jar:/opt/cortex/lib/com.typesafe.play.play-ws-standalone-json_2.12-1.1.10.jar:/opt/cortex/lib/com.typesafe.play.play-guice_2.12-2.6.20.jar:/opt/cortex/lib/com.google.inject.guice-4.1.0.jar:/opt/cortex/lib/aopalliance.aopalliance-1.0.jar:/opt/cortex/lib/com.google.inject.extensions.guice-assistedinject-4.1.0.jar:/opt/cortex/lib/net.codingwell.scala-guice_2.12-4.1.0.jar:/opt/cortex/lib/com.google.inject.extensions.guice-multibindings-4.1.0.jar:/opt/cortex/lib/com.google.code.findbugs.jsr305-3.0.1.jar:/opt/cortex/lib/org.thehive-project.elastic4play_2.12-1.7.2.jar:/opt/cortex/lib/com.typesafe.play.play-akka-http2-support_2.12-2.6.20.jar:/opt/cortex/lib/com.typesafe.akka.akka-http2-support_2.12-10.0.14.jar:/opt/cortex/lib/com.twitter.hpack-1.0.2.jar:/opt/cortex/lib/org.eclipse.jetty.alpn.alpn-api-1.1.3.v20160715.jar:/opt/cortex/lib/com.sksamuel.elastic4s.elastic4s-core_2.12-5.6.6.jar:/opt/cortex/lib/com.sksamuel.exts.exts_2.12-1.44.0.jar:/opt/cortex/lib/org.typelevel.cats_2.12-0.9.0.jar:/opt/cortex/lib/org.typelevel.cats-macros_2.12-0.9.0.jar:/opt/cortex/lib/com.github.mpilquist.simulacrum_2.12-0.10.0.jar:/opt/cortex/lib/org.typelevel.machinist_2.12-0.6.1.jar:/opt/cortex/lib/org.typelevel.cats-kernel_2.12-0.9.0.jar:/opt/cortex/lib/org.typelevel.cats-kernel-laws_2.12-0.9.0.jar:/opt/cortex/lib/org.scalacheck.scalacheck_2.12-1.13.4.jar:/opt/cortex/lib/org.scala-sbt.test-interface-1.0.jar:/opt/cortex/lib/org.typelevel.discipline_2.12-0.7.2.jar:/opt/cortex/lib/org.typelevel.catalysts-platform_2.12-0.0.5.jar:/opt/cortex/lib/org.typelevel.catalysts-macros_2.12-0.0.5.jar:/opt/cortex/lib/org.typelevel.cats-core_2.12-0.9.0.jar:/opt/cortex/lib/org.typelevel.cats-laws_2.12-0.9.0.jar:/opt/cortex/lib/org.typelevel.cats-free_2.12-0.9.0.jar:/opt/cortex/lib/org.typelevel.cats-jvm_2.12-0.9.0.jar:/opt/cortex/lib/org.apache.lucene.lucene-core-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-analyzers-common-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-backward-codecs-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-grouping-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-highlighter-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-join-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-memory-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-misc-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-queries-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-queryparser-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-sandbox-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-spatial-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-spatial-extras-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-spatial3d-6.6.1.jar:/opt/cortex/lib/org.apache.lucene.lucene-suggest-6.6.1.jar:/opt/cortex/lib/net.sf.jopt-simple.jopt-simple-5.0.2.jar:/opt/cortex/lib/com.carrotsearch.hppc-0.7.1.jar:/opt/cortex/lib/org.yaml.snakeyaml-1.15.jar:/opt/cortex/lib/com.fasterxml.jackson.dataformat.jackson-dataformat-smile-2.8.6.jar:/opt/cortex/lib/com.fasterxml.jackson.dataformat.jackson-dataformat-yaml-2.8.6.jar:/opt/cortex/lib/com.fasterxml.jackson.dataformat.jackson-dataformat-cbor-2.8.6.jar:/opt/cortex/lib/org.hdrhistogram.HdrHistogram-2.1.9.jar:/opt/cortex/lib/org.apache.logging.log4j.log4j-api-2.9.1.jar:/opt/cortex/lib/org.elasticsearch.jna-4.4.0-1.jar:/opt/cortex/lib/org.locationtech.spatial4j.spatial4j-0.6.jar:/opt/cortex/lib/com.vividsolutions.jts-1.13.jar:/opt/co
rtex/lib/com.sksamuel.elastic4s.elastic4s-streams_2.12-5.6.6.jar:/opt/cortex/lib/com.sksamuel.elastic4s.elastic4s-tcp_2.12-5.6.6.jar:/opt/cortex/lib/io.netty.netty-all-4.1.10.Final.jar:/opt/cortex/lib/org.elasticsearch.client.transport-5.6.2.jar:/opt/cortex/lib/org.elasticsearch.plugin.transport-netty3-client-5.6.2.jar:/opt/cortex/lib/io.netty.netty-3.10.6.Final.jar:/opt/cortex/lib/io.netty.netty-buffer-4.1.13.Final.jar:/opt/cortex/lib/io.netty.netty-codec-4.1.13.Final.jar:/opt/cortex/lib/io.netty.netty-codec-http-4.1.13.Final.jar:/opt/cortex/lib/io.netty.netty-common-4.1.13.Final.jar:/opt/cortex/lib/io.netty.netty-handler-4.1.13.Final.jar:/opt/cortex/lib/io.netty.netty-resolver-4.1.13.Final.jar:/opt/cortex/lib/io.netty.netty-transport-4.1.13.Final.jar:/opt/cortex/lib/org.elasticsearch.plugin.reindex-client-5.6.2.jar:/opt/cortex/lib/org.elasticsearch.client.elasticsearch-rest-client-5.6.2.jar:/opt/cortex/lib/org.apache.httpcomponents.httpclient-4.5.2.jar:/opt/cortex/lib/org.apache.httpcomponents.httpcore-4.4.5.jar:/opt/cortex/lib/org.apache.httpcomponents.httpasyncclient-4.1.2.jar:/opt/cortex/lib/org.apache.httpcomponents.httpcore-nio-4.4.5.jar:/opt/cortex/lib/commons-logging.commons-logging-1.1.3.jar:/opt/cortex/lib/org.elasticsearch.plugin.lang-mustache-client-5.6.2.jar:/opt/cortex/lib/com.github.spullara.mustache.java.compiler-0.9.3.jar:/opt/cortex/lib/org.elasticsearch.plugin.percolator-client-5.6.2.jar:/opt/cortex/lib/org.elasticsearch.plugin.parent-join-client-5.6.2.jar:/opt/cortex/lib/org.apache.logging.log4j.log4j-1.2-api-2.6.2.jar:/opt/cortex/lib/com.tdunning.t-digest-3.1.jar:/opt/cortex/lib/com.sksamuel.elastic4s.elastic4s-xpack-security_2.12-5.6.6.jar:/opt/cortex/lib/org.elasticsearch.client.x-pack-transport-5.6.2.jar:/opt/cortex/lib/org.elasticsearch.plugin.x-pack-api-5.6.2.jar:/opt/cortex/lib/com.unboundid.unboundid-ldapsdk-3.2.0.jar:/opt/cortex/lib/org.bouncycastle.bcprov-jdk15on-1.58.jar:/opt/cortex/lib/org.bouncycastle.bcpkix-jdk15on-1.55.jar:/opt/cortex/lib/com.googlecode.owasp-java-html-sanitizer.owaspjava-html-sanitizer-r239.jar:/opt/cortex/lib/com.sun.mail.javax.mail-1.5.3.jar:/opt/cortex/lib/javax.activation.activation-1.1.jar:/opt/cortex/lib/org.elasticsearch.client.elasticsearch-rest-client-sniffer-5.6.2.jar:/opt/cortex/lib/net.sf.supercsv.super-csv-2.4.0.jar:/opt/cortex/lib/org.scalactic.scalactic_2.12-3.0.5.jar:/opt/cortex/lib/com.floragunn.search-guard-ssl-5.6.9-23.jar:/opt/cortex/lib/org.elasticsearch.plugin.transport-netty4-client-5.6.9.jar:/opt/cortex/lib/org.elasticsearch.elasticsearch-5.6.9.jar:/opt/cortex/lib/org.elasticsearch.securesm-1.2.jar:/opt/cortex/lib/org.reflections.reflections-0.9.11.jar:/opt/cortex/lib/org.javassist.javassist-3.21.0-GA.jar:/opt/cortex/lib/net.lingala.zip4j.zip4j-1.3.2.jar:/opt/cortex/lib/org.thehive-project.cortex-2.1.3-1-assets.jar play.core.server.ProdServerStart
I will run the 'continuous monitoring' command you've outlined above and post the details here after a few hours.
htop seems to mix up threads and processes: htop shows you each thread, ps only shows processes.
I was just reading http://ask.xmodulo.com/view-threads-process-linux.html on that issue.
To hide a process's threads in htop, press H.
But what the hell does htop mean with unique PIDs per thread ... stupid thing.
Yeah, but at this point I'd rather exclude evidence from htop as it seems to be manufacturing what we think are issues. This way the thread doesn't turn into 'how to use htop' haha
For now I'll continue to monitor resource usage over time using ps
Just an idea: using ps -T -o "vsz rss comm" -p {PID}
you can monitor the number of threads as well as the memory consumption. You'd potentially have to run it twice, once with the PID of TheHive and once with Cortex (unless you're sure it's a Cortex problem).
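For example, pointed at the Cortex PID seen earlier in this thread and refreshed every minute (just a sketch; substitute the current PID):
watch -n 60 'ps -T -o "vsz rss comm" -p 25670'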
Good luck tracing.
You could also attach jconsole to the JVM and run analysis there, but I don't know more about how to do this. I only know of the existence of that possibility.
Oh, and of course: track the output of free. This way you get proof that you actually run out of memory.
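A minimal sketch for that (hypothetical log path; adjust the interval to taste):
# append a timestamped snapshot of free every 5 minutes
while true; do date '+%F %T' >> /tmp/free.log; free -m >> /tmp/free.log; sleep 300; done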
No failures, but do have a steady increase in RSS usage by Cortex, and a clear decrease in Available memory reported by 'free'
Time | VSZ | RSS | Available Mem | Note
---|---|---|---|---
9:00 | 8019672 | 1158980 | 7989028 | After running for hours
10:00 | 7942628 | 913624 | 11327040 | post-reboot
11:00 | 7949512 | 1224654 | 10123448 |
12:00 | 7959724 | 1569408 | 9132816 |
13:00 | missed | missed | missed |
14:00 | 7959724 | 1802904 | 8798216 |
15:00 | | | |
16:00 | | | |
17:00 | | | |
I'm going to let it keep going, and my expectation is that it will fail once the server runs out of memory, causing the services to crash
See attached images. I have 2 graphs of 'free' data: 5 days (top) and 4 hours (bottom).
The 4hr image confirms a steady increase in memory usage, as well as cache.
The 5dy image shows a few service restarts, and a flatline where Cortex was disabled but Elastic+TheHive were both enabled, followed by a crash lasting multiple hours, then today's data.
I think this definitely confirms some sort of memory leak in Cortex.service. Now we just need to know how to prevent it.
Sorry, I disagree with you.
As long as your machine has free memory, it's totally normal that (a) your cache grows and (b) rss tends to reach vsz. This is normal and expected.
From your graphs, especially the lower one, you can clearly see that used is stable, which means the memory allocated by processes does not grow. Good. And you see that cache replaces free, which is normal for filesystem caching.
Free memory is used as buffer cache (filesystem blocks that are read or written are held in memory as long as possible to avoid re-reading them from disk, which fills the buffer cache at the cost of free over the time the system is running).
And vsz is the virtual process size, whereas rss is the amount of vsz that is actually kept in RAM. Ideally, vsz = rss, which means the entire virtual address space of the process is in memory.
Keep on measuring. You should expect free to shrink to a certain value and not beyond it. If your used is stable, you do not have a memory leak. If you had a leak, used would fill up the entire memory of the host, and then, by tracking vsz, you would also see which process grows.
I hope that helps. Please provide memory charts (those are great!) from the time your system is in trouble.
BTW, I have some trouble with the upper chart, because it does not match the data from the lower one. Both cover the time frame from ~10.00 to 14.00, but they do not match. Do you have an explanation for that?
Thinking ahead to mitigate your problems.
The JVMs of TheHive and Cortex run with the default settings for the heap size. These are set using the -Xms and -Xmx options of the JVM (Source).
To determine the default settings, would you please run the following?
java -XX:+PrintFlagsFinal -version | grep -iE 'HeapSize|PermSize|ThreadStackSize'
And would you please also provide the settings configured for Elasticsearch? If you didn't tweak the settings, you will probably have 2 GB for ES:
root@hive:~# ps -ef|grep ^elastic
elastic+ 1234 1 0 Aug13 ? 00:10:24 /usr/bin/java -Xms2g -Xmx2g [...]
With this as a starting point, and assuming you have a dedicated TheHive/Cortex server, we could think about increasing the heap size for Cortex by setting -Xmx to a higher value than the default.
I'm thinking of [Host Memory] - 2 GB (for OS) - [ES Memory] - [JVM default max heap size for TheHive], and taking the rest for Cortex.
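(For example, with the numbers in this thread: 16 GB host - 2 GB OS - 2 GB ES - roughly 4 GB default TheHive max heap leaves about 8 GB for Cortex.)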
But let me be clear on this right away: I'm no Java specialist, much more an OS guy. Tuning the JVM's memory is more of an educated guess.
More questions: do you heavily use Cortex? How many analyzer runs do you do each day, and what is the setting of cache.job in /etc/cortex/application.conf?
I think you might analyze heavily, with a long caching time (global or per analyzer). That would fill up Cortex's memory. Decreasing the caching time could then help, as well as increasing the JVM's heap size as described before.
TIL my working knowledge of RAM was pretty far off. Thanks for the detailed explanation!! I'll answer your questions in order:
Graphs don't match:
The top graph covers Aug 9 00:00 - Aug 13 14:30 Bottom graph covers Aug 13 10:00 - Aug 13 14:30
The small sliver at the end of the top graph is the same data as the bottom graph. Bottom graph is just exploded to be easier to view.
FYI, the charts were created using Elasticsearch, Kibana and Beats agents (all free!). We happen to be doing a POC and were using our Hive server as one of the test devices, hence accidentally having days of data on hand.
The results of Java config:
ckinsel@hiveapp1:~$ java -XX:+PrintFlagsFinal -version | grep -iE 'HeapSize|PermSize|ThreadStackSize'
intx CompilerThreadStackSize = 0 {pd product}
uintx ErgoHeapSizeLimit = 0 {product}
uintx HeapSizePerGCThread = 87241520 {product}
uintx InitialHeapSize := 264241152 {product}
uintx LargePageHeapSizeThreshold = 134217728 {product}
uintx MaxHeapSize := 4206886912 {product}
intx ThreadStackSize = 1024 {pd product}
intx VMThreadStackSize = 1024 {pd product}
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)
The results of Elastic config: (Default 2gb)
elastic+ 1326 1 1 Aug13 ? 00:14:55 /usr/bin/java -Xms2g -Xmx2g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Djdk.io.permissionsUseCanonicalPath=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -XX:+HeapDumpOnOutOfMemoryError -Des.path.home=/usr/share/elasticsearch -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet -Edefault.path.logs=/var/log/elasticsearch -Edefault.path.data=/var/lib/elasticsearch -Edefault.path.conf=/etc/elasticsearch
Cortex usage (low)
We're still relatively green with cortex, and run less than 50 analyzers per day, probably closer to 20. And those mostly consist of Virustotal\Urlscan lookups.
Cortex cache.job value (default)
## Cache
#
# If an analyzer is executed against the same observable, the previous report can be returned without re-executing the
# analyzer. The cache is used only if the second job occurs within cache.job (the default is 10 minutes).
cache.job = 10 minutes
Alright, I'll continue to monitor the 'free' data, the same manual chart for cortex.service, and more of those Elasticsearch charts; specifically looking to see whether Used memory grows or stays steady.
Also, haven't had a cortex\thehive crash in about 48 hours. So it's either due, or has somehow solved itself.
Graph 3:
This data is a continuation of graph 2, and covers Aug 13 10:00 - 20:00. The cutoff happens because we're changing some config settings on the logging agents which make the graphs possible, but doesn't correspond to any issue on the server related to cortex.
It does show that Cache eventually used up all of the Free memory, but it also shows Used memory slowly increasing as well.
Unfortunately we needed to reboot the Hive server to apply patches, which has essentially reset our clock. I'll continue to monitor and paste results here when we eventually get another cortex failure.
I agree with your interpretation of the graph. You see free runs down to a minimal value which is then kept, and the buffer cache is primarily responsible for that. And yes, obviously the used mem grows.
A few words on the Java memory stuff. The default value for maxheapsize is 4 GB. I understand this as: the JVM will not go beyond this point. 4 GB for Hive + 4 GB for Cortex + 2 GB for Elastic leaves 6 GB for all the rest, which is plenty of RAM.
The default setting for caching the analyzer results is fine in combination with 20 or 50 analyses per day, so I don't think excessive caching is the reason for Cortex to run into OutOfMemoryError: Java heap space.
I went back to the beginning of the thread and re-read what your problem is. To me it looks as if Cortex's and TheHive's JVMs run out of memory. There are two possibilities for that:
I think the latter is the case, so you should continue recording the per-process memory allocation.
I would recommend running a cron job which records, for every process ID on the host, the vsz and rss. The goal is to find the process whose vsz and/or rss grows beyond any limit.
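A minimal sketch of such a collector (hypothetical path and schedule; adjust as needed):
#!/bin/bash
# /usr/local/bin/psmem-snapshot (hypothetical path)
# append a timestamped vsz/rss snapshot of every process to a log file
echo "### $(date '+%F %T')" >> /var/log/psmem.log
ps -eo pid,vsz,rss,comm --no-headers --sort=-rss >> /var/log/psmem.log
An /etc/cron.d entry such as */10 * * * * root /usr/local/bin/psmem-snapshot would then record a snapshot every ten minutes.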
Thank you for sticking with me up to this point. I'll create a cron job to measure this and attempt to find the rogue process, but I still think it's Cortex.
In the very first graph I attached, the memory flatline was at a time when the server was operating normally, except with the cortex service disabled. Once I re-enabled cortex, the memory started a constant downward slope.
In theory, if I were to disable Cortex and not see 'used' continuously grow over time, then re-enable it and see constant growth, I've identified the process, correct?
Below image is Aug 14 8:00-Aug 15 8:00
I am going to write the cron job, and disable cortex again for 2 hours, then re-enable. If we see a flatline, then a steady decrease again after being enabled, we have the culprit.
I believe your assumption is right. If you shut down Cortex and you have a steady used, you have actually identified a growing process. The question is: is this growth normal?
I believe it's abnormal, as the 'out-of-memory' issue has not been observed by us in the past, and we've not made any configuration changes to TheHive\Cortex for months. Unfortunately my system resource usage statistics started logging after we first noticed the issue, so I'm unable to look back at my actual performance metrics.
Actually, you would be a good candidate to benchmark against, since you have a healthy Cortex instance. How much memory does your cortex.service utilize while running normally? Does it grow over hours and then settle around 2-4 GB? Or stay fairly constant after starting?
Notable timestamps:
00:00 - 08:00: Steady increase in memory usage
08:00: Stop cortex.service
08:00 - 10:30: Steady memory usage
10:30: Start cortex.service
10:30 - 14:00: Steady increase in memory usage
14:00: Restart thehive.service and elasticsearch.service
14:00 - 03:00: Steady increase in memory usage
This pegs the memory growth to cortex.service. The other Java processes were executing normally while Cortex was down, with no growth. Then, immediately after enabling it again, the growth began again.
As you've stated, it's possible that Cortex.service memory usage growing toward the Java heap size is normal, so I'm still interested in comparison with a healthy cortex environment.
I only have Cortex/Hive on an unmonitored test server, and there is only very little done with it. As said, it's a test server.
I will set up memory reporting like yours, but I'm not too optimistic we'll see something of interest.
OK, here are a few hours of memory allocation on my test machine, starting at 8.15am and ending 1.15pm.
The machine hosts ElasticSearch 5.6, TheHive 3.4.0-RC2, Cortex 3.0.0-RC4 and MISP (with its database). The first little spike was a restart of Cortex, then the machine was idle for quite some time. At the end, I ran about 350 analyses on observables and looked them up in MISP.
I cannot tell if the growth of used is due to caching in Cortex or database caching from MISP. So this was probably not a well-thought-out analyzer run on my part :-(
You can't see much, because I did not track vsz and rss for each of Hive/Cortex/ES/MySQL, but you can see that when the machine sits idle, without any user doing anything, it does not leak memory. There are spikes, but they return to their original value.
I now additionally run the following data collector for the memory consumption of Cortex, TheHive and ES:
#!/bin/bash
# Collect the current memory footprint (vsz and rss) of the elasticsearch,
# thehive and cortex processes and print them on a single line.
TEMP=`mktemp`
# grab user/vsz/rss for the three services; sorting by user gives a stable
# column order (cortex, elastic, thehive)
ps -e -o "user vsz rss" | egrep 'elastic|thehive|cortex' | sort >$TEMP
# print "vsz<TAB>rss<TAB>" for each matched process, then end the line
awk '{printf("%d\t%d\t",$2,$3)}' < $TEMP
echo
rm $TEMP
running it gives the following output:
root@hive:~/bin# ./ppmemcollect
5905540 633468 6071136 2678580 5905792 641388
The first two columns are vsz and rss for Cortex, the next two are for ES, and the last two are for TheHive. I'll let this run for 24h and see how the memory consumption of the JVMs develops.
Very interesting. Your chart is totally what I would expect from the system after being online for a while.
You are also running a newer version of TheHive, and a much newer version of Cortex. I'm going to clone and test these updates to see if maybe this is addressed by one of them, and I would also like to test increasing the heap size to 6-8 GB instead.
I would really stick with the default values. These fit very well on my 8 GB machine, so they should certainly match your host too. You could increase memory to tune your installation once your problems are gone, but better not now.
And from the lower chart you can see that more memory will likely do nothing, as RSS is significantly lower than VSZ. The only beneficial tuning I see would be increasing memory for Elasticsearch if you have a huge database.
However ...
Here are the newest charts.
14:00 - Cortex restart
15:00 - DNS lookups for 390 IP addresses
23:00 - Cortex restart
00:00 up to recent - every minute, one random IP address is looked up.
And the memory consumption looks beautiful:
Now the virtual process sizes (vsz) and the actual memory usage (rss) for Cortex, ES and TheHive:
Everything is smooth and flat, except for Cortex, which has a very low RSS at the beginning that grows to a somewhat fixed value.
Haha those charts are very nice.
Last 24 hours:
Summary: memory usage increases without fail. TheHive\Cortex see almost zero usage after hours, yet we are still seeing the increase.
This morning, none of the services have failed:
ckinsel@hiveapp1:~$ sudo systemctl status thehive.service
[sudo] password for ckinsel:
● thehive.service - TheHive
Loaded: loaded (/usr/lib/systemd/system/thehive.service; enabled; vendor pres
Active: active (running) since Mon 2019-08-19 08:37:52 EDT; 23h ago
Docs: https://thehive-project.org
Main PID: 13657 (java)
Tasks: 73
Memory: 4.4G
CPU: 19min 11.439s
CGroup: /system.slice/thehive.service
└─13657 java -Duser.dir=/opt/thehive -Dconfig.file=/etc/thehive/appli
Warning: Journal has been rotated since unit was started. Log output is incomple
lines 1-12/12 (END)
ckinsel@hiveapp1:~$ sudo systemctl status cortex.service
● cortex.service - cortex
Loaded: loaded (/etc/systemd/system/cortex.service; enabled; vendor preset: e
Active: active (running) since Mon 2019-08-19 08:51:40 EDT; 23h ago
Docs: https://thehive-project.org
Main PID: 17618 (java)
Tasks: 47
Memory: 3.4G
CPU: 2h 43min 31.997s
CGroup: /system.slice/cortex.service
└─17618 java -Duser.dir=/opt/cortex -Dconfig.file=/etc/cortex/applica
Warning: Journal has been rotated since unit was started. Log output is incomple
lines 1-12/12 (END)
ckinsel@hiveapp1:~$ sudo systemctl status elasticsearch.service
● elasticsearch.service - Elasticsearch
Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; enabled; vendo
Active: active (running) since Wed 2019-08-14 09:08:45 EDT; 5 days ago
Docs: http://www.elastic.co
Main PID: 1331 (java)
Tasks: 78
Memory: 2.5G
CPU: 1h 27min 18.434s
CGroup: /system.slice/elasticsearch.service
└─1331 /usr/bin/java -Xms2g -Xmx2g -XX:+UseConcMarkSweepGC -XX:CMSIni
Warning: Journal has been rotated since unit was started. Log output is incomple
ckinsel@hiveapp1:~$ ps -e -o "user vsz rss" | egrep 'elastic|thehive|cortex'
elastic+ 6932032 2676004
thehive 8151512 4634072
cortex 7969652 3682164
At this point I'm really not sure what direction to go. I'll track vsz and rss for TheHive + Cortex after the next crash (from a fresh start) so that I can compare against your observations described above.
I know for sure that by stopping cortex.service I immediately stop the growth in memory usage, so whatever growth is happening, it is definitely related to Cortex running somehow.
If you monitor vsz and rss, I'd expect to see rss growing towards vsz until it reaches its limit, and then, let me guess, Cortex crashes. Based on the assumption that Cortex is the candidate, as you said before.
I believe your assumption, but as an IT pro I can only be sure once I've seen the vsz and rss logs. :-)
From my point of view, you have various options, with various amounts of effort.
1) Change your JVM. If you run OpenJDK, switch to the Oracle JVM and vice versa, to prove the JVM is guilty/not guilty. Maybe the latest update to your JVM introduced a memory leak. This would be my favorite.
2) Install jconsole and use the JVM-internal tools for monitoring memory consumption (there, I'm totally out of my depth; I have no clue how this works).
3) Upgrade your Cortex. The 3.0.0-RC4 I run did not cause any problems for me, but it's only a test environment. Your mileage may vary when you're in production.
4) Spill fuel over your server and light up the whole thing. This will probably fix the issue for quite a long time - at least for you -, depending on your local laws and the period of imprisonment you will get for fire-starting. No, wait ... don't do it!
Working on the vsz\rss stats now. For reference, our java -version output (spoiler: it's OpenJDK):
ckinsel@hiveapp1:~$ java -version
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)
As for 4. , I do have my red stapler with me...
Well. openjdk version "1.8.0_222"
? Version one-dot-eight???
I run
openjdk version "11.0.3" 2019-04-16
OpenJDK Runtime Environment (build 11.0.3+7-Ubuntu-1ubuntu218.04.1)
OpenJDK 64-Bit Server VM (build 11.0.3+7-Ubuntu-1ubuntu218.04.1, mixed mode, sharing)
Running sudo update-alternatives --config java reports that OpenJDK 8 headless is actually what's installed.
We've updated to openjdk 11 to match yours, and started everything. A very early look does not seem to show measurable RAM usage over time so far. I'll be out of office until Monday. If we are able to run with no issues until then, I'll consider this issue solved, with updating Java being the solution.
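For reference, the switch is essentially something like the following (a sketch; openjdk-11-jre-headless is the usual Debian/Ubuntu package name and may require a backport or PPA on 16.04):
sudo apt-get install openjdk-11-jre-headless
sudo update-alternatives --config java   # select the new runtime as the default java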
So, installing the newer OpenJDK 11 definitely changed the situation. TheHive + Cortex have been running for 5 days with no crashes, and only a very minor increase in memory usage over time (somewhat expected, as there is still quite a large pool of free memory).
Below graph represents 5 days of data
I'm going to let things execute, and if I get no issues, will resolve this case by the end of the week.
@cameronkinsel , can this issue be closed?
@github-pba I have been working with Cameron on this. After updating Java, the server no longer crashes or sees a slow/rapid loss of memory. Yet now it has started consuming so much CPU time that everything just stops running. As you can see below, we are just maxing out the CPU with TheHive. We ensured that all patches were installed yesterday and even upgraded TheHive to version 3.4.0-1. Any thoughts?
james.cribbs@hiveapp1:~$ sudo systemctl status thehive.service
[sudo] password for james.cribbs:
Sorry, try again.
[sudo] password for james.cribbs:
● thehive.service - TheHive
Loaded: loaded (/usr/lib/systemd/system/thehive.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2019-09-12 18:06:06 UTC; 1h 15min ago
Docs: https://thehive-project.org
Main PID: 26705 (java)
Tasks: 56
Memory: 797.6M
CPU: 4h 7min 39.294s
CGroup: /system.slice/thehive.service
└─26705 java -Duser.dir=/opt/thehive -Dconfig.file=/etc/thehive/application.conf -Dlogger.file=/etc/thehive/logback.xml -Dpidfile.path=/dev/null -cp /opt/thehive/lib/../conf/:/opt/thehive/lib/org.thehive-project.thehive-3.4.0-1
Sep 12 18:06:06 hiveapp1 systemd[1]: Started TheHive.
ps -eo pcpu,pid,user,args | sort -k1 -r -n | head -10
323 26705 thehive java -Duser.dir=/opt/thehive -Dconfig.file=/etc/thehive/application.conf -Dlogger.file=/etc/thehive/logback.xml -Dpidfile.path=/dev/null -cp /opt/thehive/lib/../conf/:/opt/thehive/lib/org.thehive-project.thehive-3.4.0-1-sans-externalized.jar:/opt/thehive/lib/org.thehive-project.thehivebackend-3.4.0-1.jar:/opt/thehive/lib/org.thehive-project.thehivemisp-3.4.0-1.jar:/opt/thehive/lib/org.thehive-project.thehivecortex-3.4.0-1.jar:/opt/thehive/lib/org.scala-lang.scala-library-2.12.6.jar:/opt/thehive/lib/com.typesafe.play.twirl-api_2.12-1.3.15.jar:/opt/thehive/lib/org.scala-lang.modules.scala-xml_2.12-1.0.6.jar:/opt/thehive/lib/com.typesafe.play.play-server_2.12-2.6.23.jar:/opt/thehive/lib/com.typesafe.play.play_2.12-2.6.23.jar:/opt/thehive/lib/com.typesafe.play.build-link-2.6.23.jar:/opt/thehive/lib/com.typesafe.play.play-exceptions-2.6.23.jar:/opt/thehive/lib/com.typesafe.play.play-netty-utils-2.6.23.jar:/opt/thehive/lib/org.slf4j.slf4j-api-1.7.25.jar:/opt/thehive/lib/org.slf4j.jul-to-slf4j-1.7.25.jar:/opt/thehive/lib/org.slf4j.jcl-over-slf4j-1.7.25.jar:/opt/thehive/lib/com.typesafe.play.play-streams_2.12-2.6.23.jar:/opt/thehive/lib/org.reactivestreams.reactive-streams-1.0.2.jar:/opt/thehive/lib/com.typesafe.akka.akka-stream_2.12-2.5.21.jar:/opt/thehive/lib/com.typesafe.akka.akka-actor_2.12-2.5.21.jar:/opt/thehive/lib/com.typesafe.config-1.3.3.jar:/opt/thehive/lib/org.scala-lang.modules.scala-java8-compat_2.12-0.8.0.jar:/opt/thehive/lib/com.typesafe.akka.akka-protobuf_2.12-2.5.21.jar:/opt/thehive/lib/com.typesafe.ssl-config-core_2.12-0.3.7.jar:/opt/thehive/lib/com.typesafe.akka.akka-slf4j_2.12-2.5.21.jar:/opt/thehive/lib/com.fasterxml.jackson.datatype.jackson-datatype-jdk8-2.8.11.jar:/opt/thehive/lib/com.fasterxml.jackson.datatype.jackson-datatype-jsr310-2.8.11.jar:/opt/thehive/lib/commons-codec.commons-codec-1.11.jar:/opt/thehive/lib/com.typesafe.play.play-json_2.12-2.6.12.jar:/opt/thehive/lib/com.typesafe.play.play-functional_2.12-2.6.12.jar:/opt/thehive/lib/org.scala-lang.scala-reflect-2.12.6.jar:/opt/thehive/lib/org.typelevel.macro-compat_2.12-1.1.1.jar:/opt/thehive/lib/joda-time.joda-time-2.9.9.jar:/opt/thehive/lib/org.checkerframework.checker-compat-qual-2.0.0.jar:/opt/thehive/lib/com.google.errorprone.error_prone_annotations-2.1.3.jar:/opt/thehive/lib/com.google.j2objc.j2objc-annotations-1.1.jar:/opt/thehive/lib/org.codehaus.mojo.animal-sniffer-annotations-1.14.jar:/opt/thehive/lib/io.jsonwebtoken.jjwt-0.7.0.jar:/opt/thehive/lib/javax.xml.bind.jaxb-api-2.3.1.jar:/opt/thehive/lib/javax.activation.javax.activation-api-1.2.0.jar:/opt/thehive/lib/org.apache.commons.commons-lang3-3.6.jar:/opt/thehive/lib/javax.transaction.jta-1.1.jar:/opt/thehive/lib/javax.inject.javax.inject-1.jar:/opt/thehive/lib/com.typesafe.play.filters-helpers_2.12-2.6.23.jar:/opt/thehive/lib/com.typesafe.play.play-logback_2.12-2.6.23.jar:/opt/thehive/lib/ch.qos.logback.logback-classic-1.2.3.jar:/opt/thehive/lib/ch.qos.logback.logback-core-1.2.3.jar:/opt/thehive/lib/com.typesafe.play.play-akka-http-server_2.12-2.6.23.jar:/opt/thehive/lib/com.typesafe.akka.akka-http-core_2.12-10.0.15.jar:/opt/thehive/lib/com.typesafe.akka.akka-parsing_2.12-10.0.15.jar:/opt/thehive/lib/org.apache.logging.log4j.log4j-to-slf4j-2.9.1.jar:/opt/thehive/lib/com.typesafe.play.play-ehcache_2.12-2.6.23.jar:/opt/thehive/lib/com.typesafe.play.play-cache_2.12-2.6.23.jar:/opt/thehive/lib/net.sf.ehcache.ehcache-2.10.6.jar:/opt/thehive/lib/org.ehcache.jcache-1.0.1.jar:/opt/thehive/lib/javax.cache.cache-api-1.0.0.jar:/opt/thehive/lib/com
.typesafe.play.play-ws_2.12-2.6.23.jar:/opt/thehive/lib/com.typesafe.play.play-ws-standalone_2.12-1.1.13.jar:/opt/thehive/lib/com.typesafe.play.play-ws-standalone-xml_2.12-1.1.13.jar:/opt/thehive/lib/com.typesafe.play.play-ws-standalone-json_2.12-1.1.13.jar:/opt/thehive/lib/com.typesafe.play.play-ahc-ws_2.12-2.6.23.jar:/opt/thehive/lib/com.typesafe.play.play-ahc-ws-standalone_2.12-1.1.13.jar:/opt/thehive/lib/com.typesafe.play.cachecontrol_2.12-1.1.4.jar:/opt/thehive/lib/org.scala-lang.modules.scala-parser-combinators_2.12-1.1.0.jar:/opt/thehive/lib/org.joda.joda-convert-1.9.2.jar:/opt/thehive/lib/com.typesafe.play.shaded-asynchttpclient-1.1.13.jar:/opt/thehive/lib/com.typesafe.play.shaded-oauth-1.1.13.jar:/opt/thehive/lib/com.typesafe.play.play-guice_2.12-2.6.23.jar:/opt/thehive/lib/aopalliance.aopalliance-1.0.jar:/opt/thehive/lib/com.google.inject.extensions.guice-assistedinject-4.1.0.jar:/opt/thehive/lib/net.codingwell.scala-guice_2.12-4.2.3.jar:/opt/thehive/lib/com.google.inject.guice-4.2.2.jar:/opt/thehive/lib/com.google.guava.guava-25.1-android.jar:/opt/thehive/lib/com.google.code.findbugs.jsr305-3.0.2.jar:/opt/thehive/lib/org.thehive-project.elastic4play_2.12-1.11.5.jar:/opt/thehive/lib/com.typesafe.play.play-akka-http2-support_2.12-2.6.23.jar:/opt/thehive/lib/com.typesafe.akka.akka-http2-support_2.12-10.0.15.jar:/opt/thehive/lib/com.twitter.hpack-1.0.2.jar:/opt/thehive/lib/org.eclipse.jetty.alpn.alpn-api-1.1.3.v20160715.jar:/opt/thehive/lib/com.sksamuel.elastic4s.elastic4s-core_2.12-6.5.1.jar:/opt/thehive/lib/com.sksamuel.exts.exts_2.12-1.60.0.jar:/opt/thehive/lib/com.fasterxml.jackson.core.jackson-core-2.9.6.jar:/opt/thehive/lib/com.fasterxml.jackson.core.jackson-databind-2.9.6.jar:/opt/thehive/lib/com.fasterxml.jackson.module.jackson-module-scala_2.12-2.9.6.jar:/opt/thehive/lib/com.fasterxml.jackson.core.jackson-annotations-2.9.6.jar:/opt/thehive/lib/com.fasterxml.jackson.module.jackson-module-paranamer-2.9.6.jar:/opt/thehive/lib/com.thoughtworks.paranamer.paranamer-2.8.jar:/opt/thehive/lib/com.sksamuel.elastic4s.elastic4s-http-streams_2.12-6.5.1.jar:/opt/thehive/lib/com.sksamuel.elastic4s.elastic4s-http_2.12-6.5.1.jar:/opt/thehive/lib/org.elasticsearch.client.elasticsearch-rest-client-6.5.2.jar:/opt/thehive/lib/org.apache.httpcomponents.httpclient-4.5.2.jar:/opt/thehive/lib/org.apache.httpcomponents.httpcore-4.4.5.jar:/opt/thehive/lib/org.apache.httpcomponents.httpasyncclient-4.1.2.jar:/opt/thehive/lib/org.apache.httpcomponents.httpcore-nio-4.4.5.jar:/opt/thehive/lib/commons-logging.commons-logging-1.1.3.jar:/opt/thehive/lib/org.scalactic.scalactic_2.12-3.0.5.jar:/opt/thehive/lib/org.bouncycastle.bcprov-jdk15on-1.58.jar:/opt/thehive/lib/net.lingala.zip4j.zip4j-1.3.2.jar:/opt/thehive/lib/org.reflections.reflections-0.9.11.jar:/opt/thehive/lib/org.javassist.javassist-3.21.0-GA.jar:/opt/thehive/lib/com.typesafe.akka.akka-cluster_2.12-2.5.21.jar:/opt/thehive/lib/com.typesafe.akka.akka-remote_2.12-2.5.21.jar:/opt/thehive/lib/io.netty.netty-3.10.6.Final.jar:/opt/thehive/lib/io.aeron.aeron-driver-1.15.1.jar:/opt/thehive/lib/org.agrona.agrona-0.9.31.jar:/opt/thehive/lib/io.aeron.aeron-client-1.15.1.jar:/opt/thehive/lib/com.typesafe.akka.akka-cluster-tools_2.12-2.5.21.jar:/opt/thehive/lib/org.thehive-project.thehive-3.4.0-1-assets.jar play.core.server.ProdServerStart
1.6 1322 elastic+ /usr/bin/java -Xms2g -Xmx2g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Djdk.io.permissionsUseCanonicalPath=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -XX:+HeapDumpOnOutOfMemoryError -Des.path.home=/usr/share/elasticsearch -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet -Edefault.path.logs=/var/log/elasticsearch -Edefault.path.data=/var/lib/elasticsearch -Edefault.path.conf=/etc/elasticsearch
Decided I wanted to dig deeper into this issue myself. I found this article that describes how to extract the thread causing problems in a Java process. Below is the result of this investigation.
1. Determine the process ID (PID) of the affected server process using the following command:
james.cribbs@hiveapp1:~$ top
[...]
29519 thehive 20 0 8054624 0.996g 28720 S 395.0 6.4 575:03.64 java
[...]
2. Determine which thread in the PID identified in step 1 is consuming the CPU:
james.cribbs@hiveapp1:~$ top -n 1 -H -p 29519
top - 17:10:59 up 2 days, 23:55,  2 users,  load average: 8.60, 8.09, 8.02
Threads:  62 total,   9 running,  53 sleeping,   0 stopped,   0 zombie
%Cpu(s): 70.6 us,  0.5 sy,  0.0 ni, 28.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16432288 total,  6473660 free,  5272176 used,  4686452 buff/cache
KiB Swap:   999420 total,   999420 free,        0 used. 10698876 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29607 thehive 20 0 8054624 0.996g 28776 R 68.8 6.4 47:26.80 application-akk
29605 thehive 20 0 8054624 0.996g 28776 R 50.0 6.4 101:52.21 application-akk
29606 thehive 20 0 8054624 0.996g 28776 R 43.8 6.4 47:54.49 application-akk
29616 thehive 20 0 8054624 0.996g 28776 R 43.8 6.4 103:59.84 application-akk
29625 thehive 20 0 8054624 0.996g 28776 R 43.8 6.4 101:35.80 application-akk
29627 thehive 20 0 8054624 0.996g 28776 R 37.5 6.4 47:28.03 application-akk
29628 thehive 20 0 8054624 0.996g 28776 R 37.5 6.4 250:11.96 application-akk
29626 thehive 20 0 8054624 0.996g 28776 R 31.2 6.4 100:51.50 application-akk
29519 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.02 java
29581 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:10.00 java
29582 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:51.41 GC Thread#0
29583 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.00 G1 Main Marker
29584 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:01.28 G1 Conc#0
29585 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.00 G1 Refine#0
29586 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:26.43 G1 Young RemSet
29587 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 1:37.96 VM Thread
29588 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.04 Reference Handl
29589 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.31 Finalizer
29590 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.00 Signal Dispatch
29591 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 3:22.20 C2 CompilerThre
29592 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:28.79 C1 CompilerThre
29593 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 1:05.23 Sweeper thread
29594 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.00 Service Thread
29595 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:46.83 VM Periodic Tas
29596 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.07 Common-Cleaner
29597 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:51.59 GC Thread#1
29598 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:51.33 GC Thread#2
29599 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:51.37 GC Thread#3
29601 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.02 AsyncAppender-W
29602 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.02 AsyncAppender-W
29604 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 3:52.67 application-sch
29610 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:05.33 New I/O worker
29611 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:05.33 New I/O worker
29612 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:05.95 New I/O boss #3
29613 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:05.41 New I/O worker
29614 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:05.51 New I/O worker
29615 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.00 New I/O server
29617 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.00 DEFAULT
29619 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:03.55 pool-1-thread-1
29620 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:04.18 I/O dispatcher
29621 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:04.71 I/O dispatcher
29622 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:26.99 I/O dispatcher
29623 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:21.43 I/O dispatcher
29624 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.08 Statistics Thre
29629 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:25.06 pool-2-thread-1
29630 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:24.85 pool-3-thread-1
29631 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.00 com.google.comm
29632 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:25.85 pool-4-thread-1
29633 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:25.12 pool-5-thread-1
29634 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.55 ObjectCleanerTh
29635 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:04.02 AsyncHttpClient
29636 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:07.75 AsyncHttpClient
29639 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:09.21 application-akk
27420 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.48 AsyncHttpClient
27459 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.43 AsyncHttpClient
27478 thehive 20 0 8054624 0.996g 28776 S 0.0 6.4 0:00.43 AsyncHttpClient
3. Produce a stack trace for the PID identified in step 1 using the following command:
james.cribbs@hiveapp1:~$ sudo jstack 29519 > jstack-output-29519.txt
4. Convert the thread ID identified as problematic in step 2 to a hex value.
`29607` to `73A7`
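One quick way to do that conversion in the shell:
printf '%x\n' 29607   # prints 73a7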
5. Search the stack trace output for this hex value using grep. You are looking for a thread nid that matches this hex value:
james.cribbs@hiveapp1:~$ cat jstack-output-29519.txt | grep -i 73A7
"application-akka.actor.default-dispatcher-4" #17 prio=5 os_prio=0 cpu=2467719.98ms elapsed=76490.62s tid=0x00007fe9d1c5b800 nid=0x73a7 runnable [0x00007fe9715de000]
According to the end of that post it looks like the issue is related to whatever `application-akka.actor.default-dispatcher-4` is doing.
Cameron, James,
thank you for sharing your thoughts and your investigation results here. I'm sad you/Cameron traded one problem for another.
I'm on vacation for the next two weeks, so I can provide no support. However, you dug so deep, I don't think I can help much further. But one more idea:
You have a process running as wild as possible, and we know that's uncommon (as there are probably hundreds of installations working well). I would now use strace
to trace the system calls the JVM executes while TheHive sits idle. With some luck, you can track down what the JVM tries to do, fails at, and immediately retries. Maybe you can track it to network activity and relate it to some special network hardening you do for your servers, or something like that.
There are statistics and profiling options for strace (see man strace); these will quickly show you which OS calls are made on your machine. If you're unlucky, the JVM spins in a circle outside of system calls.
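For example (just a sketch; replace the PID with the current TheHive main PID from systemctl status), attach to the running JVM, follow all its threads, and print a per-syscall time/count summary on Ctrl-C:
sudo strace -f -c -p 26705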
I hope you'll have success, and I will probably follow the thread when I'm back from my holidays.
Travel safe and thanks for sticking with us for so long! We'll post data here if we are able to find the root cause.
When your server no longer works, can you still ping and log in, or is the system so unresponsive you have to reset it?
I have trouble with a mail server which occasionally saturates one of four cores and no longer reacts to anything. No ping, no login, no fun. This has been happening since I upgraded the OS. Maybe there is something up with the kernel ...?
When it bottoms out we can't use things like systemctl stop thehive.service to shut it down. It requires kill -9 [pid]. This means that if the service is hung, restarting the box hangs during shutdown and requires a hard reset. We haven't seen it lock up to the point that ping doesn't work. We will take that trace command for a spin and keep working. Once again, thanks for your help!
The memory leak issue is caused by the Akka/Play framework http2alpnSupport module: https://github.com/akka/akka-http/issues/2462#issuecomment-703934326 - I have provided a detailed explanation of the heap dump there.
Right now the quick fix is to disable HTTP/2 with http2.enabled=no in application.conf, @nadouani. We have run a JMeter test and did profiling; there are no memory leaks after this setting. Unfortunately HTTP/1.1 is slower than HTTP/2, but that's acceptable compared to having memory leaks. The memory leaks usually occurred after about 48k Cortex jobs with a 1.5 GB heap size. More detail about the memory leaks and the profiling results is in the comment linked above on akka-http.
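For anyone applying this, the change in /etc/cortex/application.conf is essentially the single setting named above (a sketch; depending on the Cortex/Play version the key may need its fuller path, such as play.server.akka.http2.enabled):
# disable HTTP/2 to work around the Akka ALPN memory leak
http2.enabled = no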
Quick question: do you use SSL/HTTPS using Cortex without a reverse proxy?
Hi @nadouani, we do use SSL/HTTPS but no reverse proxy. The Cortex binary is bundled in our custom Dockerfile and deployed in an OpenShift pod, so Cortex is running in headless mode.
That's the reason for the memory leak then. This is a bug in Akka, and that's why we don't recommend enabling SSL directly in Cortex: https://github.com/TheHive-Project/CortexDocs/blob/master/admin/admin-guide.md#https
We try to upgrade the version of playframework whenever possible, so if that issue is solved in Akka, we will fix it by upgrading Playframework to the version that supports the Akka fix.
Thank you @nadouani, I appreciate your response and guidance. Initially I followed the DigitalOcean SSL configuration guide that was mentioned in the Cortex docs; now we will explore nginx, or disable HTTP/2 and proceed for now.
Hello all!
We have an Ubuntu 16.04 headless server which has been running TheHive\Cortex for over a year now, with the following specs: 4 CPU cores, 16 GB RAM, 400 GB storage.
Software:
Problem: After running for a while (roughly once per 48 hours), cortex.service will randomly utilize all resources on the server (see screenshot). This causes thehive.service and cortex.service to report failure in systemctl, and the web app goes down. Additionally, 'sudo systemctl restart cortex.service' hangs, and requires the VM to be reset to get it out of the stuck state.
Evidence: TheHive and Cortex logs (/var/log/thehive/application.log and /var/log/cortex/application.log) both contain many Java OutOfMemory errors: Thehive:
Cortex:
Systemctl:
Journalctl:
Recovery/Support: After rebooting the VM, all services start with no issues, the CPU rests below 3%, and RAM below 4 GB. This issue sounds like a potential memory leak, but I have not noticed it steadily climbing, and I don't think this explains the 100% CPU utilization. I'd consider our implementation pretty standard, but I haven't found others reporting the same issue here or on GitHub. I cannot identify exactly when the issue started, as we've not (knowingly) applied changes to the server which would cause this behavior, and it has been ongoing for roughly a week.
At this point I'm not exactly sure where to look next for more logs, or for a source for the issue. If anyone has any ideas, please let me know!