TheHive-Project / Cortex

Cortex: a Powerful Observable Analysis and Active Response Engine
https://thehive-project.org
GNU Affero General Public License v3.0

[BUG] Cortex is unresponsive if too many jobs #374

Open azgaviperr opened 3 years ago

azgaviperr commented 3 years ago

Request Type

Bug

Work Environment

| Question | Answer |
|----------|--------|
| OS version (server) | Docker Swarm |
| OS version (client) | Viperr |
| Virtualized Env. | True |
| Dedicated RAM | 16 GB |
| vCPU | 8 |
| Cortex version / git hash | 3.1.1 |
| Package Type | RPM, DEB, Docker, Binary, From source |
| Index type | Elasticsearch |
| Attachments storage | Local (GlusterFS) |
| Browser type & version | Firefox and Chrome |

Problem Description

When sending a large quantity of artifacts to Cortex to be analyzed by a few analyzers, Cortex becomes unresponsive. The front page is blank while still answering with code 200, and it is impossible to access or communicate with it through the API. Once all the jobs, which continue running, have finished, the service becomes available again.

The issue is that reports are not sent back to TheHive; you need to rerun the analyzer, and the result is then returned immediately (cached result).

Steps to Reproduce

  1. Add some artifacts
  2. Run them through a large number of analyzers
  3. Observe the unresponsiveness

Possible Solutions

I added the following to application.conf; it helped in some cases, but not all.


akka {
  log-config-on-start = on

  actor {
    default-dispatcher {
      # Cap the threads available to the default dispatcher
      fork-join-executor {
        parallelism-max = 16
      }
      thread-pool-executor {
        fixed-pool-size = 16
      }
      # Process one message per thread before yielding, for fairness
      throughput = 1
    }
    default-blocking-io-dispatcher {
      fork-join-executor {
        parallelism-max = 32
      }
      thread-pool-executor {
        fixed-pool-size = 32
      }
      throughput = 1
    }
  }
}

Complementary information


D4rkw0lv3s commented 3 years ago

I'm facing the same issue. I tried the possible solution, but with more than 10 analyzers or a lot of artifacts, Cortex becomes unresponsive. The same thing as described by @azgaviperr happens to me in TheHive.

mback2k commented 3 years ago

Duplicate of https://github.com/TheHive-Project/Cortex/issues/364 as far as I can tell, see https://github.com/TheHive-Project/Cortex/issues/364#issuecomment-861452321 for a possible root cause.

danniranderis commented 3 years ago

We observe the same type of event. It happens with e.g. 60-90 total jobs spread over a handful of analyzers.

In application.log we see the following entries while the index page is blank:

2021-06-28 14:49:52,920 [ERROR] from org.elastic4play.controllers.Authenticated in application-akka.actor.default-dispatcher-377 - Authentication failure:
        session: AuthenticationError User session not found
        pki: AuthenticationError Certificate authentication is not configured
        key: AuthenticationError Authentication failure
        init: AuthenticationError Use of initial user is forbidden because users exist in database
2021-06-28 14:49:52,920 [INFO] from org.thp.cortex.services.ErrorHandler in application-akka.actor.default-dispatcher-377 - GET /api/job/cgGkUnoBTzYZDjIcEFjz/waitreport?atMost=1%20second returned 401
org.elastic4play.AuthenticationError: Authentication failure
        at org.elastic4play.controllers.Authenticated.$anonfun$getContext$4(Authenticated.scala:272)
        at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
        at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
        at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:56)
        at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:93)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85)
        at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:93)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:48)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:48)
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
        at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
2021-06-28 14:49:53,931 [ERROR] from org.elastic4play.controllers.Authenticated in application-akka.actor.default-dispatcher-360 - Authentication failure:
        session: AuthenticationError User session not found
        pki: AuthenticationError Certificate authentication is not configured
        key: AuthenticationError Authentication failure
        init: AuthenticationError Use of initial user is forbidden because users exist in database

Edit:

And from the Elasticsearch instance running on the same machine, we get the following error, which might be of interest in this case:

[2021-06-28T14:47:07,236][WARN ][r.suppressed             ] [cortex01] path: /cortex_6/_search, params: {scroll=60000ms, index=cortex_6}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:601) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:332) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:636) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:415) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$000(AbstractSearchAsyncAction.java:59) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:264) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:62) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:48) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:404) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.transport.TransportService$6.handleException(TransportService.java:743) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1288) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1397) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1371) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:50) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.transport.TransportChannel.sendErrorResponse(TransportChannel.java:45) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:40) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:77) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:28) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.11.1.jar:7.11.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.elasticsearch.ElasticsearchException: Trying to create too many scroll contexts. Must be less than or equal to: [500]. This limit can be set by changing the [search.max_open_scroll_context] setting.
        at org.elasticsearch.search.SearchService.createAndPutReaderContext(SearchService.java:643) ~[elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.search.SearchService.createOrGetReaderContext(SearchService.java:627) ~[elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:420) ~[elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.search.SearchService.access$500(SearchService.java:135) ~[elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:395) ~[elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47) ~[elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62) ~[elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.11.1.jar:7.11.1]
        ... 6 more
nadouani commented 3 years ago

Hello guys,

Can any of you who have the issue give us some insight into the number of observables, jobs, and analyzers involved, and which analyzers?

The error is clear:

org.elasticsearch.ElasticsearchException: Trying to create too many scroll contexts. Must be less than or equal to: [500]. This limit can be set by changing the [search.max_open_scroll_context] setting.

So a limit is being reached here, and we need to know which one it is.
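
For reference, the scroll-context limit named in that error is a dynamic Elasticsearch cluster setting; raising it is only a mitigation and does not address whatever is leaking scroll contexts. A minimal sketch using the Python requests library, assuming a locally reachable ES node (host, auth, and the new limit are placeholders to adjust):

# Mitigation sketch only: raise the limit from the Elasticsearch error above.
# Host, credentials, and the new value are assumptions; adjust for your cluster.
import requests

resp = requests.put(
    "http://localhost:9200/_cluster/settings",
    json={"persistent": {"search.max_open_scroll_context": 1000}},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())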

D4rkw0lv3s commented 3 years ago

Sure,

It happens when running one observable through 16+ different analyzers, or two observables with 10 different analyzers each.

Analyzers:

In real production I will not use that many, but if I have half of them configured and run more than two observables at the same time, it can't handle it.

mback2k commented 3 years ago

@nadouani I don't think the error is that clear, please also see https://github.com/TheHive-Project/Cortex/issues/364#issuecomment-861452321.

azgaviperr commented 2 years ago

It's not so simple: sometimes it happens when using about 10 observables on one analyzer, and sometimes it runs fine. Most often it happens when running multiple (3) observables against 10+ analyzers. And when analyzers fail with an error (like MISP not being reachable), this seems to hit Cortex harder.

(see attached screenshot) This makes Cortex unresponsive with only one IP address selected.

mback2k commented 2 years ago

I think it is mostly related to saving the artifacts returned by the analyzers into ES. That seems to fill up the connections to ES and leaves Cortex stuck. This can already happen with just a single analyzer running/finishing.

azgaviperr commented 2 years ago

My ES is also used for TheHive index, and while Cortex is unavailable, TheHive continues to work correctly. This seems to be an issue on the Cortex side, maybe bad queuing of HTTP requests.

I also had the issue today with one observable run against the MISP analyzer.

mback2k commented 2 years ago

Yes, by "filling up the connections to ES" I mean exactly the Cortex HTTP connection pool, not the ES side. Our ES cluster is pretty big and does not show any signs of an issue while Cortex is stuck. Also see the issue I linked above.

mback2k commented 2 years ago

I was finally able to work around this issue by modifying cortexutils so that it does not return any artifacts; trying to store them in ES then no longer fills up all the connections and threads. This is the change I made in /lib/python3.6/site-packages/cortexutils/analyzer.py, and our Cortex is now stable again:

    def report(self, full_report, ensure_ascii=False):
        """Returns a json dict via stdout.

        :param full_report: Analyzer results as dict.
        :param ensure_ascii: Force ascii output. Default: False"""

        summary = {}
        try:
            summary = self.summary(full_report)
        except Exception:
            pass

        super(Analyzer, self).report({
            'success': True,
            'summary': summary,
            'artifacts': [], #self.artifacts(full_report), # WORKAROUND HERE!
            'full': full_report
        }, ensure_ascii)
azgaviperr commented 2 years ago

@mback2k Any possible impact of this change besides making things work? Perhaps when you need to import observables generated by analyzers?

mback2k commented 2 years ago

Of course the artifacts won't be saved anymore, but this is a trade-off I am willing to make for now.

nadouani commented 2 years ago

Thank you guys for your comments. I understand this is a blocker.

From @mback2k's comments, the issue could be saving the artifacts discovered by the jobs. Basically, @mback2k, you don't need to change the cortexutils code, as extracting the artifacts is an option that you can simply disable per analyzer. If disabled, Cortex won't return any artifacts from the job. Could you confirm you have the option enabled?

mback2k commented 2 years ago

@nadouani I will check this on Monday, but I think the configuration only allows adjusting the automatic extraction of artifacts. If an analyzer provides artifacts on its own, e.g. from a sandbox report, then the option won't have any effect.

Also, the main root cause is still that requests to ES are handled in a FIFO fashion by the asynchronous Akka system. If an analyzer job finishes with hundreds of artifacts, saving these to ES blocks all other kinds of requests to ES, including user authentication. With at most 30 concurrent connections to an ES cluster (10 per host, with a maximum of 30 connections in the pool), this can take some time and quickly gets out of hand if a lot of jobs are being run.

nadouani commented 2 years ago

Yes, I now understand what your conclusion is. We will figure out how and when to fix that ;)

mback2k commented 2 years ago

Thanks a lot! I would propose introducing some kind of prioritization for the requests to ES. ES requests made as part of a browser/API request should have higher priority than background ES requests (like saving the results of finished jobs). The latter should probably be done in a non-blocking background fashion anyway, i.e. background requests shouldn't get in the way of foreground requests. Just my two cents. ;-)
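
As an illustration of the prioritization idea above (a minimal sketch only, not Cortex code; the priority levels and the example request descriptions are made up for the illustration):

# Illustrative sketch: serve foreground (browser/API) ES requests before
# background ones such as saving job artifacts. Not Cortex code.
import queue
import threading

FOREGROUND, BACKGROUND = 0, 1  # lower value is served first

es_requests = queue.PriorityQueue()

def submit(priority, seq, description):
    # seq keeps FIFO ordering among requests of the same priority
    es_requests.put((priority, seq, description))

def worker():
    while True:
        priority, _, description = es_requests.get()
        label = "foreground" if priority == FOREGROUND else "background"
        print(f"handling {label} request: {description}")
        es_requests.task_done()

submit(BACKGROUND, 1, "save 500 artifacts from a finished job")
submit(FOREGROUND, 2, "authenticate user session")

threading.Thread(target=worker, daemon=True).start()
es_requests.join()  # the foreground request is handled first despite arriving later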

mback2k commented 2 years ago

@nadouani I just verified: we already had the global and per-analyzer setting "auto_extract_artifacts": false, but this did not help with all analyzers, as described above.

mback2k commented 2 years ago

@nadouani @To-om any update on fixing this issue? :eyes:

azgaviperr commented 2 years ago

Hello, still looking forward to a fix for this issue.

mback2k commented 2 years ago

Yes, same here. @nadouani does StrangeBee provide paid support/development for issues like this? I would be interested.

azgaviperr commented 2 years ago

Hello, any update on the matter?