azgaviperr opened this issue 3 years ago
I'm facing the same issue. I tried the possible solution, but with more than 10 analyzers or a lot of artifacts, Cortex becomes unresponsive. The same thing described by @azgaviperr happens to me in TheHive.
Duplicate of https://github.com/TheHive-Project/Cortex/issues/364 as far as I can tell, see https://github.com/TheHive-Project/Cortex/issues/364#issuecomment-861452321 for a possible root cause.
We observe the same type of event. It happens with e.g. 60-90 total jobs spread over a handful of analyzers.
In application.log we see the following entries for the blank index page:
2021-06-28 14:49:52,920 [ERROR] from org.elastic4play.controllers.Authenticated in application-akka.actor.default-dispatcher-377 - Authentication failure:
session: AuthenticationError User session not found
pki: AuthenticationError Certificate authentication is not configured
key: AuthenticationError Authentication failure
init: AuthenticationError Use of initial user is forbidden because users exist in database
2021-06-28 14:49:52,920 [INFO] from org.thp.cortex.services.ErrorHandler in application-akka.actor.default-dispatcher-377 - GET /api/job/cgGkUnoBTzYZDjIcEFjz/waitreport?atMost=1%20second returned 401
org.elastic4play.AuthenticationError: Authentication failure
at org.elastic4play.controllers.Authenticated.$anonfun$getContext$4(Authenticated.scala:272)
at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:56)
at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:93)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85)
at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:93)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:48)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:48)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
2021-06-28 14:49:53,931 [ERROR] from org.elastic4play.controllers.Authenticated in application-akka.actor.default-dispatcher-360 - Authentication failure:
session: AuthenticationError User session not found
pki: AuthenticationError Certificate authentication is not configured
key: AuthenticationError Authentication failure
init: AuthenticationError Use of initial user is forbidden because users exist in database
Edit:
And for the Elasticsearch instance running on the same machine, we get the following error, which might be of interest in this case:
[2021-06-28T14:47:07,236][WARN ][r.suppressed ] [cortex01] path: /cortex_6/_search, params: {scroll=60000ms, index=cortex_6}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:601) [elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:332) [elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:636) [elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:415) [elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$000(AbstractSearchAsyncAction.java:59) [elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:264) [elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:62) [elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:48) [elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:404) [elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.transport.TransportService$6.handleException(TransportService.java:743) [elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1288) [elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1397) [elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1371) [elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:50) [elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.transport.TransportChannel.sendErrorResponse(TransportChannel.java:45) [elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:40) [elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:77) [elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:28) [elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33) [elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) [elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.11.1.jar:7.11.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.elasticsearch.ElasticsearchException: Trying to create too many scroll contexts. Must be less than or equal to: [500]. This limit can be set by changing the [search.max_open_scroll_context] setting.
at org.elasticsearch.search.SearchService.createAndPutReaderContext(SearchService.java:643) ~[elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.search.SearchService.createOrGetReaderContext(SearchService.java:627) ~[elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:420) ~[elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.search.SearchService.access$500(SearchService.java:135) ~[elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:395) ~[elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47) ~[elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62) ~[elasticsearch-7.11.1.jar:7.11.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.11.1.jar:7.11.1]
... 6 more
Hello guys,
can any of you who have the issue give us some insight into the number of observables, jobs, and analyzers involved, and which analyzers?
The error is clear:
org.elasticsearch.ElasticsearchException: Trying to create too many scroll contexts. Must be less than or equal to: [500]. This limit can be set by changing the [search.max_open_scroll_context] setting.
So there is a limit that is reached here, and we need to know which one it is.
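As a stopgap, the limit named in the error can be raised cluster-wide through Elasticsearch's cluster update settings API. The sketch below (hypothetical host URL; adjust to your deployment) only builds the request body; sending it is left to the reader, e.g. via curl. Note this treats the symptom: if scroll contexts are leaking, a higher cap only delays the failure.

```python
import json

# Hypothetical Elasticsearch endpoint; adjust to your deployment.
ES_URL = "http://localhost:9200"

# Persistent cluster-wide setting raising the scroll-context cap
# above the default of 500 mentioned in the error.
settings_body = {
    "persistent": {
        "search.max_open_scroll_context": 1000
    }
}

# Send as: PUT {ES_URL}/_cluster/settings with this JSON body.
print(json.dumps(settings_body, indent=2))
```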
Sure,
It happens when running one observable through 16+ different analyzers, or two observables with 10 different analyzers each.
Analyzers:
In real production I will not use that many, but even with half of them configured, running more than 2 observables at the same time is enough for Cortex to fail.
@nadouani I don't think the error is that clear, please also see https://github.com/TheHive-Project/Cortex/issues/364#issuecomment-861452321.
It's not so simple: sometimes it happens when running about 10 observables on one analyzer, sometimes it runs fine. Most often it happens when running multiple (3) observables on 10+ analyzers. And when analyzers fail with an error (like MISP not being reachable), this seems to hit Cortex harder.
This makes Cortex unresponsive with only one IP address selected.
I think it is mostly related to saving artifacts returned by the analyzers into ES. That seems to fill up the connections to ES and make Cortex stuck. This can happen already with just a single analyzer running/finishing.
My ES is also used for TheHive's index, and while Cortex is unavailable, TheHive continues to work correctly. This seems to be an issue on the Cortex side, maybe bad queuing of HTTP requests.
I also had the issue today with one observable run against the MISP analyzer.
Yes, with filling up the connections to ES I do exactly mean the Cortex HTTP connection pool and not the ES side. Our ES cluster is pretty big and does not even show any signs of an issue while Cortex is stuck. Also see the issue I linked above.
I was finally able to work around this issue by modifying cortexutils to not return any artifacts, so that trying to store them in ES no longer fills up all the connections and threads. This is the change I made in /lib/python3.6/site-packages/cortexutils/analyzer.py, and now our Cortex is stable again:
def report(self, full_report, ensure_ascii=False):
    """Returns a json dict via stdout.

    :param full_report: Analyzer results as dict.
    :param ensure_ascii: Force ascii output. Default: False"""
    summary = {}
    try:
        summary = self.summary(full_report)
    except Exception:
        pass
    super(Analyzer, self).report({
        'success': True,
        'summary': summary,
        'artifacts': [],  # self.artifacts(full_report), # WORKAROUND HERE!
        'full': full_report
    }, ensure_ascii)
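To illustrate the workaround's effect in isolation, here is a minimal, self-contained mock (the helper names are hypothetical stand-ins, not the real cortexutils classes): the emitted report keeps the summary and full result but always carries an empty artifacts list, so nothing gets stored back into ES.

```python
import json

# Hypothetical stand-in for cortexutils' base report method, which
# serializes the payload dict; shown only to illustrate the effect.
def base_report(payload, ensure_ascii=False):
    return json.dumps(payload, ensure_ascii=ensure_ascii)

def patched_report(full_report):
    # Mirrors the patched Analyzer.report: summary still produced,
    # artifacts forced to [] so none are saved to Elasticsearch.
    return base_report({
        "success": True,
        "summary": {},
        "artifacts": [],  # WORKAROUND: artifacts suppressed
        "full": full_report,
    })

out = json.loads(patched_report({"verdict": "malicious"}))
print(out["artifacts"])  # []
```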
@mback2k Any possible impact of this change besides making it work? Maybe when you need to import observables generated by analyzers?
Of course the artifacts won't be saved anymore, but this is a trade-off I am willing to make for now.
Thank you guys for your comments. I understand this is a blocker.
From @mback2k's comments, the issue could be saving the artifacts discovered by the jobs. Basically, @mback2k, you don't need to change the cortexutils code, as extracting artifacts is an option that you can disable per analyzer. If disabled, Cortex won't return any artifacts from the job. Could you confirm you have the option enabled?
@nadouani I will check this on Monday, but I think the configuration only allows adjusting the automatic extraction of artifacts. If an analyzer provides artifacts on its own, e.g. from a sandbox report, then the option has no effect.
Also, the main root cause is still that requests to ES are handled in a FIFO fashion by the asynchronous Akka system. If an analyzer job finishes with hundreds of artifacts, saving them to ES blocks all other kinds of requests to ES, including user authentication. With at most 30 concurrent connections to an ES cluster (10 per host, with a maximum of 30 connections in the pool), this can take some time and quickly gets out of hand if a lot of jobs are being run.
Yes, I now understand what your conclusion is. We will figure out how and when to fix that ;)
Thanks a lot! I would propose introducing some kind of prioritization for the requests to ES. ES requests that are part of a browser/API request should have a higher priority than background ES requests (like saving the results of finished jobs). The latter should probably be done in a non-blocking background fashion anyway; background requests shouldn't be in the way of foreground requests. Just my two cents. ;-)
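The proposed prioritization could look roughly like the following sketch (labels and priorities are illustrative, not Cortex code): instead of one FIFO queue, requests carry a priority, so a user-facing authentication request is dequeued before a large background artifact save, even if the save was submitted first. A per-priority sequence number preserves FIFO order within each class.

```python
import itertools
import queue

# Minimal sketch of prioritized ES request scheduling: foreground
# (browser/API) work is served before background work (artifact saves),
# instead of plain FIFO order.
FOREGROUND, BACKGROUND = 0, 1
_seq = itertools.count()  # tie-breaker: keeps FIFO order per priority

work = queue.PriorityQueue()

def submit(priority, label):
    work.put((priority, next(_seq), label))

# Interleaved arrival: a big artifact save lands first, then a login.
submit(BACKGROUND, "save 300 artifacts to ES")
submit(FOREGROUND, "authenticate user session")

order = [work.get()[2] for _ in range(2)]
print(order)  # foreground request is dequeued first
```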
@nadouani I just verified: we already had the global and per-analyzer setting like this: "auto_extract_artifacts": false, but this did not help with all analyzers, as described above.
@nadouani @To-om any update on fixing this issue? :eyes:
Hello, still looking forward to a fix for this issue.
Yes, same here. @nadouani does StrangeBee provide paid support/development for issues like this? I would be interested.
Hello, any update on the matter?
Request Type
Bug
Work Environment
Problem Description
When sending a big quantity of artefacts to Cortex to be analyzed by a few analyzers, Cortex became unresponsive. The front page is blank while returning code 200, and it is impossible to get access or communicate using the API. Once all the jobs that kept running have finished, the service is available again.
The issue is that reports are not sent back to TheHive; you need to rerun the analyzer, and the result is then returned directly (cached result).
Steps to Reproduce
Possible Solutions
I added this to the application.conf; it helped in some cases but not all.
Complementary information