TheHive-Project / Cortex

Cortex: a Powerful Observable Analysis and Active Response Engine
https://thehive-project.org
GNU Affero General Public License v3.0

Cortex performance issues #229

Open joseluratm opened 5 years ago

joseluratm commented 5 years ago

Request Type

Bug

Work Environment

Question                   | Answer
OS version (server)        | Ubuntu
Cortex version / git hash  | 3.0.0-1
Package Type               | Binary
ElasticSearch version      | 6.7
Browser type & version     | Chrome

Problem Description

After updating Cortex and TheHive to the latest versions (3.0.0-1 and 3.4.0-1 respectively), I have noticed a drop in performance when executing Cortex analyzers, mainly when I launch several of them at once from TheHive. In the Cortex log I receive the following message:

[info] o.t.c.s.ErrorHandler - GET /api/job/Xq1zYm0Bdo8Y_Io8dnA8/waitreport?atMost=1%20minute returned 500
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://application/user/audit#1151591280]] after [60000 ms]. Message of type [org.thp.cortex.services.AuditActor$Register]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
        at akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:635)
        at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:650)
        at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
        at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:874)
        at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:113)
        at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:107)
        at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:872)

And then the next one:

[error] o.e.d.DBConfiguration - ElasticSearch request failure: DELETE:/_search/scroll/?
StringEntity({"scroll_id":["DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAIrlFjdzaTEtXzY1UVgyZFN6SGdqTnBKVncAAAAAAACK5BY3c2kxLV82NVFYMmRTekhnak5wSlZ3AAAAAAAAiuYWN3NpMS1fNjVRWDJkU3pIZ2pOcEpWdwAAAAAAAIrnFjdzaTEtXzY1UVgyZFN6SGdqTnBKVncAAAAAAACK6BY3c2kxLV82NVFYMmRTekhnak5wSlZ3"]},Some(application/json))
 => ElasticError(404,404,None,None,None,List(),None)
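
For what it is worth, the 404 on the DELETE /_search/scroll call suggests the scroll context had already expired or been cleared by the time Cortex tried to delete it. One way to watch how many scroll contexts are open while the analyzers run is the Elasticsearch node stats API; below is a minimal sketch, assuming ES 6.x is reachable on localhost:9200 without authentication (adjust ES_URL otherwise):

```python
# Sketch: watch how many search/scroll contexts Elasticsearch keeps open while
# the analyzers run. ES_URL is a placeholder; adjust it to your ES node.
import time

import requests

ES_URL = "http://localhost:9200"

for _ in range(10):
    stats = requests.get(ES_URL + "/_nodes/stats/indices/search", timeout=10).json()
    for node_id, node in stats["nodes"].items():
        search = node["indices"]["search"]
        print(node.get("name", node_id),
              "open_contexts:", search.get("open_contexts"),
              "scroll_current:", search.get("scroll_current"))
    time.sleep(5)
```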

I have noticed that this problem appears when the auto-extract flag is enabled on several analyzers; when I disable it, the failure does not occur. Also (I don't know if it is related), when this happens the web UI becomes inaccessible and you cannot navigate through it.

Steps to Reproduce

  1. Enable the auto-extract flag in the Shodan and VirusTotal analyzers
  2. Run these analyzers over several IPs (120 in my case; see the sketch after these steps)
  3. Wait
  4. Try to access the website
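
For reference, although I launch the analyzers from TheHive, a similar bulk load can be generated directly against the Cortex API. This is only a rough sketch, assuming an org-level API key and the analyzer ID of your Shodan or VirusTotal worker; the waitreport call is the same endpoint that shows up in the 500 errors above:

```python
# Rough reproduction sketch: submit one analyzer job per IP through the Cortex
# API, then poll the waitreport endpoint that appears in the errors above.
# CORTEX_URL, API_KEY and ANALYZER_ID are placeholders for your own setup.
import requests

CORTEX_URL = "http://localhost:9001"          # hypothetical Cortex address
API_KEY = "YOUR_API_KEY"                      # org-level API key
ANALYZER_ID = "YOUR_ANALYZER_ID"              # e.g. the Shodan or VirusTotal worker
HEADERS = {"Authorization": "Bearer " + API_KEY}

ips = ["10.0.0.%d" % i for i in range(1, 121)]  # 120 observables, as in step 2

job_ids = []
for ip in ips:
    r = requests.post(CORTEX_URL + "/api/analyzer/" + ANALYZER_ID + "/run",
                      headers=HEADERS,
                      json={"data": ip, "dataType": "ip", "tlp": 2},
                      timeout=30)
    r.raise_for_status()
    job_ids.append(r.json()["id"])

# Waiting for all the reports mirrors what TheHive does; the atMost value is the
# same 1-minute wait that shows up in the AskTimeoutException above.
for job_id in job_ids:
    r = requests.get(CORTEX_URL + "/api/job/" + job_id + "/waitreport",
                     headers=HEADERS,
                     params={"atMost": "1 minute"},
                     timeout=90)
    print(job_id, r.status_code)
```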

In previous versions this worked like lightning for me.

Thank you for all your efforts!!

torsolaso commented 5 years ago

Yes, I noticed some "lag". When launching an analyzer, the output is delayed a bit. I don't know why, but what you said is true: if I disable auto extract on Cortex, I get the output quickly.

How can I help you debug the error?

joseluratm commented 5 years ago

> Yes, I noticed some "lag". When launching an analyzer, the output is delayed a bit. I don't know why, but what you said is true: if I disable auto extract on Cortex, I get the output quickly.
>
> How can I help you debug the error?

I thought that by disabling auto-extract (even though it means losing great functionality) I could at least keep working, but it gets to the point where I have to restart Cortex every so often. The truth is that I do not know where to start tackling the problem...

IAreKyleW00t commented 5 years ago

We are also having the same issue, even in a completely clean environment.

Running a single analyzer over a number of observables (auto-extract is off) causes TheHive and Cortex to completely lock up and become unusable. I have not found any specific errors that were not mentioned before.

Before updating, everything was just fine - minor slowdowns when running A LOT of analyzers, but still usable.

torsolaso commented 5 years ago

I think the loss of performance comes from the "elastic4play" library, because if you run a test, for example with a large list of observables to import from an analyzer report, or the same test with a large list of case observables and click on "add IoC tag", the process takes much longer than in the previous version.

Could that be possible?

joseluratm commented 5 years ago

> We are also having the same issue, even in a completely clean environment.
>
> Running a single analyzer over a number of observables (auto-extract is off) causes TheHive and Cortex to completely lock up and become unusable. I have not found any specific errors that were not mentioned before.
>
> Before updating, everything was just fine - minor slowdowns when running A LOT of analyzers, but still usable.

> I think the loss of performance comes from the "elastic4play" library, because if you run a test, for example with a large list of observables to import from an analyzer report, or the same test with a large list of case observables and click on "add IoC tag", the process takes much longer than in the previous version.
>
> Could that be possible?

@IAreKyleW00t I ran the test in an empty environment and the same thing happens to me. I have also tested in TheHive what @torsolaso describes, and it is true that there is a drop in performance there as well.

Here is a summary of the tests I have done, in case it helps:

  1. I have changed the elastic4play version to previous releases (down to the last one compatible with Elasticsearch 6).
  2. I have changed the Elasticsearch version several times (6.5, 6.6, 6.7).
  3. I have tried upgrading the sksamuel.elastic4s-http* packages to the latest versions.
  4. I have debugged the HTTP requests from TheHive and Cortex and found nothing unusual.

The truth is that I don't know what else to try... I think the problem lies mainly in the elastic4s library and in the change of the Elasticsearch protocol (from TCP to HTTP), but I'm not sure.

Maybe the fault lies in some special setting that has to be configured in Elasticsearch and that I am not aware of.
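
One thing that might help separate Elasticsearch itself from elastic4s is timing a raw search + scroll cycle over HTTP, bypassing the client library entirely. A minimal sketch, assuming ES 6.x on localhost:9200; ES_URL and INDEX are placeholders and must be adapted to your installation:

```python
# Sketch: time a raw search + scroll cycle against Elasticsearch over plain HTTP,
# without going through elastic4s, to see whether ES itself is the slow part.
# ES_URL and INDEX are placeholders for your own setup.
import time

import requests

ES_URL = "http://localhost:9200"   # hypothetical ES address
INDEX = "the_index_cortex_uses"    # replace with the index of your installation

t0 = time.time()
r = requests.post(ES_URL + "/" + INDEX + "/_search",
                  params={"scroll": "1m"},
                  json={"size": 1000, "query": {"match_all": {}}},
                  timeout=30)
r.raise_for_status()
body = r.json()
scroll_id = body["_scroll_id"]
print("initial search:", round(time.time() - t0, 3), "s")

# Walk the remaining pages through the scroll API.
while True:
    t0 = time.time()
    r = requests.post(ES_URL + "/_search/scroll",
                      json={"scroll": "1m", "scroll_id": scroll_id},
                      timeout=30)
    r.raise_for_status()
    body = r.json()
    hits = body["hits"]["hits"]
    print("scroll page:", len(hits), "hits in", round(time.time() - t0, 3), "s")
    if not hits:
        break
    scroll_id = body["_scroll_id"]

# Clean up the scroll context; a 404 here would mean it already expired,
# which matches the DELETE error reported earlier in this issue.
requests.delete(ES_URL + "/_search/scroll",
                json={"scroll_id": [scroll_id]},
                timeout=30)
```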

I will continue investigating and I hope to find something. Thanks!

torsolaso commented 5 years ago

Hi all,

I have tried everything I can think of to fix this, and the current situation is very frustrating.

@To-om @nadouani @saadkadhi Are you aware of this limitation? Do you have any plans to address this issue?

Maybe we can contact the Elastic community to try to locate the problem.

Regards,

xme commented 4 years ago

Since I upgraded my environment, I'm facing the same situation... Has anybody successfully fixed this issue or found a workaround? (Disabling auto-extract seems like a big loss of usability to me.)

KRUXLEX commented 4 years ago

We have the same issue here: the same error, the same symptoms. I know that the TheHive project wants to switch to some "Grapa" (I don't remember the exact name), but before that happens someone should fix this issue.

BrijJhala commented 4 years ago

Cortex: we have observed that running one observable against one analyzer takes 4-9 seconds to get a response back (scan, status and job detail) under a 50-user simulated load using JMeter hitting the REST endpoint. For 100 users, it takes 18 seconds for one analyzer/one observable. I hardly see the CPU spike above 50% and the heap is not being fully utilized. I tried various options: parallelism-min = 8, parallelism-factor = 4.0 and parallelism-max = 64 (the maximum number of threads available for analyzer parallelism). No luck. I would appreciate any comments or recommendations on this.

If we call the same analyzer (IBM X-Force lookup) directly, it takes roughly 2 seconds with 50 or 100 simulated users, so it is quite a huge difference when we analyze through Cortex.
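
In case it is useful to reproduce this outside JMeter, below is a rough Python equivalent of the scenario: N concurrent users, each running one observable against one analyzer and waiting for the report. CORTEX_URL, API_KEY and ANALYZER_ID are placeholders; the run and report endpoints are the ones already mentioned in this thread:

```python
# Rough Python equivalent of the JMeter scenario: N concurrent "users", each
# running one observable against one analyzer and waiting for the report.
# CORTEX_URL, API_KEY and ANALYZER_ID are placeholders for your environment.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

CORTEX_URL = "http://localhost:9001"
API_KEY = "YOUR_API_KEY"
ANALYZER_ID = "YOUR_ANALYZER_ID"
HEADERS = {"Authorization": "Bearer " + API_KEY}
USERS = 50  # bump to 100 or 200 to match the tests above

def one_user(_i):
    start = time.time()
    run = requests.post(CORTEX_URL + "/api/analyzer/" + ANALYZER_ID + "/run",
                        headers=HEADERS,
                        json={"data": "1.2.3.4", "dataType": "ip", "tlp": 2},
                        timeout=120)
    run.raise_for_status()
    job_id = run.json()["id"]
    # Block until the report is ready (the waitreport variant of the report endpoint).
    requests.get(CORTEX_URL + "/api/job/" + job_id + "/waitreport",
                 headers=HEADERS, params={"atMost": "1 minute"}, timeout=120)
    return time.time() - start

with ThreadPoolExecutor(max_workers=USERS) as pool:
    latencies = list(pool.map(one_user, range(USERS)))

print("%d users: avg %.2fs, max %.2fs"
      % (USERS, sum(latencies) / len(latencies), max(latencies)))
```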

BrijJhala commented 4 years ago

Most recent update: even 200 users with a single analyzer in the JMeter load test, plus 3 Cortex pods, starts throwing timeout issues.

confusedsecuritydudes commented 4 years ago

Fresh build this morning and I am experiencing the same thing: the whole of Cortex seemingly freezes until it has finished (with extract observables enabled). CPU and disk utilisation seem to be really low during this. Sometimes, after a refresh of the browser, it seems to burst into life, but that may just be due to timing.

torsolaso commented 4 years ago

Today I repeated these tests and got the same results. I can see the analyzer's report on Cortex (even without auto extract), while on TheHive the "engine icon" keeps looping.

I use curl and time to measure how long it takes to retrieve the report: time curl 'http://<cortex>/api/job/<jobID>/report'. The results are:

real    1m8.839s
user    0m0.026s
sys 0m0.032s

On cortex only one error is shown:

[error] c.c.s.CortexAnalyzerSrv - Request of status of job <jobID> in cortex LOCAL CORTEX fails and the number of errors reaches the limit, aborting
java.util.concurrent.TimeoutException: Request timeout to cortex/<cortexIP:port> after 61000 ms
        at play.shaded.ahc.org.asynchttpclient.netty.timeout.TimeoutTimerTask.expire(TimeoutTimerTask.java:43)
        at play.shaded.ahc.org.asynchttpclient.netty.timeout.RequestTimeoutTimerTask.run(RequestTimeoutTimerTask.java:48)
        at play.shaded.ahc.io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:663)
        at play.shaded.ahc.io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:738)
        at play.shaded.ahc.io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:466)
        at java.lang.Thread.run(Thread.java:748)

On TheHive the state changes from "InProgress" to "Success", but in the UI I get the triangle.

[info] o.t.c.s.JobSrv - Looking for similar job in the last 10 minutes (worker=<id>, dataType=<type>, data=Left(<value>), tlp=2, parameters={})
[info] o.t.c.s.AuditActor - Register new listener for job UQiCSnQB7jdyaIZ-__hR (Actor[akka://application/temp/$jb])
[info] o.t.c.s.AuditActor - Job UQiCSnQB7jdyaIZ-__hR has be updated (JsDefined("InProgress"))
[info] o.t.c.s.ProcessJobRunnerSrv - Execute <analizer.py> in <folder>, timeout is 30 minutes
[info] o.t.c.s.ProcessJobRunnerSrv -   Job UQiCSnQB7jdyaIZ-__hR: You are using python-magic, though this module requires file-magic. Disabling magic usage due to incompatibilities.
[info] o.t.c.s.AuditActor - Unregister listener for job evvhQ3QBsmikJxQXfktI (Actor[akka://application/temp/$ib])
[info] o.t.c.s.ErrorHandler - GET /api/job/evvhQ3QBsmikJxQXfktI/waitreport?atMost=1%20minute returned 500
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://application/user/audit#92198841]] after [60000 ms]. Message of type [org.thp.cortex.services.AuditActor$Register]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
        at akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:635)
        at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:650)
        at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
        at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:874)
        at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:113)
        at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:107)
        at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:872)
        at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
        at akka.actor.LightArrayRevolverScheduler$$anon$3.executeBucket$1(LightArrayRevolverScheduler.scala:279)
        at akka.actor.LightArrayRevolverScheduler$$anon$3.nextTick(LightArrayRevolverScheduler.scala:283)
[info] o.t.c.s.AuditActor - Unregister listener for job UQiCSnQB7jdyaIZ-__hR (Actor[akka://application/temp/$jb])
[info] o.t.c.s.ErrorHandler - GET /api/job/UQiCSnQB7jdyaIZ-__hR/waitreport?atMost=1%20minute returned 500
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://application/user/audit#92198841]] after [60000 ms]. Message of type [org.thp.cortex.services.AuditActor$Register]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
        at akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:635)
        at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:650)
        at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
        at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:874)
        at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:113)
        at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:107)
        at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:872)
        at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
        at akka.actor.LightArrayRevolverScheduler$$anon$3.executeBucket$1(LightArrayRevolverScheduler.scala:279)
        at akka.actor.LightArrayRevolverScheduler$$anon$3.nextTick(LightArrayRevolverScheduler.scala:283)
[info] o.t.c.s.AuditActor - Register new listener for job UQiCSnQB7jdyaIZ-__hR (Actor[akka://application/temp/$kb])
[info] o.t.c.s.AuditActor - Register new listener for job evvhQ3QBsmikJxQXfktI (Actor[akka://application/temp/$lb])
[info] o.t.c.s.AuditActor - Register new listener for job evvhQ3QBsmikJxQXfktI (Actor[akka://application/temp/$mb])
[info] o.t.c.s.AuditActor - Job UQiCSnQB7jdyaIZ-__hR has be updated (JsDefined("Success"))
[info] o.t.c.s.JobSrv - Job UQiCSnQB7jdyaIZ-__hR has finished with status Success

BrijJhala commented 4 years ago

@nadouani, it would be great if you could shed more light on this issue. We have been running a JMeter test against the Cortex REST endpoints for one analyzer (/run and /reports); it creates around 24k Cortex jobs. Here are the issues once it reaches 11k or 24k Cortex jobs:

  1. Our Node.js application [cortex-connector pod] hits the Cortex API /run and /reports [cortex pod] using the axios HTTP client. At that point our Node.js application can no longer reach the Cortex REST endpoint, although we can hit Cortex directly without any issue.

  2. Once Cortex is restarted, our Node.js application is able to hit /run and /report again using our HTTP client (axios).

So, do the Cortex Akka threads hold the client connections forever, even though the client has been restarted?

[cortex-connector pod] ----> [ cortex pod]
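
To check whether connections from the old connector pod really stay open on the Cortex side after a restart, something like the sketch below, run on the Cortex host or inside the Cortex pod, might help; it simply counts TCP connections on the Cortex listen port, grouped by state (9001 is an assumption, adjust to your setup, and psutil may need root to see other processes' sockets):

```python
# Sketch: count TCP connections on the Cortex listen port, grouped by state, to
# see whether connections from a restarted client are ever released.
# Assumes Cortex listens on 9001 and psutil is installed (pip install psutil).
import time
from collections import Counter

import psutil

CORTEX_PORT = 9001  # adjust if Cortex listens elsewhere

for _ in range(20):
    conns = [c for c in psutil.net_connections(kind="tcp")
             if c.laddr and c.laddr.port == CORTEX_PORT]
    print(time.strftime("%H:%M:%S"), dict(Counter(c.status for c in conns)))
    time.sleep(30)
```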

Another issue is that we keep seeing Cortex go OOM after 2 hours of continuous running.