TheHive-Project / TheHive

TheHive: a Scalable, Open Source and Free Security Incident Response Platform
https://thehive-project.org
GNU Affero General Public License v3.0

[Bug] - high CPU consumption #2315

Open andreyglauzer opened 2 years ago

andreyglauzer commented 2 years ago

Request Type

Bug

Work Environment

| Question | Answer |
|---|---|
| OS version (server) | Oracle Linux |
| OS version (client) | Windows 10 |
| Virtualized Env. | False |
| Dedicated RAM | 64 GB |
| vCPU | 20 |
| TheHive version / git hash | 4.1.16 |
| Package Type | RPM |
| Database | BerkeleyDB |
| Index type | Lucene |
| Attachments storage | Local |
| Browser type & version | Chrome |

Problem Description

Our team handles a high volume of alerts, which are opened in TheHive via the API. We have also built several automations to merge alerts into cases, so API searches are constant as well.

We have a total of 15 analysts using the platform simultaneously, and at times TheHive consumes all of the server's CPUs and the platform becomes inaccessible until I terminate the TheHive process with kill and start the service again.

Steps to Reproduce

I noticed that merging alerts into cases tends to consume a lot of server CPU, and this is something analysts do constantly.

But I have no proof that this is really the root cause.
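For reference, our merge automation does roughly the following (a simplified sketch only: the host, API key and IDs are placeholders, and I'm assuming TheHive 4's alert-merge endpoint at POST /api/alert/{alertId}/merge/{caseId}):

    import requests

    # Placeholders: adjust to your own instance.
    THEHIVE_URL = "http://thehive.local:9000"
    API_KEY = "***"
    HEADERS = {"Authorization": f"Bearer {API_KEY}"}

    def merge_alert_into_case(alert_id: str, case_id: str) -> None:
        """Merge one alert into an existing case (assumed v0 merge endpoint)."""
        resp = requests.post(
            f"{THEHIVE_URL}/api/alert/{alert_id}/merge/{case_id}",
            headers=HEADERS,
            timeout=30,
        )
        resp.raise_for_status()

    # Example usage: merge a batch of triaged alert IDs into one case.
    for alert_id in ["~123456", "~123464"]:
        merge_alert_into_case(alert_id, "~98765")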

edwardrixon commented 2 years ago

I am having the same issue with my instance of TheHive.

MDB4241 commented 2 years ago

My org also experiences this issue. Similarly spec'd system. (14 vCPU, 64 GB)

backloop-biz commented 2 years ago

Same here when accessing a case or closing it. We usually have 60-70 observables per case, with a total of 11k cases (~100 open). I'm wondering about the feature that checks for related cases by observable; could that be a factor?

andreyglauzer commented 2 years ago

I migrated the database from BerkeleyDB to Cassandra, and I'm seeing great results.

I'll run a few weeks of testing and report back.

MDB4241 commented 2 years ago

> Same here when accessing a case or closing it. We usually have 60-70 observables per case, with a total of 11k cases (~100 open). I'm wondering about the feature that checks for related cases by observable; could that be a factor?

You should see a performance increase if you use the 'ignoreSimilarity' option on non-critical case artifacts. This was a useful edit for my organization and it reduces the impact of rendering a case.
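
A rough sketch of how we set it via the API (the host, API key and observable ID are placeholders; I'm assuming the v0 observable-update endpoint PATCH /api/case/artifact/{id} accepts an ignoreSimilarity field):

    import requests

    THEHIVE_URL = "http://thehive.local:9000"  # placeholder
    API_KEY = "***"                            # placeholder

    def ignore_similarity(observable_id: str) -> None:
        """Flag one observable so related-case similarity lookups skip it."""
        resp = requests.patch(
            f"{THEHIVE_URL}/api/case/artifact/{observable_id}",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"ignoreSimilarity": True},
            timeout=30,
        )
        resp.raise_for_status()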

> I migrated the database from BerkeleyDB to Cassandra, and I'm seeing great results.
>
> I'll run a few weeks of testing and report back.

Glad you've found a potential resolution! Unfortunately, we are already on Cassandra :(

We've even increased resources to 40 vCPU to troubleshoot and the problem persists. We have a single host with TheHive, Cortex, Cassandra & ElasticSearch. Perhaps separating these services into dedicated hosts will yield better performance.

andreyglauzer commented 2 years ago

> Same here when accessing a case or closing it. We usually have 60-70 observables per case, with a total of 11k cases (~100 open). I'm wondering about the feature that checks for related cases by observable; could that be a factor?
>
> You should see a performance increase if you use the 'ignoreSimilarity' option on non-critical case artifacts. This was a useful edit for my organization and it reduces the impact of rendering a case.
>
> I migrated the database from BerkeleyDB to Cassandra, and I'm seeing great results. I'll run a few weeks of testing and report back.
>
> Glad you've found a potential resolution! Unfortunately, we are already on Cassandra :(
>
> We've even increased resources to 40 vCPU to troubleshoot and the problem persists. We have a single host with TheHive, Cortex, Cassandra & ElasticSearch. Perhaps separating these services into dedicated hosts will yield better performance.

I've noticed some analysts using "Stats", which triggers a time-consuming search that is run frequently.

I removed this option from the frontend.

One thing I've also noticed is that large descriptions put a lot of load on the server, I believe during the conversion. We are avoiding very long descriptions.

backloop-biz commented 2 years ago

> I migrated the database from BerkeleyDB to Cassandra, and I'm seeing great results.
>
> I'll run a few weeks of testing and report back.

Are there any docs on how to perform this migration?

andreyglauzer commented 2 years ago

> Are there any docs on how to perform this migration?

There is no bulk migration path for this; I had to create a new instance and re-create all cases and alerts via the API against the Cassandra-backed database.
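
Roughly, the re-creation looked like this (a simplified sketch only: URLs, keys and the field mapping are placeholders, pagination, observables and attachments are omitted, and I'm assuming the v1 query endpoint on the old instance and the v0 alert-creation endpoint on the new one):

    import requests

    OLD_URL, OLD_KEY = "http://old-thehive:9000", "***"  # BerkeleyDB-backed instance (placeholder)
    NEW_URL, NEW_KEY = "http://new-thehive:9000", "***"  # Cassandra-backed instance (placeholder)

    def list_alerts(url: str, key: str) -> list:
        """Pull alerts from the source instance (assumed v1 query endpoint, no pagination)."""
        resp = requests.post(
            f"{url}/api/v1/query",
            headers={"Authorization": f"Bearer {key}"},
            json={"query": [{"_name": "listAlert"}]},
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()

    def create_alert(url: str, key: str, alert: dict) -> None:
        """Re-create one alert on the target instance (assumed v0 alert endpoint)."""
        fields = {k: alert[k] for k in ("type", "source", "sourceRef", "title", "description", "severity") if k in alert}
        resp = requests.post(
            f"{url}/api/alert",
            headers={"Authorization": f"Bearer {key}"},
            json=fields,
            timeout=60,
        )
        resp.raise_for_status()

    for alert in list_alerts(OLD_URL, OLD_KEY):
        create_alert(NEW_URL, NEW_KEY, alert)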

Cyp-her commented 1 year ago

We are having the same issue as described above. Giving TheHive more and more resources didn't resolve it. Disabling statistics mostly solved the issue for us, but it is really strange.

romarito90 commented 1 year ago

@andreyglauzer Hello, what happened with the changes you made to your TheHive and Cortex setup? What do you recommend doing?

baonq-me commented 1 year ago

I have the same issue with a VM that has 16 vCPU and 48 GB RAM.

Taragos commented 1 year ago

> We are having the same issue as described above. Giving TheHive more and more resources didn't resolve it. Disabling statistics mostly solved the issue for us, but it is really strange.

Where can you disable the statistics for the frontend?

baonq-me commented 1 year ago

I'm having the same issue on a physical server with 2x Xeon 4210. When I hit the "Stats" button, CPU consumption goes straight to 100% and the memory consumption of TheHive reported by systemd is about ~30 GB.

bhjella-awake commented 5 months ago

I am having the same issue... it will eventually use all CPUs at 100% and then just stop responding. I have to kill the process to get it to work again.

- TheHive: 8 vCPU, 32 GB RAM
- 3x Elasticsearch: 8 vCPU, 32 GB RAM each
- 3x Cassandra: 8 vCPU, 32 GB RAM each

bhjella-awake commented 5 months ago

From observations... Elasticsearch gets to around 600% CPU utilization, then drops, then TheHive gets to 300% and stays there. This happens again: Elasticsearch at 600%, then TheHive jumps to 600% about 20 minutes later. Then Elasticsearch hits 600% again, TheHive maxes out at 800%, and everything freezes.

It feels like there is some thread that doesn't time out and just spins forever.

baonq-me commented 5 months ago

> From observations... Elasticsearch gets to around 600% CPU utilization, then drops, then TheHive gets to 300% and stays there. This happens again: Elasticsearch at 600%, then TheHive jumps to 600% about 20 minutes later. Then Elasticsearch hits 600% again, TheHive maxes out at 800%, and everything freezes.
>
> It feels like there is some thread that doesn't time out and just spins forever.

There is a workaround: limit the number of CPU cores Elasticsearch can use. By default, Elasticsearch uses all available CPUs.

To limit the CPU cores used, add this line to elasticsearch.yml:

    node.processors: 4    # allow 4 CPUs to be used

bhjella-awake commented 5 months ago

Thanks, but the issue isn't that it's using CPU; it's that there is some thread that never ends and constantly consumes the process. I have narrowed it down to when a specific user is using it. I'm going to try to determine what he is doing that causes TheHive to just churn CPU. It normally sits around 100-200% CPU when in use during the work day, except for this one user.

bhjella-awake commented 5 months ago

Found out the problem on my side. It was a user using the Stats button on the cases page.

I have implemented rules on my Apache reverse proxy to return 401 for those stats API requests:

        RewriteEngine On

        # Block query string name=case-by-tags-stats
        RewriteCond %{QUERY_STRING} name=case-by-tags-stats [NC]
        RewriteRule ^ - [R=401,L]

        # Block query string name=case-by-status-stats
        RewriteCond %{QUERY_STRING} name=case-by-status-stats [NC]
        RewriteRule ^ - [R=401,L]

        # Block query string name=case-by-resolution-status-stats
        RewriteCond %{QUERY_STRING} name=case-by-resolution-status-stats [NC]
        RewriteRule ^ - [R=401,L]