TheHive-Project / TheHive

TheHive: a Scalable, Open Source and Free Security Incident Response Platform
https://thehive-project.org
GNU Affero General Public License v3.0

[Bug] Continued performance issues after upgrade to 4.1.1 #1896

Closed jhk70 closed 3 years ago

jhk70 commented 3 years ago

Request Type

Bug

Work Environment

Question                      Answer
OS version (server)           Ubuntu
OS version (client)           18.04
TheHive version / git hash    4.1.1 (docker image 4.1.1-2
Package Type                  Docker
Browser type & version        Various

Problem Description

After upgrading from 4.0.5-1 to 4.1.0 and then 4.1.1:

  1. Audit entries don't show in the application "live stream" view.
  2. I get the familiar "AuditSrv" error after a while.
  3. the "Data Index Status" section of the "Platform Status" page does not load (i.e. user session times out before it loads). This was consistent behaviour for 4.1.0 and 4.1.1. The Audit table has 1,265,475 entries.

Steps to Reproduce

  1. Upgrade TheHive as described here.
  2. Configure a local Lucene index.
  3. Start the server.
  4. Use the server.

Complementary information

Other observations / debug actions:

  1. During initial indexing, there were a number of "org.janusgraph.diskstorage.TemporaryBackendException: Temporary failure in storage backend" errors. Removing MAX_HEAP_SIZE and HEAP_NEWSIZE settings on cassandra removed these.
  2. During the initial period after the upgrade, there was evidence of memory exhaustion. More RAM was added to the host, and TheHive was given 16g via -e JAVA_OPTS='-Xms16g -Xmx16g'.
  3. Without the "Platform Status" page, I have been able to reindex with curl. I have re-run the command below for each index, and the logs show that each run completes successfully:

    curl -k "https://<host>:9000/api/v1/admin/index/Case/reindex" -H 'Authorization: Bearer *authwibble*'

  4. Snippets from the Audit reindex logs:

    Mar 25 21:39:52 hivehost01 docker[26287]: [info] o.t.s.m.Database [00000020|] Reindex job is running: 1265475 record(s) indexed
    Mar 25 21:39:53 hivehost01 docker[26287]: [info] o.j.g.d.m.ManagementSystem [|] Index update job successful for [AuditRequestidMainaction]
    Mar 25 21:39:53 hivehost01 docker[26287]: [info] o.t.s.m.Database [00000020|] Reindex job is finished
    Mar 25 21:47:59 hivehost01 docker[26287]: [info] o.t.s.m.Database [00000020|] Reindex job is running: 0 record(s) indexed
    Mar 25 21:48:00 hivehost01 docker[26287]: [info] o.t.s.m.Database [00000020|] Reindex job is running: 0 record(s) indexed
    Mar 25 21:48:01 hivehost01 docker[26287]: [info] o.j.g.o.j.IndexRepairJob [|] Found index Audit
    Mar 25 21:48:01 hivehost01 docker[26287]: [info] o.t.s.m.Database [00000020|] Reindex job is running: 0 record(s) indexed
    Mar 25 21:48:02 hivehost01 docker[26287]: [info] o.j.g.d.m.ManagementSystem [|] Index update job successful for [Audit]
    Mar 25 21:48:02 hivehost01 docker[26287]: [info] o.t.s.m.Database [00000020|] Reindex job is finished
  5. Our implementation had been "misusing" tags (per the 4.1.0 release blog) and had some long tags containing links to raw alerts etc. This was evidenced by a 6-second load time on /api/v1/query?name=list-tags. I have deleted these tags from the "Custom Tags" view. Is it possible that something in the Audit content could be causing the performance issue? Is it possible to truncate or compact the Audit table?
  6. Probably unrelated but I see this on start of the server: Mar 25 21:27:40 hivehost01 docker[26287]: [warn] c.d.d.c.RequestHandler [|] Query '[4 bound values] SELECT column1,value,writetime(value) AS writetime,ttl(value) AS ttl FROM thehive.graphindex WHERE key=:key AND column1>=:sliceStart AND column1<:sliceEnd LIMIT :maxRows;' generated server side warning(s): Read 947 live rows and 5788 tombstone cells for query SELECT * FROM thehive.graphindex WHERE key = 022689a05461e7 AND column1 >= 00 AND column1 < ff LIMIT 5000; token -8419547459570797906 (see tombstone_warn_threshold)
  7. I have multiple times deleted & reconfigured the index. After restart (and before index), the "platform status" page loads (all indexes = "ERROR"). After I click "Reindex" on Audit, the indexing completes and the same performance issue is present. I can then no longer refresh / view the Index Status section of the Platform Status page.
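
For anyone wanting to repeat the per-index reindex from observation 3 without the "Platform Status" page, here is a minimal Python sketch of the same request. It only mirrors the curl command quoted above (same URL shape, same Bearer header, certificate checks disabled like curl -k). The function names, the default port argument and the INDEX_NAMES tuple are hypothetical; only "Case" and "Audit" appear in this report, so any other index names have to be taken from your own instance. Whether the endpoint returns before or after the reindex job finishes is not established here.

```python
import ssl
import urllib.request

# Only these two index names appear in this report; extend the tuple with the
# names shown by your own instance (assumption: they follow the same URL shape).
INDEX_NAMES = ("Case", "Audit")


def build_reindex_url(host: str, index_name: str, port: int = 9000) -> str:
    """Build the reindex URL in the same shape as the curl command above."""
    return f"https://{host}:{port}/api/v1/admin/index/{index_name}/reindex"


def trigger_reindex(host: str, index_name: str, api_key: str,
                    timeout: float = 30.0) -> int:
    """Send the reindex request and return the HTTP status code."""
    # Equivalent of curl -k: do not verify the server certificate.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    # No request body is supplied, so this is a GET, like the curl call.
    request = urllib.request.Request(
        build_reindex_url(host, index_name),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout, context=context) as response:
        return response.status
```

Calling trigger_reindex(host, name, api_key) for each name in INDEX_NAMES reproduces the manual loop; the log lines such as "Reindex job is finished" remain the place to confirm that each job completed.
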
nadouani commented 3 years ago

@To-om I assigned this issue to 4.1.2, but it needs investigation. Feel free to move it out of this milestone if it requires more investigation.

jhk70 commented 3 years ago

The problem is that this issue prevents a production upgrade to 4.1.1 (I didn't mention that this is a UAT instance) and leaves us stranded on 3.x. The UI is just too slow and if I have 2 or 3 analysts logged in, the CPU on the host becomes saturated.