
MISP (core software) - Open Source Threat Intelligence and Sharing Platform
https://www.misp-project.org/
GNU Affero General Public License v3.0

Bug: Events taking much longer to load after v2.4.186 update #9614

Open · mdavis332 opened this issue 3 months ago

mdavis332 commented 3 months ago

Actual behavior

Since updating to MISP v2.4.186 I've noticed far more frequent 504 Gateway Timeouts when loading Events at /events/view/1234. Upon investigation, this occurred only on my instances deployed in AWS behind ALBs, which have a default idle timeout of 60 seconds.

I went back through my ALB logs and noted relatively few ELB 5xx responses prior to the 2.4.186 update. After this update, in both dev and prod environments, I see them frequently. It happens on events with much higher numbers of attributes (>50-60k attributes, such as events populated from a single OSINT feed). I see the same lengthy load time in a local Docker container as well, but with no ALB idle timeout in front of it, the page does eventually finish loading.

I can, of course, adjust my ALB idle timeout upward. However, this was never an issue prior to v2.4.186, so I'm concerned something may have changed about the db or how it's accessed that is creating these much longer page load times (decreased performance).
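
For reference, bumping the ALB idle timeout as a stopgap is a one-liner with boto3 (a minimal sketch; the load balancer ARN and the 300-second value are placeholders, not settings from my environment):

```python
import boto3

# Assumes AWS credentials/region are already configured in the environment.
elbv2 = boto3.client("elbv2")

# Placeholder ARN; substitute the ALB fronting your MISP instance.
ALB_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/misp-alb/abcdef1234567890"

# Raise the idle timeout (default 60 s) so slow event pages aren't cut off with a 504.
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn=ALB_ARN,
    Attributes=[{"Key": "idle_timeout.timeout_seconds", "Value": "300"}],
)
```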

Expected behavior

Viewing the same event should take relatively the same amount of time between versions (within reason).

Steps to reproduce

  1. Populate an event on a MISP instance older than v2.4.186
  2. Load it multiple times (https://misp.domain.tld/events/view/1234) and calculate the average response time (e.g. with the timing sketch below)
  3. Update MISP to v2.4.186
  4. Repeat Step 2 and observe whether it takes significantly longer
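
A rough sketch of the timing measurement from step 2 (using the requests library; the URL and API key are placeholders, not an official reproduction script):

```python
import time
import requests

# Placeholder values; substitute your instance URL and a valid auth key.
URL = "https://misp.domain.tld/events/view/1234"
HEADERS = {"Authorization": "YOUR_API_KEY", "Accept": "application/json"}

timings = []
for _ in range(5):
    start = time.monotonic()
    r = requests.get(URL, headers=HEADERS, timeout=600)
    r.raise_for_status()
    timings.append(time.monotonic() - start)

print(f"avg response time: {sum(timings) / len(timings):.1f}s over {len(timings)} requests")
```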

Version

2.4.186

Operating System

CentOS

Operating System version

Stream8

PHP version

7.4

Browser

Chromium, Firefox

Browser version

No response

Relevant log output

No response

Extra attachments

No response

mdavis332 commented 3 months ago

As a bit more evidence that the database is being hit with more requests, here is my AWS RDS CPU credit balance over the last 4 weeks. I annotated the date when we upgraded from v2.4.184 to v2.4.186 (2024-03-05 at approx 20:37 UTC).

[screenshot: AWS RDS CPU credit balance over the last 4 weeks, annotated with the upgrade date]

Average network receive throughput to the DB and its CPU utilization have tripled since that point. Average network transmit throughput from the DB back to clients has roughly doubled.

Read and Write latency look mostly the same.
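
For anyone wanting to pull the same RDS metrics programmatically rather than from the console, a minimal boto3 sketch (the DB instance identifier and the date window are placeholders):

```python
from datetime import datetime, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder DB instance identifier and a window around the upgrade date.
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="CPUCreditBalance",  # also useful: CPUUtilization, NetworkReceiveThroughput, ReadIOPS, WriteIOPS
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "misp-db"}],
    StartTime=datetime(2024, 2, 20, tzinfo=timezone.utc),
    EndTime=datetime(2024, 3, 19, tzinfo=timezone.utc),
    Period=3600,
    Statistics=["Average"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2))
```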

github-germ commented 3 months ago

I've also reported a performance hit with >= 2.4.186. My very loose hypothesis is that it's related to db queries introduced by the new Analyst Notes feature.

Hestat commented 3 months ago

I'd like to report seeing similar issues on a MISP instance I work on. We've seen spikes in both CPU and RAM utilization; at times it has reached levels resulting in memory exhaustion and severely degraded application performance.

github-germ commented 3 months ago

Some good news... I was talking with @iglocska last week. He is aware, has a prelim fix, and will dig a bit more soon.

adulau commented 3 months ago

Maybe you want to try this https://github.com/MISP/MISP/commit/0fb58cff4466f2a685063932e99e0ea7bea28280 - it's already in 2.4 branch

mdavis332 commented 3 months ago

> Maybe you want to try this 0fb58cf - it's already in 2.4 branch

I tested that commit's changes in a local instance using an Event with about 100k attributes:

  - before the changes: timed out after trying to load the event for 5 minutes
  - after the changes: took approx 12 seconds to render the same event

I'm looking forward to the commit getting a bit more testing and making it into 2.4.188 :)

mdavis332 commented 3 months ago

Got v2.4.188 in production, and event load times and CPU utilization are much better, but swap usage and freeable memory continue to be worse than prior to v2.4.186. Not sure what else changed in that version, but general MISP usability (logging in, loading the Events page, querying the restSearch API, etc.) still continues to be much less performant than prior to 2.4.186.

github-germ commented 3 months ago

The issue is related to the new Analyst Notes feature. @iglocska and I have been sharing symptoms and he has a fix in mind.

mdavis332 commented 3 months ago

Thanks for the quick reply @github-germ. It appears there may have been another change in v2.4.188 that introduced new issues I'm seeing. I say that because I believe Analyst Notes was added in v2.4.186 and I'm observing new performance issues in v2.4.188.

From Mar-27 to Apr-01 I was running v2.4.186 (with decreased performance from increased CPU utilization on DB server). However, on Apr-01 I upgraded to v2.4.188.

[screenshot: db server CPU utilization and total IOPS before and after the v2.4.188 upgrade]

While db server CPU utilization went back down, you can see that total db IOPS spiked and settled much higher than in prior versions (including v2.4.186).

Still possibly Analyst Notes related, but this leads me to believe v2.4.188 newly introduced another performance issue that increased total IOPS.

iglocska commented 3 months ago

Interesting! I am not sure what new bottleneck could have been introduced by 2.4.188, beyond the fact that resolving some other performance issues now allows data to be fetched successfully that couldn't be fetched before.

I'd be curious whether the high IOPS persist; I'll dig around to see if we can see the same on our side.

iglocska commented 3 months ago

Could you check if you can identify a process causing that? We had something funky happen with a zombie background worker.
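
Something along these lines could help spot it (a rough sketch using psutil, run on the MISP app server; the worker name patterns are just guesses and may need adjusting to how your workers are launched):

```python
import psutil

# Rough heuristic: list PHP/Cake background worker processes and flag zombies.
# The substrings below are assumptions, not an official list of MISP worker names.
PATTERNS = ("cake", "resque", "worker", "php")

for proc in psutil.process_iter(["pid", "ppid", "status", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if not any(p in cmdline.lower() for p in PATTERNS):
        continue
    flag = " <-- zombie" if proc.info["status"] == psutil.STATUS_ZOMBIE else ""
    print(f"pid={proc.info['pid']} ppid={proc.info['ppid']} status={proc.info['status']} {cmdline}{flag}")
```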

mdavis332 commented 3 months ago

Hi @iglocska,

> Could you check if you can identify a process causing that? We had something funky happen with a zombie background worker.

This performance pattern on the db server (MariaDB RDS in AWS) persists even beyond container restarts, and seems to be present across both my dev and prod envs (which are almost identical). In addition to the increase in both read and write IOPS, I noted a coinciding decrease in free memory and increase in swap usage (red circles indicate when the upgrade from 2.4.186 to 2.4.188 took place):

[screenshots: freeable memory and swap usage on the db server, with red circles marking the 2.4.186-to-2.4.188 upgrade]

I'm no DBA, but my best guess is that there's a large enough increase in memory usage by some query (or queries) that RAM is getting overwhelmed, paging to swap, and eventually causing more disk access, which increases IOPS. Perhaps it's multiple queries, since both read AND write IOPS have increased.
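
If it helps narrow that down, this is the kind of check I plan to run against the RDS instance (a rough sketch with pymysql; connection details are placeholders, and the counters are just the usual memory-pressure suspects, nothing MISP-specific):

```python
import pymysql

# Placeholder connection details for the RDS instance.
conn = pymysql.connect(host="misp-db.example.rds.amazonaws.com", user="misp", password="CHANGEME", database="misp")

# Counters that hint at memory pressure: buffer pool misses going to disk,
# and temporary tables spilling to disk instead of staying in RAM.
CHECKS = [
    "SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size'",
    "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads'",          # reads that missed the buffer pool
    "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests'",
    "SHOW GLOBAL STATUS LIKE 'Created_tmp_disk_tables'",
    "SHOW GLOBAL STATUS LIKE 'Created_tmp_tables'",
]

with conn.cursor() as cur:
    for query in CHECKS:
        cur.execute(query)
        for name, value in cur.fetchall():
            print(f"{name} = {value}")
conn.close()
```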

[screenshot: read and write IOPS spikes over time]

IOPS spikes appear to be higher and more frequent after my upgrade from v2.4.186 to v2.4.188.

> I'd be curious whether the high IOPS persist; I'll dig around to see if we can see the same on our side.

I just captured the above screenshots after multiple container restarts over the past day or two. Indeed, the swap usage on the db server continues to be a concern for me. The peaks coincide with hourly runs we do (and have done for months) with PyMISP using the following func calls:
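
(Our exact script isn't shown here; purely as a hypothetical illustration of the general shape of those hourly PyMISP runs, with placeholder URL, key, and search parameters:)

```python
from pymisp import PyMISP

# Placeholder URL/key; the search parameters below are illustrative, not our exact job.
misp = PyMISP("https://misp.domain.tld", "YOUR_API_KEY", ssl=True)

# Hourly pull of recently published events, then a search over each event's attributes.
events = misp.search(controller="events", publish_timestamp="1h", pythonify=True)
for event in events:
    attributes = misp.search(controller="attributes", eventid=event.id, pythonify=True)
    print(event.info, len(attributes))
```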

I'll continue to keep an eye on my environments but I'd be interested if you're able to turn anything up.

iglocska commented 3 months ago

Very interesting. We don't seem to see the same pattern on our instances, but we will monitor it more closely. I am currently working on some changes that should lower the db access burden for certain actions, but the big bottlenecks I'm seeing currently are in UI usage, less so the API. That increase in IOPS you are seeing is very alarming indeed; I wonder where it comes from.

mdavis332 commented 3 months ago

Thinking about non-standard configs I'm running that may contribute to a difference in what we're seeing, the only thing I can think of is that my instances are running the NoACL Correlation Engine (enabled via /correlations/switchEngine/NoAcl since we're not worried about multi-org/community sharing/permissions).

Not sure it's relevant since I have no data suggesting it's a culprit, but thought I'd mention it just in case.

mdavis332 commented 3 months ago

Just opened #9663 on errors our instances are getting that seem to be db-related. Possibly related?

iglocska commented 1 month ago

@mdavis332 - how is performance since the latest releases for you? Still seeing the increases?

mdavis332 commented 1 month ago

> @mdavis332 - how is performance since the latest releases for you? Still seeing the increases?

Thank you for checking in @iglocska. CPU utilization got better in v2.4.188 but swap and memory usage continued to be an issue, even after upgrading to v2.4.189. I relented about 3 weeks ago and just doubled the memory on the db servers since we had prod instances continuing to see slowness.

The doubling of memory helped address our immediate issues, but if I could go back in time I would just stay at v2.4.184 =/ To be fair, I haven't updated to the latest (v2.4.192) yet for a few reasons, one of which is that we ran into another new issue starting in v2.4.189: errors when reading from and writing to Redis.

I had opened an issue elsewhere for that, since not everyone here may use Redis: https://github.com/NUKIB/misp/issues/157. Anyhow, that's deviating from this Issue's scope, so I'll stop there :)

As for the performance Issue originally noted here, it's hard for me to do an apples-to-apples comparison now since I've changed the RAM variable. This can be closed at this point if you think there's nothing more to be done on the MISP dev side.