OpenCTI-Platform / opencti

Open Cyber Threat Intelligence Platform
https://opencti.io

Very low consumer ACK rate #4936

Closed MaxwellDPS closed 4 months ago

MaxwellDPS commented 11 months ago

Description

Having a very hard time loading data into the platform: any upload takes a long time to process. 50K items in the queue can take > 3 hours to drain, with a sustained ack rate of ~0.7/s. There is no manageable way to import this data before it begins to backlog.

Environment

  1. OS (where OpenCTI server runs): CentOS stream 9
  2. OpenCTI version: 5.11.14
  3. OpenCTI client: Frontend
  4. Other environment details:
       15x Platform (No Web) @ 1c 6Gb (Reserved)
       25x Worker            @ 1c 4Gb (Reserved)
       k8s Clustered deployment
    
       Elastic Cluster: 3x Data Node @ 16Gb 4c per 
       RabbitMQ Cluster: 3x Node @ 4Gb 1c per 
       Redis: 1x 16Gb 3
    
       Host Networking: 25Gbps x 3 Hosts

None of these components appear resource-saturated. I have scaled out as far as I can, and the platform is still barely able to import data.

Reproducible Steps

Steps to create the smallest reproducible scenario: N/A

Expected Output

An ack rate high enough to import data without building up a backlog.

Actual Output

Additional information

Resource usage

NAME                                                    CPU(cores)   MEMORY(bytes)   
elastic-es-data-0                                       1923m        14645Mi         
elastic-es-data-1                                       884m         11464Mi         
elastic-es-data-2                                       1443m        14217Mi         
elastic-es-leaders-0                                    131m         1931Mi          
elastic-es-leaders-1                                    74m          1833Mi          
elastic-es-leaders-2                                    46m          1927Mi                
minio-66787954dd-9g7sn                                  1m           180Mi           
opencti-api-59cc685c95-84qnp                            1398m        1401Mi          
opencti-api-59cc685c95-9j4vr                            502m         550Mi           
opencti-api-59cc685c95-bs6sk                            2131m        5267Mi          
opencti-api-59cc685c95-dhczm                            2272m        3755Mi          
opencti-api-59cc685c95-g6hnt                            953m         3376Mi          
opencti-api-59cc685c95-qzwjq                            389m         535Mi           
opencti-api-59cc685c95-z9t5g                            398m         543Mi           
opencti-api-59cc685c95-zkvfs                            1829m        1929Mi                   
opencti-web-7c45ccb866-66wnj                            810m         760Mi           
opencti-web-7c45ccb866-c65fk                            632m         677Mi           
opencti-web-7c45ccb866-ldcg7                            1147m        814Mi           
opencti-web-7c45ccb866-rhp5v                            698m         675Mi           
opencti-web-7c45ccb866-xmzvb                            449m         749Mi           
opencti-worker-78d7d54ff7-4j6nb                         5m           42Mi            
opencti-worker-78d7d54ff7-58wjh                         4m           42Mi            
opencti-worker-78d7d54ff7-6d6q9                         4m           42Mi            
opencti-worker-78d7d54ff7-84795                         3m           42Mi            
opencti-worker-78d7d54ff7-8txgh                         4m           42Mi            
opencti-worker-78d7d54ff7-9nchs                         5m           42Mi            
opencti-worker-78d7d54ff7-dhv56                         4m           42Mi            
opencti-worker-78d7d54ff7-ffh2b                         4m           42Mi            
opencti-worker-78d7d54ff7-m5pkk                         1m           42Mi            
opencti-worker-78d7d54ff7-nxpwf                         4m           42Mi            
opencti-worker-78d7d54ff7-p7v7j                         3m           42Mi            
opencti-worker-78d7d54ff7-qq5cl                         4m           42Mi            
opencti-worker-78d7d54ff7-tj5vx                         3m           42Mi            
opencti-worker-78d7d54ff7-w9z66                         3m           42Mi            
opencti-worker-78d7d54ff7-wclsm                         3m           42Mi            
opencti-worker-78d7d54ff7-wd7rl                         4m           42Mi            
opencti-worker-78d7d54ff7-wphtq                         4m           42Mi            
opencti-worker-78d7d54ff7-wsl49                         1m           42Mi            
opencti-worker-78d7d54ff7-wv8l5                         3m           42Mi            
opencti-worker-78d7d54ff7-xbrh4                         5m           42Mi            
rabbitmq-server-0                                       17m          767Mi           
rabbitmq-server-1                                       16m          822Mi           
rabbitmq-server-2                                       26m          683Mi           
redis-0                                                 503m         14888Mi   

Screenshots (optional)

image

nino-filigran commented 11 months ago

@MaxwellDPS I'm not sure exactly what you are looking for. What is the bug exactly? Is this a feature request to improve our ingestion performance? In the meantime, I encourage you to ask your question on our Slack, in the OpenCTI channel, so that our community might be able to help you out.

MaxwellDPS commented 11 months ago

Hey @nino-filigran!

The first thing I did was ask on Slack 🦗 🦗 🦗 https://filigran-community.slack.com/archives/CHZC2D38C/p1700517109042759

What is the bug exactly?

^^ Uploading any data takes a long time to process. 50K items in the queue can take > 3 hours to drain, with a sustained ack rate of ~0.7/s. There is no manageable way to import this data before it begins to backlog... Not sure how to be clearer here :/

The platform is flying like a Cessna in an F22 airframe; something is wrong. It has plenty of resources and is configured per the docs for a clustered deployment. I don't have a deep enough understanding of the platform's DB and Redis architecture to keep looking for bottlenecks beyond the obvious, and I see no resource contention.

I will take anything here: docs (beyond the troubleshooting guide), guidance on what to modify to debug this, etc. I'm just not sure what else y'all need from me to help.

MaxwellDPS commented 11 months ago

@nino-filigran @SamuelHassine @richard-julien same as all of these; it's been a running issue. The issue appeared in 5.7.6 for me and has persisted through 5.8.X, 5.10.X, and 5.11.X.

Can we please get some further investigation here? I'm more than willing to help debug on my end.

https://github.com/OpenCTI-Platform/opencti/issues/3513 https://github.com/OpenCTI-Platform/opencti/issues/4722 https://github.com/OpenCTI-Platform/opencti/issues/4748

MaxwellDPS commented 11 months ago

It will take 24 days and 2 hours to process 1.05M bundles

For the sake of testing I just doubled CPU and memory on everything (including ES, RMQ, Redis, and all non-connector OCTI pods). This caused no change in the ack rate.

The only thing that marginally improves the ack rate is restarting the platform, and even that is short-lived.

I have also tried adjusting Redis trimming, lock timing, and retry jitter; no tangible change in performance from that either.

Here is what this looks like in RMQ over time (spikes are restarts of the OCTI cluster)... At an average rate of 0.5/s this will take 585 hours (24 days and 2 hours) to process, which is unusable. image image
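As a back-of-the-envelope check, the drain-time figure is just queue depth divided by the sustained ack rate; a quick sketch in plain Python, using the numbers from this thread:

```python
# Back-of-the-envelope drain time: queue depth / sustained ack rate.
from datetime import timedelta

def drain_time(queue_depth: int, ack_rate_per_s: float) -> timedelta:
    """Time to empty the queue, assuming no new messages arrive in the meantime."""
    return timedelta(seconds=queue_depth / ack_rate_per_s)

print(drain_time(50_000, 0.7))     # -> 19:50:28..., i.e. ~20 hours for the 50K backlog
print(drain_time(1_050_000, 0.5))  # -> 24 days, 7:20:00, in line with the ~24-day figure above
```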


5.7.6 Upgrade

Y'all, please take a serious look at this. It has been an issue since 5.7.2, and I've been reporting it since summer.

This is when 5.7.6 rolled out, prompting the creation of https://github.com/OpenCTI-Platform/opencti/issues/3513. There is a very clear correlation between that release and the issues. image

I'm almost positive it has to do with Elastic. ES Rally shows everything healthy, so it seems like a data-structure issue. We are running well within the recommended specs and are deployed as a cluster per the recommendations and docs.

After all the work we did with y'all on the Elastic stack and the docs guidance, @richard-julien, please help 🙏

MaxwellDPS commented 10 months ago

There is an issue with the upgrade path.

Will be updating with some new metrics next week following a complete re-sync to a new platform.

The old instance (160M docs) takes on average 30-45 seconds to load the dashboard and 60+s to load a large report. The new instance with half the data (80M docs so far) loads the dashboard in ~5-15s on average and takes ~10s to load the same report.

Also worth noting: the new instance is seeing an ack rate of ~15/s over 7 days vs the old one's <1/s (tracked via Prometheus).

Both platforms are deployed to the same cluster in the exact same config
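For anyone wanting to spot-check queue depth and ack rate outside of Prometheus, a rough sketch against the RabbitMQ management API (the host and credentials below are placeholders, and the management plugin must be enabled):

```python
# Rough sketch: pull per-queue depth and ack rate from the RabbitMQ management API.
# "rabbitmq" and guest/guest are placeholders for the real host and credentials.
import requests

MGMT_URL = "http://rabbitmq:15672"
AUTH = ("guest", "guest")

for q in requests.get(f"{MGMT_URL}/api/queues", auth=AUTH, timeout=10).json():
    # message_stats/ack_details is absent until a queue has seen acks, hence the defaults.
    ack_rate = q.get("message_stats", {}).get("ack_details", {}).get("rate", 0.0)
    print(f"{q['name']:<50} depth={q.get('messages', 0):>8} ack/s={ack_rate:.2f}")
```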

MaxwellDPS commented 10 months ago

https://github.com/OpenCTI-Platform/opencti/issues/5500

This seems likely related. @SamuelHassine @richard-julien, the old database is going to be nuked in <3 weeks; if y'all want anything from it to further debug this, I will need to know soon.

richard-julien commented 10 months ago

Hi @MaxwellDPS, sorry again not to have a magic answer for this performance problem. The performance you are seeing is not what we observe on our hosted instances or with other users who have more than 1 billion docs. Don't get me wrong, 60+s to load a large report is not a normal situation, but performance is not linear and depends on the type of data, which is directly related to any custom connectors you may have developed and/or deployed. Elasticsearch index design and rollover strategy can also play a large role.

This is unfortunate, but in order to remediate the situation we need access to the data to investigate its shape and replicate the behavior internally, so we can work out possible remediations (connector code improvements, a cleanup script, ...). Of course we remain very open to signing an NDA / doing anything required to unlock this situation, get access to your data, and solve your problem.

MaxwellDPS commented 10 months ago

Hey @richard-julien, that's great to hear. I can offer access to the data, but I need a way for us to share it in a meaningful way.

Given the scope here, I really don't know where to start and am deferring to y'all.

richard-julien commented 10 months ago

We can open a sharing directory on the Filigran OneDrive. That way you should be able to share a complete snapshot of your Elasticsearch indices. Please contact me via a private Slack message to set up the transfer. Thanks

MaxwellDPS commented 9 months ago

@richard-julien @SamuelHassine I think I may finally have a root cause. Not 100% sure, but it makes sense through and through, and it would also explain why y'all can't recreate it. Would all of these symptoms make sense if there were hiccups like TCP RSTs on the connection to the Redis server?

I am seeing a TON of these `Extending resources for long processing task` errors for Redis in the platform logs. I know we have an elevated rate of RSTs that seem to come from Istio. If the connection to Redis is lost, is it possible OCTI waits for locks to expire, and that this is artificially throttling the platform?!

Previously the assumption for the low ack rate was Elastic and a bad database. I have resolved the bad DB with a sync to a new platform, but the ack rate is still low, with bursts, which would be explainable if a lock expires.

Is there anything we can do, or can y'all check the Redis implementation for anything that may cause something like this?
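To illustrate the hypothesis (a generic sketch of TTL-based Redis locking, not OpenCTI's actual code): if the client holding a lock silently loses its connection, the lock stays in place until its TTL expires, and everyone waiting on it stalls in the meantime.

```python
# Generic TTL-lock illustration (placeholder host; not OpenCTI's implementation).
import time
import uuid
import redis

r = redis.Redis(host="redis-0", port=6379)  # placeholder host

def acquire(name: str, ttl_s: int = 30) -> str | None:
    token = str(uuid.uuid4())
    # SET NX EX only succeeds if the key does not already exist.
    return token if r.set(f"lock:{name}", token, nx=True, ex=ttl_s) else None

def release(name: str, token: str) -> None:
    # Best-effort release: only delete the key if we still own it.
    if r.get(f"lock:{name}") == token.encode():
        r.delete(f"lock:{name}")

# A waiter sits in this loop until the dropped holder's TTL runs out.
while (tok := acquire("bundle-123")) is None:
    time.sleep(1)
try:
    ...  # do the guarded work
finally:
    release("bundle-123", tok)
```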

MaxwellDPS commented 9 months ago

Seeing these Redis errors; it looks like the pub/sub clients are being killed. @richard-julien, any recommendations on this or on setting `client-output-buffer-limit`?

Client id=3675 addr=192.168.1.185:55178 laddr=192.168.7.131:6379 fd=13 name= age=2391 idle=2 flags=P db=0 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=20474 argv-mem=0 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=8 omem=37617856 tot-mem=37655672 events=rw cmd=psubscribe user=default redir=-1 resp=2 lib-name= lib-ver= scheduled to be closed ASAP for overcoming of output buffer limits.
1:M 13 Feb 2024 23:09:53.741 # Client id=4818 addr=192.168.1.185:59572 laddr=192.168.7.131:6379 fd=13 name= age=18 idle=3 flags=P db=0 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=15 omem=39031144 tot-mem=39033128 events=rw cmd=psubscribe user=default redir=-1 resp=2 lib-name= lib-ver= scheduled to be closed ASAP for overcoming of output buffer limits.
1:M 13 Feb 2024 23:10:38.476 # Client id=4838 addr=192.168.1.185:38746 laddr=192.168.7.131:6379 fd=13 name= age=22 idle=5 flags=P db=0 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=16384 rbp=0 obl=0 oll=16 omem=40456576 tot-mem=40473920 events=rw cmd=psubscribe user=default redir=-1 resp=2 lib-name= lib-ver= scheduled to be closed ASAP for overcoming of output buffer limits.
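The `omem` field in those lines is each client's output buffer in bytes (~37-40 MB here), well past Redis's default pub/sub limit of 32mb hard / 8mb soft over 60s, hence the forced disconnects. A quick diagnostic sketch with redis-py (host below is a placeholder) to watch pub/sub clients' buffer usage:

```python
# List pub/sub clients and their output-buffer usage (omem is reported in bytes).
import redis

r = redis.Redis(host="redis-0", port=6379)  # placeholder host

for client in r.client_list(_type="pubsub"):
    omem_mb = int(client.get("omem", 0)) / (1024 * 1024)
    print(f"id={client['id']} addr={client['addr']} omem={omem_mb:.1f} MB")
```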

MaxwellDPS commented 9 months ago

Setting `redis-cli config set client-output-buffer-limit "pubsub 512m 128m 0"` cleared up these logs, and the ack rate, while lower than expected, has been more stable.
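For reference, a minimal sketch of applying and verifying the same limit with redis-py (host is a placeholder, values mirror the command above). Note that `CONFIG SET` is runtime-only, so the directive also needs to land in redis.conf (or be written out with `CONFIG REWRITE`) to survive a Redis restart:

```python
# Raise and verify the pub/sub client-output-buffer-limit with redis-py.
import redis

r = redis.Redis(host="redis-0", port=6379)  # placeholder host

# Hard limit 512 MB, soft limit 128 MB, no soft-limit window (0 seconds).
r.config_set("client-output-buffer-limit", "pubsub 512mb 128mb 0")
print(r.config_get("client-output-buffer-limit"))

# CONFIG SET is not persisted; mirror the setting in redis.conf or run CONFIG REWRITE.
```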

MaxwellDPS commented 8 months ago

Julien, Samuel, I'm 99% sure this is due to network interrupts causing a pub/sub disconnect that does not reconnect properly, which in turn leads to low ack rates and odd data structure.

How can we proceed to fix this? I'm still seeing it in 5.12.33; happy to open a separate issue for that.

MaxwellDPS commented 8 months ago

Julien, there seems to be an issue that causes the ack rate to drop off. I was seeing 12+ hours of >15/s, then it just died, dropped to <0.5/s, and didn't recover until I flushed Redis.

This error started right as the ack rate dropped and directly corresponds to an ingest-rate drop in Elastic.

This error seems directly related to the ack rate drop-off; it also looks like locking or pub/sub is playing a factor.

Julien -- please advise 🙏

errors{}.message = Error in store update event 
errors{}.message = Update event must contains a valid previous patch
{
    "category": "APP",
    "errors": [
        {
            "attributes": {
                "genre": "TECHNICAL",
                "http_status": 500
            },
            "message": "Error in store update event",
            "name": "DATABASE_ERROR",
            "stack": "DATABASE_ERROR: Error in store update event\n    at error (/opt/opencti/build/src/config/errors.js:8:10)\n    at DatabaseError (/opt/opencti/build/src/config/errors.js:58:48)\n    at storeUpdateEvent (/opt/opencti/build/src/database/redis.ts:504:11)\n    at updateAttributeMetaResolved (/opt/opencti/build/src/database/middleware.js:1966:27)\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at addStixRefRelationship (/opt/opencti/build/src/domain/stixRefRelationship.js:70:3)"
        },
        {
            "attributes": {
                "genre": "BUSINESS",
                "http_status": 500
            },
            "message": "Update event must contains a valid previous patch",
            "name": "UNSUPPORTED_ERROR",
            "stack": "UNSUPPORTED_ERROR: Update event must contains a valid previous patch\n    at error (/opt/opencti/build/src/config/errors.js:8:10)\n    at UnsupportedError (/opt/opencti/build/src/config/errors.js:83:51)\n    at buildStixUpdateEvent (/opt/opencti/build/src/database/redis.ts:467:11)\n    at buildUpdateEvent (/opt/opencti/build/src/database/redis.ts:493:10)\n    at storeUpdateEvent (/opt/opencti/build/src/database/redis.ts:498:21)\n    at updateAttributeMetaResolved (/opt/opencti/build/src/database/middleware.js:1966:27)\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at addStixRefRelationship (/opt/opencti/build/src/domain/stixRefRelationship.js:70:3)"
        }
    ],
    "inner_relation_creation": 0,
    "level": "warn",
    "message": "Error in store update event",
    "operation": "StixRefRelationshipAdd",
    "operation_query": "mutation StixRefRelationshipAdd($input:StixRefRelationshipAddInput!){stixRefRelationshipAdd(input:$input){id standard_id entity_type parent_types}}",
    "size": 302,
    "time": 224,
    "timestamp": "2024-02-28T19:18:20.055Z",
    "type": "WRITE_ERROR",
    "user": {
        "applicant_id": "6a4b11e1-90ca-4e42-ba42-db7bc7f7d505",
        "call_retry_number": "1",
        "group_ids": [],
        "ip": "192.168.6.192",
        "organization_ids": [],
        "socket": "query",
        "user_id": "6a4b11e1-90ca-4e42-ba42-db7bc7f7d505",
        "user_metadata": {}
    },
    "version": "5.12.33"
}

MaxwellDPS commented 8 months ago

Samuel // Julien - to give a solid example: after running fine since ~12 PM PT on 2/27, at 6:32 AM PT on 2/28 there was a drop in ack rate to almost 0.

When this happened, there were the following symptoms:

All of this is in line with https://github.com/OpenCTI-Platform/opencti/issues/3513 - 5.8, 5.9, 5.10, 5.11, and 5.12 have not fixed this.

Platform metrics showing a spike in bundle lock errors and a drop in bundle counts

image image

Seeing a spike in errors logged by the platform (notably a lot of lock and update errors)

image image image image

RabbitMQ showing ACK rate drops

image image

image

Redis keys dropping

image

Elastic showing increased latency

image image image

Where can I continue to look, and how can I get the ack rate back up? @richard-julien (sorry for the @, but it's been a month, sir)

richard-julien commented 8 months ago

Hi @MaxwellDPS, I still think this kind of behavior (too many locks, performance drops, Redis disconnects due to large events) can be caused by some of your data. We know that we are quite slow to ingest cases where a single source of messages tries to create a large number of connected elements.

I am personally working on a solution to improve that case, but for now the best way to mitigate/work around the behavior is to shard the data, either by changing the approach of the custom connectors you may have developed or by deploying the same connector on different time ranges. I know this situation is frustrating, but solving it will really require digging into your data and your connectors.

Some of the errors you have look unrelated to this lock/perf problem (like `Update event must contains a valid previous patch`). Are you able to share your error logs?

MaxwellDPS commented 5 months ago

https://github.com/OpenCTI-Platform/opencti/issues/6285 Possibly related

MaxwellDPS commented 4 months ago

Wanted to update and close this out: these ongoing rate issues were part of #7436.

@richard-julien I really appreciate the help with this one! 🥇