@MaxwellDPS I'm not sure exactly what you are looking for. What is the bug exactly? Is this a feature request to improve our ingestion performance? In the meantime, I encourage you to ask your question on our Slack, using the OpenCTI channel, so that our community might be able to help you out.
Hey @nino-filigran!
The first thing I did was ask on Slack: https://filigran-community.slack.com/archives/CHZC2D38C/p1700517109042759
What is the bug exactly?
^^ Uploading any data takes a long time to process. 50K items in the queue can take more than 3 hours, and the sustained ack rate sits around ~0.7/s. There is no manageable way to import this data before it begins to backlog... Not sure how to be clearer here :/
The platform is flying like a Cessna in an F22 airframe; something is wrong. It has plenty of power and is configured per the docs for a clustered deployment. I don't have a deep enough understanding of the platform's DB and Redis architecture to keep looking for bottlenecks beyond the obvious, and I see no resource contention.
I will take anything here: docs (something not already in the troubleshooting guide), guidance on what to try or modify to debug, etc. Not sure what else y'all need from me to help.
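For context, this is roughly how I'm checking for resource contention on my side (a sketch; the `opencti` namespace is a placeholder for wherever the platform and its dependencies run, and `kubectl top` assumes metrics-server is installed):

```bash
# Per-pod and per-node usage for OCTI, ES, Redis, and RMQ
kubectl top pods -n opencti --sort-by=cpu
kubectl top nodes

# Recent OOM kills, evictions, restarts, etc.
kubectl get events -n opencti --sort-by=.lastTimestamp | tail -n 20
```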
@nino-filigran @SamuelHassine @richard-julien same as all of these; it's been a running issue. It appeared in 5.7.6 for me and has stayed through 5.8.X, 5.10.X, and 5.11.X.
Can we please get some further investigation here? I'm more than willing to help debug on my end.
https://github.com/OpenCTI-Platform/opencti/issues/3513 https://github.com/OpenCTI-Platform/opencti/issues/4722 https://github.com/OpenCTI-Platform/opencti/issues/4748
For the sake of testing I just doubled CPU and memory on everything (including ES, RMQ, Redis, and all non-connector OCTI pods). This caused no change in the ack rate.
The only thing that marginally improves the ack rate is restarting the platform, and even that is short-lived.
I have also tried adjusting Redis trimming, lock timing, and retry jitter; no tangible change in performance from that either.
Here is what this looks like in RMQ over time (spikes are restarts of the OCTI cluster)... This will take 585 hours (24 days and 9 hours) to process at an average rate of 0.5/s, which is unusable.
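For the math behind that estimate (the backlog figure here is back-calculated from the 585-hour number, so treat it as illustrative):

```bash
# Drain time = backlog / sustained ack rate
BACKLOG=1053000   # approximate messages sitting in the RMQ queues (illustrative)
ACK_RATE=0.5      # sustained acks per second
echo "scale=1; $BACKLOG / $ACK_RATE / 3600" | bc   # => ~585 hours
```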
Y'all, please take a serious look at this; it has been an issue since 5.7.2 and I've been reporting it since summer.
This is when 5.7.6 rolled out, prompting the creation of https://github.com/OpenCTI-Platform/opencti/issues/3513. There is a very clear correlation between that release and these issues.
I'm almost positive it has to do with Elastic. ES Rally shows all good (the checks I've been running are below), so it seems to be about the data structure. We are running well within the bounds of the recommended specs and are deployed as a cluster per the recommendations and docs.
After all the work we did with y'all on the Elastic stack and the docs guidance, @richard-julien, please help.
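For what it's worth, these are the kinds of checks I've been running against Elastic (a sketch; `$ES` stands for the cluster URL plus any auth, and `opencti*` assumes the default index prefix):

```bash
# Indexing back-pressure: rejections or queue growth on the write thread pool
curl -s "$ES/_cat/thread_pool/write?v&h=node_name,active,queue,rejected"

# Index and shard layout: oversized indices or shard counts can drag ingestion down
curl -s "$ES/_cat/indices/opencti*?v&s=store.size:desc&h=index,pri,rep,docs.count,store.size"
```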
Will be updating with some new metrics next week following a complete re-sync to a new platform.
The old instance (160M docs) takes on average 30-45 seconds to load the dashboard and 60+ seconds to load a large report. The new instance with half the data (80M docs so far) loads the dashboard in ~5-15s on average, and the same report in ~10s.
Also want to note that the new instance is seeing an ack rate of ~15/s over 7 days vs the old one's <1/s (tracked via Prometheus).
Both platforms are deployed to the same cluster in the exact same config
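For anyone who wants to reproduce the ack-rate numbers without Prometheus, the RabbitMQ management API exposes the same data (host and credentials below are placeholders):

```bash
# Queue name, depth, and sustained ack rate per queue
curl -s -u guest:guest "http://rabbitmq:15672/api/queues" \
  | jq -r '.[] | [.name, .messages, (.message_stats.ack_details.rate // 0)] | @tsv'
```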
https://github.com/OpenCTI-Platform/opencti/issues/5500
This seems likely related. @SamuelHassine @richard-julien, this old database is gonna be nuked in <3 weeks; if y'all want anything from it to debug further, I will need to know soon.
Hi @MaxwellDPS, still sorry not to have any kind of magic answer for this performance problem. The performance you are seeing is not what we observe in our hosted instances or with other users that have more than 1 billion docs. Don't get me wrong, 60+s to load a large report is not a normal situation, but performance is not linear and depends on the type of data, which is directly related to the custom connectors you could have developed and/or deployed. Elasticsearch indices design and rollover strategy can also play a large role.
This is unfortunate, but in order to remediate this situation we need access to the data, so we can investigate the data shape, replicate this behavior internally, and work out possible remediations (connector code improvements, cleanup script, ...). Of course we are still very open to signing an NDA / doing anything required to unlock this situation, get access to your data, and solve your problem.
Hey @richard-julien, that's great to hear. I can offer access to the data, but I do need a way we can share it in a meaningful way.
Given the scope here, I really don't know where to start and am deferring to y'all.
We can open a sharing directory in the Filigran OneDrive. This way you should be able to share a complete snapshot of your Elasticsearch indices. Please contact me with a private Slack message to set up the transfer. Thanks.
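For taking the snapshot itself, something along these lines should work (a sketch; the repository name and mount path are placeholders, the path has to be listed in `path.repo` on every Elasticsearch node, and `$ES` stands for the cluster URL plus any auth):

```bash
# Register a shared filesystem snapshot repository
curl -s -X PUT "$ES/_snapshot/octi_share" -H 'Content-Type: application/json' -d '{
  "type": "fs",
  "settings": { "location": "/mnt/octi-share" }
}'

# Snapshot all OpenCTI indices (default "opencti" prefix assumed) and wait for completion
curl -s -X PUT "$ES/_snapshot/octi_share/full-$(date +%Y%m%d)?wait_for_completion=true" \
  -H 'Content-Type: application/json' -d '{ "indices": "opencti*" }'
```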
@richard-julien @SamuelHassine I think I may finally have a root cause. Not 100% sure, but it makes sense through and through, as well as explaining why y'all can't recreate it. Would all of these symptoms make sense if there were hiccups like TCP RSTs in the connection to the Redis server?
I am seeing a TON of these "Extending resources for long processing task" errors for Redis in the platform logs. I know we have an elevated rate of RSTs that seem to come from Istio; if the connection to Redis is lost, is it possible OCTI ends up waiting for locks to expire, and that is artificially throttling the platform?!
Previously the low ack rate was assumed to be Elastic and a bad database. I have resolved the bad DB with a sync to a new platform, but the ack rate is still low and comes in bursts, which would be explainable if a lock has to expire first.
Is there anything we can do on our side, or can y'all check the Redis implementation for anything that may cause something like this?
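In the meantime, the mitigation I'm testing on my side is forcing TCP keepalives on the Redis connections through the mesh, so idle pub/sub connections are less likely to get reset (a sketch; the host, namespace, and timings are assumptions for my environment, not something from the OpenCTI docs):

```bash
kubectl apply -n opencti -f - <<'EOF'
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: redis-keepalive
spec:
  host: redis.opencti.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        tcpKeepalive:
          time: 60s      # start probing after 60s of idleness
          interval: 15s  # probe every 15s
          probes: 3      # give up after 3 failed probes
EOF
```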
Seeing these Redis errors; it looks like the pub/sub clients are being killed. @richard-julien, any recommendations on this, or on setting client-output-buffer-limit?
Client id=3675 addr=192.168.1.185:55178 laddr=192.168.7.131:6379 fd=13 name= age=2391 idle=2 flags=P db=0 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=20474 argv-mem=0 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=8 omem=37617856 tot-mem=37655672 events=rw cmd=psubscribe user=default redir=-1 resp=2 lib-name= lib-ver= scheduled to be closed ASAP for overcoming of output buffer limits.
1:M 13 Feb 2024 23:09:53.741 # Client id=4818 addr=192.168.1.185:59572 laddr=192.168.7.131:6379 fd=13 name= age=18 idle=3 flags=P db=0 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=15 omem=39031144 tot-mem=39033128 events=rw cmd=psubscribe user=default redir=-1 resp=2 lib-name= lib-ver= scheduled to be closed ASAP for overcoming of output buffer limits.
1:M 13 Feb 2024 23:10:38.476 # Client id=4838 addr=192.168.1.185:38746 laddr=192.168.7.131:6379 fd=13 name= age=22 idle=5 flags=P db=0 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=16384 rbp=0 obl=0 oll=16 omem=40456576 tot-mem=40473920 events=rw cmd=psubscribe user=default redir=-1 resp=2 lib-name= lib-ver= scheduled to be closed ASAP for overcoming of output buffer limits.
Setting `redis-cli config set client-output-buffer-limit "pubsub 512m 128m 0"` cleared up these logs, and the ack rate, while still lower than expected, has been more stable.
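For anyone hitting the same thing: the Redis default is `client-output-buffer-limit pubsub 32mb 8mb 60`, and the omem values in the logs above (~36-39 MB) are over that 32 MB hard limit, which is why the clients get scheduled to be closed. A sketch of making the override persistent and keeping an eye on buffer usage (limits mirror the values I used; tune to your event sizes):

```bash
# Raise the pub/sub output buffer limit and persist it (CONFIG SET alone is lost on restart)
redis-cli config set client-output-buffer-limit "pubsub 512m 128m 0"
redis-cli config rewrite    # writes the running config back to redis.conf, if one is in use

# Watch pub/sub client output-buffer usage (omem, in bytes)
redis-cli client list | grep 'cmd=psubscribe' | grep -o 'omem=[0-9]*'
```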
Julien, Samuel, I'm 99% sure this is due to network interrupts causing a disconnect of the pub/sub connection that is not re-established properly, leading to the low ack rates and odd data structure.
How can we proceed to fix this? I'm still seeing it in 5.12.33; happy to open an issue for that.
Julien, there seems to be an issue that causes the ack rate to drop off. I was seeing 12+ hours of >15/s, then it just died, dropped to <0.5/s, and didn't recover until I flushed Redis.
This error started right as the ack rate dropped and directly corresponds to an ingest-rate drop in Elastic.
This error seems directly related to the ack-rate drop-off; it looks like locking or pub/sub is also playing a factor.
Julien -- please advise.
errors{}.message = Error in store update event
errors{}.message = Update event must contains a valid previous patch
```json
{
"category": "APP",
"errors": [
{
"attributes": {
"genre": "TECHNICAL",
"http_status": 500
},
"message": "Error in store update event",
"name": "DATABASE_ERROR",
"stack": "DATABASE_ERROR: Error in store update event\n at error (/opt/opencti/build/src/config/errors.js:8:10)\n at DatabaseError (/opt/opencti/build/src/config/errors.js:58:48)\n at storeUpdateEvent (/opt/opencti/build/src/database/redis.ts:504:11)\n at updateAttributeMetaResolved (/opt/opencti/build/src/database/middleware.js:1966:27)\n at processTicksAndRejections (node:internal/process/task_queues:95:5)\n at addStixRefRelationship (/opt/opencti/build/src/domain/stixRefRelationship.js:70:3)"
},
{
"attributes": {
"genre": "BUSINESS",
"http_status": 500
},
"message": "Update event must contains a valid previous patch",
"name": "UNSUPPORTED_ERROR",
"stack": "UNSUPPORTED_ERROR: Update event must contains a valid previous patch\n at error (/opt/opencti/build/src/config/errors.js:8:10)\n at UnsupportedError (/opt/opencti/build/src/config/errors.js:83:51)\n at buildStixUpdateEvent (/opt/opencti/build/src/database/redis.ts:467:11)\n at buildUpdateEvent (/opt/opencti/build/src/database/redis.ts:493:10)\n at storeUpdateEvent (/opt/opencti/build/src/database/redis.ts:498:21)\n at updateAttributeMetaResolved (/opt/opencti/build/src/database/middleware.js:1966:27)\n at processTicksAndRejections (node:internal/process/task_queues:95:5)\n at addStixRefRelationship (/opt/opencti/build/src/domain/stixRefRelationship.js:70:3)"
}
],
"inner_relation_creation": 0,
"level": "warn",
"message": "Error in store update event",
"operation": "StixRefRelationshipAdd",
"operation_query": "mutation StixRefRelationshipAdd($input:StixRefRelationshipAddInput!){stixRefRelationshipAdd(input:$input){id standard_id entity_type parent_types}}",
"size": 302,
"time": 224,
"timestamp": "2024-02-28T19:18:20.055Z",
"type": "WRITE_ERROR",
"user": {
"applicant_id": "6a4b11e1-90ca-4e42-ba42-db7bc7f7d505",
"call_retry_number": "1",
"group_ids": [],
"ip": "192.168.6.192",
"organization_ids": [],
"socket": "query",
"user_id": "6a4b11e1-90ca-4e42-ba42-db7bc7f7d505",
"user_metadata": {}
},
"version": "5.12.33"
}
```
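For the record, this is roughly how I'm counting that error over time from the platform's JSON logs to correlate it with the ack-rate drop (a sketch; the deployment name and namespace are placeholders for my setup):

```bash
# Occurrences of "Error in store update event" per hour over the last 48h
kubectl logs deploy/opencti-server -n opencti --since=48h \
  | jq -R -r 'fromjson? | select(.message == "Error in store update event") | .timestamp[0:13]' \
  | sort | uniq -c
```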
Samuel//Julien - To give a solid example: after running fine since ~12 PM PT 2/27, at 6:32 AM PT 2/28 there was a drop in ack rate to almost 0.
When this happened, there were the following symptoms:
All of this is in line with https://github.com/OpenCTI-Platform/opencti/issues/3513 - 5.8, 5.9, 5.10, 5.11, and 5.12 have not fixed this.
Where can I continue to look, and how can I get the ack rate back up? @richard-julien (sorry for the @, but it's been a month, sir)
Hi @MaxwellDPS, I still think that this kind of behavior (too many locks, performance drops, Redis disconnects due to large events) can be caused by some of your data. We know that we are quite slow to ingest cases where a single source of messages tries to create a large number of connected elements.
I am personally working on a solution to improve that case, but for now the best way to mitigate/work around this behavior is data sharding: change the approach of the custom connectors you may have developed, or deploy the same connector on different time ranges (a sketch follows this comment). I know this situation is frustrating, but solving it will really require digging into your data and your connectors.
Some of the errors you have look unrelated to this lock/perf problem (like "Update event must contains a valid previous patch"). Are you able to share your error logs?
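To make the sharding idea concrete, the pattern is simply several instances of the same connector, each scoped to a disjoint slice of the data, for example by date range (a sketch; `MY_CONNECTOR_IMAGE`, `IMPORT_START_DATE`, and `IMPORT_END_DATE` are purely illustrative placeholders, not options of any specific connector):

```bash
# Two instances of the same (hypothetical) connector, each covering half of the time range.
# Each instance needs its own CONNECTOR_ID; the date-range variables are illustrative only.
docker run -d --name my-connector-2023 \
  -e OPENCTI_URL=https://opencti.example.org \
  -e OPENCTI_TOKEN="$TOKEN" \
  -e CONNECTOR_ID="$(uuidgen)" \
  -e IMPORT_START_DATE=2023-01-01 -e IMPORT_END_DATE=2023-12-31 \
  MY_CONNECTOR_IMAGE

docker run -d --name my-connector-2024 \
  -e OPENCTI_URL=https://opencti.example.org \
  -e OPENCTI_TOKEN="$TOKEN" \
  -e CONNECTOR_ID="$(uuidgen)" \
  -e IMPORT_START_DATE=2024-01-01 -e IMPORT_END_DATE=2024-12-31 \
  MY_CONNECTOR_IMAGE
```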
https://github.com/OpenCTI-Platform/opencti/issues/6285 Possibly related
Wanted to update and close this out: these ongoing rate issues were part of #7436.
@richard-julien I really appreciate the help with this one!
Description
Having a very hard time loading data into the platform; uploading any data takes a long time to process. 50K items in the queue can take more than 3 hours, which corresponds to a sustained ack rate of ~0.7/s. There is no manageable way to import this data before it begins to backlog.
Environment
None of these components appear resource-saturated; I have scaled everything as far as I can, and the platform is still barely able to import data.
Reproducible Steps
Steps to create the smallest reproducible scenario: N/A
Expected Output
An ack rate that imports data fast enough not to build up a backlog.
Actual Output
Additional information
Resource usage
Screenshots (optional)