OpenCTI-Platform / opencti

Open Cyber Threat Intelligence Platform
https://opencti.io

Error on deletion of entities with many relationships #6357

Closed iFrozenPhoenix closed 6 months ago

iFrozenPhoenix commented 7 months ago

Description

When I try to delete an entity, I receive the error "Fail to execute engine pagination - no reason provided". It only happens on certain entities.

Environment

  1. OS (where OpenCTI server runs): debian
  2. OpenCTI version: 6.0.5
  3. OpenCTI client: web
  4. Other environment details:

Reproducible Steps

  1. Go to entities
  2. Find the entity and select it
  3. Try to delete it
  4. Go to tasks and see the error "Fail to execute engine pagination - no reason provided"

Expected Output

Deletion is successful

Actual Output

Deletion fails with the error message "Fail to execute engine pagination - no reason provided"

Additional information

Screenshots (optional)

nino-filigran commented 7 months ago

Hey @iFrozenPhoenix, do you happen to know for which entity type it happens, or does it seem to happen randomly? Additionally, are you trying to delete a bunch of them at once or only one by one? If it's one by one, do you delete them by entering the entity and then deleting it, or do you delete directly from the list?

iFrozenPhoenix commented 7 months ago

The entities I have problems with are of type Malware. If I enter the entity and try to delete it, nothing happens. If I try to delete the entity via bulk deletion, I see the mentioned error. It doesn't affect all Malware entities, only certain ones.

iFrozenPhoenix commented 7 months ago

I was able to isolate the OpenCTI error messages on the backend; see the following.

{"category":"APP","errors":[{"attributes":{"genre":"TECHNICAL","http_status":500,"query":{"_source":true,"body":{"query":{"bool":{"must":[{"bool":{"minimum_should_match":1,"should":[{"bool":{"minimum_should_match":1,"should":[{"multi_match":{"fields":["entity_type.keyword","parent_types.keyword"],"query":"RetentionRule"}}]}}]}}],"must_not":[]}},"size":500,"sort":[{"standard_id.keyword":"asc"}]},"index":["opencti_internal_objects*","opencti_stix_meta_objects*","opencti_stix_domain_objects*","opencti_stix_cyber_observables*","opencti_inferred_entities*"],"track_total_hits":true},"root_cause":"rejected_execution_exception"},"message":"Fail to execute engine pagination","name":"DATABASE_ERROR","stack":"DATABASE_ERROR: Fail to execute engine pagination\n    at error (/opt/opencti/build/src/config/errors.js:8:10)\n    at DatabaseError (/opt/opencti/build/src/config/errors.js:58:48)\n    at /opt/opencti/build/src/database/engine.js:2731:15\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at retentionHandler (/opt/opencti/build/src/manager/retentionManager.js:62:28)\n    at /opt/opencti/build/src/manager/retentionManager.js:90:9\n    at lTt.#runHandlerAndScheduleTimeout (/opt/opencti/build/node_modules/set-interval-async/dist/set-interval-async-timer.cjs:36:13)\n    at Timeout._onTimeout (/opt/opencti/build/node_modules/set-interval-async/dist/set-interval-async-timer.cjs:29:13)"},{"message":"search_phase_execution_exception: [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@3e261c74 on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 87.9micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads 
= 30, queued tasks = 1392, completed tasks = 65924960]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@7df037c7 on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 108.5micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1390, completed tasks = 65924967]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@2a876b9 on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 108.5micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1392, completed tasks = 65924967]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@7c5b247a on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 941.7micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1390, completed tasks = 65924970]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@72905239 on QueueResizingOpenSearchThreadPoolExecutor[name = 
cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 418.6micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1389, completed tasks = 65924973]]","name":"ResponseError","stack":"ResponseError: search_phase_execution_exception: [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@3e261c74 on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 87.9micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1392, completed tasks = 65924960]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@7df037c7 on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 108.5micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1390, completed tasks = 65924967]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@2a876b9 on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, 
task execution EWMA = 108.5micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1392, completed tasks = 65924967]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@7c5b247a on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 941.7micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1390, completed tasks = 65924970]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@72905239 on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 418.6micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1389, completed tasks = 65924973]]\n    at onBody (/opt/opencti/build/node_modules/@opensearch-project/opensearch/lib/Transport.js:425:23)\n    at IncomingMessage.onEnd (/opt/opencti/build/node_modules/@opensearch-project/opensearch/lib/Transport.js:340:11)\n    at IncomingMessage.emit (node:events:530:35)\n    at endReadableNT (node:internal/streams/readable:1696:12)\n    at processTicksAndRejections (node:internal/process/task_queues:82:21)"}],"level":"error","manager":"RETENTION_MANAGER","message":"Fail to execute engine pagination","timestamp":"2024-03-14T17:03:07.280Z","version":"6.0.5"}
{"category":"APP","errors":[{"attributes":{"genre":"BUSINESS","http_status":500,"participantIds":["relationship--89230b56-befd-4387-936c-5e4b0e266f15","relationship--570df7f7-f2d5-4de9-ba74-4adfa99e980f","de5f2e30-f2fa-4c43-915d-37b86ad981b3","f9357ba9-85ff-49ed-947e-d7b48a6d0b6a"]},"message":"Execution timeout, too many concurrent call on the same entities","name":"LOCK_ERROR","stack":"LOCK_ERROR: Execution timeout, too many concurrent call on the same entities\n    at error (/opt/opencti/build/src/config/errors.js:8:10)\n    at LockTimeoutError (/opt/opencti/build/src/config/errors.js:104:51)\n    at createRelationRaw (/opt/opencti/build/src/database/middleware.js:2884:13)\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at createRelation (/opt/opencti/build/src/database/middleware.js:2892:16)\n    at addStixCoreRelationship (/opt/opencti/build/src/domain/stixCoreRelationship.js:91:19)"}],"inner_relation_creation":1,"level":"warn","message":"Execution timeout, too many concurrent call on the same entities","operation":"StixCoreRelationshipAdd","operation_query":"mutation StixCoreRelationshipAdd($input:StixCoreRelationshipAddInput!){stixCoreRelationshipAdd(input:$input){id standard_id entity_type parent_types}}","size":1172,"time":26056,"timestamp":"2024-03-14T17:03:36.072Z","type":"WRITE_ERROR","user":{"applicant_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","call_retry_number":"1","group_ids":["117df96d-c92f-4e18-8e22-be2f440859ca"],"ip":"::ffff:10.0.0.67","organization_ids":[],"socket":"query","user_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","user_metadata":{}},"version":"6.0.5"}
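The two log entries above come from different layers: a `DATABASE_ERROR` raised by the retention manager when OpenSearch rejects the search (`root_cause: rejected_execution_exception`, i.e. a saturated search thread pool), and a `LOCK_ERROR` raised by OpenCTI itself during relationship creation. When triaging logs like these, a small script to pull out the error name, message, and root cause from each newline-delimited JSON record can help; this is a minimal sketch assuming the log structure shown above, not an official OpenCTI tool:

```python
import json

def summarize_octi_errors(log_lines):
    """Extract (name, message, root_cause) triples from OpenCTI JSON log lines."""
    summary = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (e.g. plain-text noise)
        for err in record.get("errors", []):
            attrs = err.get("attributes", {})
            summary.append((
                err.get("name"),            # e.g. "DATABASE_ERROR", "LOCK_ERROR"
                err.get("message"),
                attrs.get("root_cause"),    # e.g. "rejected_execution_exception"
            ))
    return summary

# Example with a trimmed-down record shaped like the first log entry above:
sample = json.dumps({
    "category": "APP",
    "errors": [{
        "name": "DATABASE_ERROR",
        "message": "Fail to execute engine pagination",
        "attributes": {"root_cause": "rejected_execution_exception"},
    }],
})
print(summarize_octi_errors([sample]))
```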
iFrozenPhoenix commented 7 months ago

I could find out a bit more. The problem seems to occur with any entity if it has more than 1,000 relations. If the relations are deleted in advance, the deletion of the entity also works. If I understand it correctly, the mass selection and deletion deletes each object (relation / entity) one by one. For me it looks like there is a difference in how the relations are deleted when I delete the entity, compared to when I delete the relations by mass-selecting them.

With version 5.12.* it worked without problems.
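The workaround described above (delete the relations first, then the entity) can be scripted so that relations are removed in batches below the problematic threshold. This is a sketch against a hypothetical client wrapper; `list_relations`, `delete_relation`, and `delete_entity` are illustrative names, not the real OpenCTI/pycti API:

```python
def delete_entity_with_relations(client, entity_id, batch_size=500):
    """Delete an entity's relationships in batches (staying under per-query
    pagination limits), then delete the entity itself.
    `client` is a hypothetical wrapper, not the real OpenCTI client."""
    while True:
        relations = client.list_relations(entity_id, first=batch_size)
        if not relations:
            break
        for rel in relations:
            client.delete_relation(rel["id"])
    client.delete_entity(entity_id)

class StubClient:
    """In-memory stand-in used to demonstrate the batching logic."""
    def __init__(self, rel_ids):
        self.rels = [{"id": r} for r in rel_ids]
        self.deleted_relations = []
        self.deleted_entities = []
    def list_relations(self, entity_id, first):
        return self.rels[:first]
    def delete_relation(self, rel_id):
        self.rels = [r for r in self.rels if r["id"] != rel_id]
        self.deleted_relations.append(rel_id)
    def delete_entity(self, entity_id):
        self.deleted_entities.append(entity_id)

client = StubClient(["r1", "r2", "r3"])
delete_entity_with_relations(client, "entity-1", batch_size=2)
```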

nino-filigran commented 7 months ago

Thanks for the investigation! I'm reaching out internally to seek advice and help on this one.

iFrozenPhoenix commented 7 months ago

I tried adjusting my OpenSearch settings, especially the thread_pool.search.size parameter. I significantly increased it, and now the error message is a different one ("Execution timeout, too many concurrent call on the same entities - no reason provided") when I try to delete such an entity.

nino-filigran commented 7 months ago

cc @richard-julien

richard-julien commented 7 months ago

Increasing thread_pool.search.size is a good move, as the error in your logs is generated when OpenSearch is under too much load. So basically you have to improve your cluster capacity or your OpenSearch memory/thread figures.

The error "Execution timeout, too many concurrent call on the same entities" is completely different. It's an error generated by OpenCTI when too many operations are attempted on the same element in a short period of time. A quicker OpenSearch can mitigate this problem, but if you have a lot of workers working on the same elements and you try to delete at the same time, you can definitely get this kind of error. However, after a bit of time you should be able to delete the element if no other process is trying to work on it.
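Since the lock error is transient (it clears once concurrent operations on the element subside), callers scripting against the API can retry with backoff instead of failing outright. A minimal sketch, with `LockTimeoutError` as a stand-in for detecting OpenCTI's LOCK_ERROR response (not the library's actual exception type):

```python
import random
import time

class LockTimeoutError(Exception):
    """Stand-in for OpenCTI's LOCK_ERROR ('too many concurrent call...')."""

def with_backoff(operation, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Run `operation` (a zero-argument callable), retrying on lock contention
    with jittered exponential backoff: 1s, 2s, 4s, ... plus random jitter."""
    for attempt in range(max_retries):
        try:
            return operation()
        except LockTimeoutError:
            if attempt == max_retries - 1:
                raise  # contention persisted through all retries
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))

# Demo: an operation that hits lock contention twice, then succeeds.
calls = {"n": 0}
def flaky_delete():
    calls["n"] += 1
    if calls["n"] < 3:
        raise LockTimeoutError()
    return "deleted"

result = with_backoff(flaky_delete, sleep=lambda _: None)  # no real sleeping in the demo
```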

iFrozenPhoenix commented 7 months ago

Currently I'm running OpenCTI with the following setup: a machine with 16 GB RAM and 8 vCPUs (ARM Ampere); a single-node OpenSearch with Xms6g/Xmx6g; 5 - 10 OpenCTI workers (sometimes I change this, depending on ingestion); 1 platform for the UI and 1 platform for ingestion / API. (If I increase the number of replicas (k8s), performance drops significantly; I don't know why, but that's another topic.)

The OpenSearch config (only the modifications) I use is:

```yaml
thread_pool.search.min_queue_size: 10000
thread_pool.search.max_queue_size: 10000
thread_pool.search.queue_size: 10000
thread_pool.search.size: 32
```

Before version 6 I didn't apply the custom OpenSearch config, had Xms3g/Xmx3g, and it ran very smoothly. Now, after applying the above config, it runs again, at least nearly smoothly, but I still have the problem with the pagination error. (Possibly the "Execution timeout, too many concurrent call" error was sent at a moment when the ingestion rate was very high.)

After I have applied the above config I see again the following ingestion metrics:

[Screenshot from 2024-03-15 20-58-21: ingestion metrics]

and the UI is again relatively smooth, apart from some dashboard widgets with relation filtering.

Could you possibly give me a suggestion on what else I could try?

I have the following metrics: disk throughput 2k IOPS (of 50 - 100k), disk bandwidth 100 Mbps (of ca. 1,000 Mbps), OpenSearch CPU usage 105 - 230 % (500 left of 800 % total), 4 GB RAM left.

iFrozenPhoenix commented 7 months ago

I have to add that I currently have a new connector attached that is ingesting 1.2 M new entities, hence the high number of bundles to be processed. Under normal circumstances the platform is significantly more performant; a few hundred bundles are processed in no time.

iFrozenPhoenix commented 7 months ago

Additionally, OpenSearch's _cat/thread_pool?v query (active, queued, rejected columns) does not show a single rejection.
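The `_cat/thread_pool?v` check can be scripted against the text the API returns (e.g. `GET _cat/thread_pool?v&h=node_name,name,active,queue,rejected`). A sketch that parses that column layout and flags pools with rejections; one caveat worth knowing is that the `rejected` counter resets when a node restarts, so a zero after a restart does not rule out earlier rejections like the ones in the logs above:

```python
def parse_thread_pool(cat_output):
    """Parse `_cat/thread_pool?v&h=node_name,name,active,queue,rejected`
    output into one dict per node/pool row, keyed by the header line."""
    lines = cat_output.strip().splitlines()
    headers = lines[0].split()
    return [dict(zip(headers, line.split())) for line in lines[1:]]

def pools_with_rejections(rows):
    """Return only the rows whose rejected counter is non-zero."""
    return [r for r in rows if int(r["rejected"]) > 0]

# Sample shaped like real _cat output (values here are illustrative):
sample = """\
node_name name   active queue rejected
node-1    search 30     1392  12
node-1    write  2      0     0
"""
print(pools_with_rejections(parse_thread_pool(sample)))
```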

jborozco commented 6 months ago

cc @richard-julien, do you have any suggestions?

richard-julien commented 6 months ago

As your platform is growing, and maybe the data you're ingesting is also evolving, you start to have some contention problems on your OpenSearch. A less effective OpenSearch will introduce latencies and "too many concurrent" errors when a lot of workers work on the same objects and OpenSearch is slower than expected to process everything.

Today the solution can be to decrease the number of workers and accept that you will be a bit slower to ingest your data. Or you need to move to a more powerful OpenSearch, moving to a cluster approach. Even if you move to a bigger OpenSearch, you can still get some "too many concurrent" errors from time to time when the platform processes very highly connected data. It's something we are working on, to limit this kind of locking.
Today the solution can be to decrease the number of workers and assume that you will be a bit slower to ingest your data. Or you need to move to a more powerful opensearch, moving to a cluster approach. Even if you move to a bigger opensearch you can still have some "too many concurrent" times to times when the platform process very highly connected data. Its something we work on to limit this kind of locking.