simonbjorzen-ts opened 1 week ago
It appears that it is indeed the rules engine overloading the process. I have separated the rules engine to run in separate containers as recommended by the cluster documentation. However, the issue remains that the containers become unresponsive when processing the rules.
I've requested support internally; I'll get back to you.
Thanks. We are also seeing this error in the separate platform that is running the connector manager:
{"category":"APP","errors":[{"attributes":{"genre":"BUSINESS","http_status":500,"participantIds":["report--redacted_uuid","report--redacted_uuid"]},"message":"Execution timeout, too many concurrent call on the same entities","name":"LOCK_ERROR","stack":"GraphQLError: Execution timeout, too many concurrent call on the same entities\n at error (/opt/opencti/build/src/config/errors.js:7:10)\n at LockTimeoutError (/opt/opencti/build/src/config/errors.js:109:51)\n at createEntityRaw (/opt/opencti/build/src/database/middleware.js:3085:13)\n at processTicksAndRejections (node:internal/process/task_queues:95:5)\n at createEntity (/opt/opencti/build/src/database/middleware.js:3096:16)\n at addReport (/opt/opencti/build/src/domain/report.js:121:19)"}],"inner_relation_creation":1,"level":"warn","message":"Execution timeout, too many concurrent call on the same entities","operation":"Unspecified","query_attributes":[[{"arguments":[[{"is_empty":true,"name":"input","type":"Variable"}]],"name":"reportAdd"}]],"size":1567515,"source":"backend","time":61134,"timestamp":"2024-10-31T09:08:26.093Z","type":"WRITE_ERROR","user":{"applicant_id":"redacted_uuid","call_retry_number":"130","group_ids":["redacted_uuid"],"ip":"redacted","organization_ids":[],"socket":"query","user_id":"redacted_uuid","user_metadata":{}},"version":"6.3.7"}
Alongside that error, we also see this warning:
{"category":"APP","level":"warn","locks":["{locks}:report--redacted_uuid","{locks}:report--redacted_uuid"],"message":"Extending resources for long processing task","source":"backend","stack":"Error: \n at lockResource (/opt/opencti/build/src/database/redis.ts:367:28)\n at createEntityRaw (/opt/opencti/build/src/database/middleware.js:2945:18)\n at createEntity (/opt/opencti/build/src/database/middleware.js:3096:16)\n at addReport (/opt/opencti/build/src/domain/report.js:121:19)","timestamp":"2024-10-31T09:17:20.361Z","version":"6.3.7"}
Description
Platform containers reach 100% CPU usage and become unresponsive, which causes the liveness probe to fail and the containers to restart.
Environment
Reproducible Steps
Steps to create the smallest reproducible scenario:
Expected Output
Containers continue working and are not restarted
Actual Output
Containers sit at 100% CPU utilization and do not respond to health probes, causing repeated restarts (hundreds per day).
Additional information
Liveness probe configuration: we tried increasing the timeout, but it does not help.
No errors in the logs; the final log message is usually an inference similar to this:
Could it be that the rules engine is overloading the system with inferences?
Screenshots (optional)