simonbjorzen-ts opened 1 week ago
It appears that it is indeed the rules engine overloading the process. I have separated the rules engine to run in separate containers as recommended by the cluster documentation. However, the issue remains that the containers become unresponsive when processing the rules.
I've requested support internally; I'll get back to you.
Thanks. We are also seeing this error in the separate platform that is running the connector manager:
{"category":"APP","errors":[{"attributes":{"genre":"BUSINESS","http_status":500,"participantIds":["report--redacted_uuid","report--redacted_uuid"]},"message":"Execution timeout, too many concurrent call on the same entities","name":"LOCK_ERROR","stack":"GraphQLError: Execution timeout, too many concurrent call on the same entities\n at error (/opt/opencti/build/src/config/errors.js:7:10)\n at LockTimeoutError (/opt/opencti/build/src/config/errors.js:109:51)\n at createEntityRaw (/opt/opencti/build/src/database/middleware.js:3085:13)\n at processTicksAndRejections (node:internal/process/task_queues:95:5)\n at createEntity (/opt/opencti/build/src/database/middleware.js:3096:16)\n at addReport (/opt/opencti/build/src/domain/report.js:121:19)"}],"inner_relation_creation":1,"level":"warn","message":"Execution timeout, too many concurrent call on the same entities","operation":"Unspecified","query_attributes":[[{"arguments":[[{"is_empty":true,"name":"input","type":"Variable"}]],"name":"reportAdd"}]],"size":1567515,"source":"backend","time":61134,"timestamp":"2024-10-31T09:08:26.093Z","type":"WRITE_ERROR","user":{"applicant_id":"redacted_uuid","call_retry_number":"130","group_ids":["redacted_uuid"],"ip":"redacted","organization_ids":[],"socket":"query","user_id":"redacted_uuid","user_metadata":{}},"version":"6.3.7"}
Alongside that error, we also see this warning:
{"category":"APP","level":"warn","locks":["{locks}:report--redacted_uuid","{locks}:report--redacted_uuid"],"message":"Extending resources for long processing task","source":"backend","stack":"Error: \n at lockResource (/opt/opencti/build/src/database/redis.ts:367:28)\n at createEntityRaw (/opt/opencti/build/src/database/middleware.js:2945:18)\n at createEntity (/opt/opencti/build/src/database/middleware.js:3096:16)\n at addReport (/opt/opencti/build/src/domain/report.js:121:19)","timestamp":"2024-10-31T09:17:20.361Z","version":"6.3.7"}
Description
Platform containers reach 100% CPU usage and become unresponsive, which causes the liveness probe to fail and the containers to restart.
Environment
Reproducible Steps
Steps to create the smallest reproducible scenario:
Expected Output
Containers continue working and are not restarted
Actual Output
Containers sit at 100% CPU utilization and do not respond to health probes, causing repeated restarts (hundreds per day).
Additional information
Liveness probe configuration: we tried increasing the timeout, but it does not help.
No errors in the logs; the final log message is usually an inference similar to this:
Could it be that the rules engine is overloading the system with inferences?
Screenshots (optional)