OpenCTI-Platform / opencti

Open Cyber Threat Intelligence Platform
https://opencti.io

Platform 100% CPU Usage, unresponsive. #8833

Open simonbjorzen-ts opened 1 week ago

simonbjorzen-ts commented 1 week ago

Description

Platform containers reach 100% CPU usage and become unresponsive. The liveness probe then fails and the containers are restarted.

Environment

  1. OS (where OpenCTI server runs): Ubuntu 22.04 LTS (RKE2 K8S)
  2. OpenCTI version: 6.3.7
  3. OpenCTI client: frontend
  4. Other environment details:

Reproducible Steps

Steps to create the smallest reproducible scenario:

  1. Start platform
  2. Wait a couple of minutes
  3. Some platform containers reach 100% CPU usage and stop responding
  4. Health probe fails and restarts containers
  5. Repeat

Expected Output

Containers continue working and are not restarted

Actual Output

Containers sit at 100% CPU utilization and do not respond to health probes, causing repeated restarts (hundreds per day).

Additional information

Liveness probe configuration is shown here. I tried increasing the timeout, but that did not help.

livenessProbe:
  httpGet:
    path: "/health?health_access_key=redacted"
    port: 8080
  failureThreshold: 3
  initialDelaySeconds: 60
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
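
To make the timing concrete, here is a rough sketch of how long a hung container survives under the probe settings above. It assumes a simplified model that is not taken from the issue: a probe fires every `periodSeconds`, counts as failed after `timeoutSeconds` without a reply, and the restart is triggered on the `failureThreshold`-th consecutive failure. The `Probe` class and `restart_window` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    period_seconds: int
    timeout_seconds: int
    failure_threshold: int


def restart_window(probe: Probe) -> tuple[int, int]:
    """Return (earliest, latest) seconds between a hang and the restart decision."""
    # Consecutive failing probes are spaced one period apart.
    gaps = (probe.failure_threshold - 1) * probe.period_seconds
    # Best case: the first failing probe starts right after the hang.
    earliest = gaps + probe.timeout_seconds
    # Worst case: the hang starts just after a successful probe,
    # so the first failing probe starts almost a full period later.
    latest = earliest + probe.period_seconds
    return earliest, latest


# Values copied from the livenessProbe block in this issue.
reported = Probe(period_seconds=10, timeout_seconds=1, failure_threshold=3)
print(restart_window(reported))  # (21, 31)
```

Under this model the process only has to stall for roughly 21 to 31 seconds before it is restarted, and raising `timeoutSeconds` alone moves that window by just the size of the increase.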

The logs show no errors. The final log message is usually an inference similar to this:

{"category":"APP","id":"redacted uuid","level":"info","message":"Upsert inferred relation","relation":{"i_inference_weight":1,"i_rule_location_location":[{"data": redacted},"source":"backend","timestamp":"2024-10-30T14:08:17.918Z","version":"6.3.7"}

Could it be that the rules engine is overloading the system with inferences?
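
One way to check this hypothesis would be to count inference messages per minute in the backend logs and see whether the rate spikes before each restart. The sketch below assumes the logs are available as one JSON object per line with the `message` and `timestamp` keys seen above. The function name and the sample lines are made up for illustration and are not real platform output.

```python
import json
from collections import Counter


def inferences_per_minute(lines):
    """Count 'Upsert inferred relation' log entries per minute."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or redacted lines
        if not isinstance(entry, dict):
            continue
        if entry.get("message") != "Upsert inferred relation":
            continue
        # Timestamps look like 2024-10-30T14:08:17.918Z;
        # the first 16 characters identify the minute.
        minute = str(entry.get("timestamp", ""))[:16]
        counts[minute] += 1
    return counts


# Hypothetical sample lines, reduced to the keys the function reads.
sample = [
    '{"level":"info","message":"Upsert inferred relation","timestamp":"2024-10-30T14:08:17.918Z"}',
    '{"level":"info","message":"Upsert inferred relation","timestamp":"2024-10-30T14:08:45.102Z"}',
    '{"level":"info","message":"Some other message","timestamp":"2024-10-30T14:08:50.000Z"}',
    '{"level":"info","message":"Upsert inferred relation","timestamp":"2024-10-30T14:09:01.000Z"}',
    "not json",
]
for minute, count in sorted(inferences_per_minute(sample).items()):
    print(minute, count)
```

A sharp rise in the per-minute count shortly before the probe failures would support the idea that inference work is starving the health endpoint. A flat rate would point elsewhere.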


simonbjorzen-ts commented 1 week ago

It appears that the rules engine is indeed overloading the process. I have moved the rules engine into separate containers, as recommended by the cluster documentation. However, the containers still become unresponsive while processing the rules.

nino-filigran commented 1 week ago

I've requested support internally and will get back to you.

simonbjorzen-ts commented 1 week ago

Thanks. I am also seeing this error on the separate platform that runs the connector manager:

{"category":"APP","errors":[{"attributes":{"genre":"BUSINESS","http_status":500,"participantIds":["report--redacted_uuid","report--redacted_uuid"]},"message":"Execution timeout, too many concurrent call on the same entities","name":"LOCK_ERROR","stack":"GraphQLError: Execution timeout, too many concurrent call on the same entities\n    at error (/opt/opencti/build/src/config/errors.js:7:10)\n    at LockTimeoutError (/opt/opencti/build/src/config/errors.js:109:51)\n    at createEntityRaw (/opt/opencti/build/src/database/middleware.js:3085:13)\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at createEntity (/opt/opencti/build/src/database/middleware.js:3096:16)\n    at addReport (/opt/opencti/build/src/domain/report.js:121:19)"}],"inner_relation_creation":1,"level":"warn","message":"Execution timeout, too many concurrent call on the same entities","operation":"Unspecified","query_attributes":[[{"arguments":[[{"is_empty":true,"name":"input","type":"Variable"}]],"name":"reportAdd"}]],"size":1567515,"source":"backend","time":61134,"timestamp":"2024-10-31T09:08:26.093Z","type":"WRITE_ERROR","user":{"applicant_id":"redacted_uuid","call_retry_number":"130","group_ids":["redacted_uuid"],"ip":"redacted","organization_ids":[],"socket":"query","user_id":"redacted_uuid","user_metadata":{}},"version":"6.3.7"}

Along with this warning:

{"category":"APP","level":"warn","locks":["{locks}:report--redacted_uuid","{locks}:report--redacted_uuid"],"message":"Extending resources for long processing task","source":"backend","stack":"Error: \n    at lockResource (/opt/opencti/build/src/database/redis.ts:367:28)\n    at createEntityRaw (/opt/opencti/build/src/database/middleware.js:2945:18)\n    at createEntity (/opt/opencti/build/src/database/middleware.js:3096:16)\n    at addReport (/opt/opencti/build/src/domain/report.js:121:19)","timestamp":"2024-10-31T09:17:20.361Z","version":"6.3.7"}
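
To see which entities are being fought over, the `LOCK_ERROR` entries can be grouped by the IDs in `participantIds`. This is a minimal sketch that assumes one JSON object per log line with the `errors[].name` and `errors[].attributes.participantIds` fields shown in the error above. The function name and the `report--aaa` / `report--bbb` IDs are hypothetical placeholders, not values from the real logs.

```python
import json
from collections import Counter


def lock_contention(lines):
    """Count how often each entity ID appears in LOCK_ERROR entries."""
    per_entity = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        for err in entry.get("errors", []):
            if err.get("name") != "LOCK_ERROR":
                continue
            ids = err.get("attributes", {}).get("participantIds", [])
            per_entity.update(ids)
    return per_entity


# Hypothetical sample lines, reduced to the fields the function reads.
sample = [
    '{"errors":[{"name":"LOCK_ERROR","attributes":{"participantIds":["report--aaa","report--bbb"]}}]}',
    '{"errors":[{"name":"LOCK_ERROR","attributes":{"participantIds":["report--aaa"]}}]}',
    '{"level":"warn","message":"Extending resources for long processing task"}',
]
for entity_id, hits in lock_contention(sample).most_common():
    print(entity_id, hits)
```

If a handful of report IDs dominate the count, that would suggest several writers repeatedly upserting the same reports rather than a general locking problem.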