OpenCTI-Platform / opencti

Open Cyber Threat Intelligence Platform
https://opencti.io

Platform is hanging/busylooping #6285

Open mahmut-the-guy opened 7 months ago

mahmut-the-guy commented 7 months ago

Description

I'm using the provided docker compose setup. The platform node occasionally hangs, with memory usage rising indefinitely and CPU usage hovering around 100%. Note that no data ingestion is happening while this occurs.

strace gives the following output repeatedly, which makes me believe this is a locking issue:

[pid 3827868] madvise(0x256aaaf00000, 520192, MADV_DONTFORK <unfinished ...>
[pid 3827893] <... mprotect resumed>)   = 0
[pid 3827891] <... munmap resumed>)     = 0
[pid 3827892] <... munmap resumed>)     = 0
[pid 3827868] <... madvise resumed>)    = 0
[pid 3827894] mmap(0x153b42ec0000, 520192, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0 <unfinished ...>
[pid 3827892] mprotect(0x354a3ba80000, 262144, PROT_READ|PROT_WRITE <unfinished ...>
[pid 3827891] mprotect(0x35afdd00000, 262144, PROT_READ|PROT_WRITE <unfinished ...>
[pid 3827892] <... mprotect resumed>)   = 0
[pid 3827868] munmap(0x256aaaf40000, 258048 <unfinished ...>
[pid 3827894] <... mmap resumed>)       = 0x153b42ec0000
[pid 3827891] <... mprotect resumed>)   = 0
[pid 3827868] <... munmap resumed>)     = 0
[pid 3827894] madvise(0x153b42ec0000, 520192, MADV_DONTFORK <unfinished ...>
[pid 3827868] mprotect(0x256aaaf00000, 262144, PROT_READ|PROT_WRITE <unfinished ...>
[pid 3827894] <... madvise resumed>)    = 0
[pid 3827868] <... mprotect resumed>)   = 0
[pid 3827894] munmap(0x153b42f00000, 258048) = 0
[pid 3827894] mprotect(0x153b42ec0000, 262144, PROT_READ|PROT_WRITE) = 0
[pid 3827893] futex(0x7f007d03b924, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 3827892] futex(0x7f007d83e924, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 3827868] futex(0x7ffc64b7f4c4, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 3827894] futex(0x7f0078784e7c, FUTEX_WAIT_PRIVATE, 2147483664, NULL <unfinished ...>
[pid 3827891] futex(0x7ffc64b7f4c4, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 3827868] <... futex resumed>)      = 0
[pid 3827891] futex(0x7f0078784e7c, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 3827868] futex(0x7f0078784e7c, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 3827894] <... futex resumed>)      = 0
[pid 3827891] <... futex resumed>)      = 1
[pid 3827868] <... futex resumed>)      = 0
[pid 3827891] futex(0x7f007e041924, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 3827868] futex(0x7ffc64b7f4c4, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 3827894] futex(0x7ffc64b7f4c4, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 3827868] <... futex resumed>)      = 0
[pid 3827868] futex(0x7f0078784e7c, FUTEX_WAIT_PRIVATE, 2147483664, NULL <unfinished ...>
[pid 3827894] futex(0x7f0078784e7c, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 3827868] <... futex resumed>)      = 0
[pid 3827894] futex(0x7f007c838924, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 3827868] munmap(0x7f00782c4000, 65536) = 0
[pid 3827868] munmap(0x7f00784a5000, 65536) = 0
[pid 3827868] munmap(0x7f007851d000, 65536) = 0
[pid 3827868] futex(0x7f007d03b924, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 3827893] <... futex resumed>)      = 0
[pid 3827893] futex(0x7f007e858264, FUTEX_WAIT_PRIVATE, 2147483655, NULL <unfinished ...>
[pid 3827868] futex(0x7f007d83e924, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 3827892] <... futex resumed>)      = 0
[pid 3827868] futex(0x7f007e858264, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 3827892] futex(0x7f007e858264, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 3827893] <... futex resumed>)      = 0
[pid 3827868] <... futex resumed>)      = 1
[pid 3827893] munmap(0x16f2fd740000, 159744 <unfinished ...>
[pid 3827892] <... futex resumed>)      = 0
[pid 3827868] getpid( <unfinished ...>
[pid 3827892] futex(0x7f007d83e924, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 3827893] <... munmap resumed>)     = 0

I found this issue https://github.com/nodejs/node/issues/42277, which seems to be very similar to this case.

The platform logs complain about not being able to acquire Redis locks, along with some ECONNRESETs to RabbitMQ:

{
  "category": "APP",
  "level": "error",
  "manager": "NOTIFICATION_MANAGER",
  "message": "[REDIS] Failed to extend resource",
  "timestamp": "2024-03-06T13:41:12.728Z",
  "version": "5.12.24"
}
{
  "category": "APP",
  "errors": [
    {
      "attributes": {
        "genre": "TECHNICAL",
        "http_status": 500,
        "provider": "Playbook manager"
      },
      "message": "Redis stream consume fail",
      "name": "DATABASE_ERROR",
      "stack": "DATABASE_ERROR: Redis stream consume fail\n    at error (/opt/opencti/build/src/config/errors.js:8:10)\n    at DatabaseError (/opt/opencti/build/src/config/errors.js:58:48)\n    at processStep (/opt/opencti/build/src/database/redis.ts:673:20)\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at processingLoop (/opt/opencti/build/src/database/redis.ts:684:13)"
    },
    {
      "message": "Connection is closed.",
      "name": "Error",
      "stack": "Error: Connection is closed.\n    at close (/opt/opencti/build/node_modules/ioredis/built/redis/event_handler.js:189:25)\n    at Socket.<anonymous> (/opt/opencti/build/node_modules/ioredis/built/redis/event_handler.js:165:20)\n    at Object.onceWrapper (node:events:633:26)\n    at Socket.emit (node:events:518:28)\n    at TCP.<anonymous> (node:net:337:12)"
    }
  ],
  "level": "error",
  "message": "Redis stream consume fail",
  "timestamp": "2024-03-06T13:46:03.030Z",
  "version": "5.12.24"
}
{
  "message": "connect ECONNRESET 192.168.0.4:5672",
  "name": "Error",
  "stack": "Error: connect ECONNRESET 192.168.0.4:5672\n    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1595:16)"
}
{
    "category": "APP",
    "level": "warn",
    "locks": ["{locks}:activity_manager_lock"],
    "message": "Failed to unlock resource",
    "timestamp": "2024-03-07T11:54:07.138Z",
    "version": "5.12.24"
}

I reset RabbitMQ and Redis states multiple times to no avail.

Environment

  1. OS: Ubuntu 22.04
  2. OpenCTI version: 5.12.24
richard-julien commented 7 months ago

Saturating the CPU, and therefore the Node.js event loop, is the cause of the locking failures and Redis connection resets. Depending on your data volume and processing, you have to size the CPU differently to prevent this kind of situation.

mahmut-the-guy commented 7 months ago

What I meant was that it was utilizing 1 CPU at all times; my machine has 8 CPUs, so I don't think it has anything to do with CPU limits (i.e., not enough processing power). Note that during this time, no data ingestion happens at all. I disconnected all workers and connectors as well.

richard-julien commented 7 months ago

Node.js is not a multithreaded process, so it will use only 1 CPU, not all 8. Depending on your usage and the CPU's capability, this can lead to contention. To resolve this we recommend moving to a clustered deployment as documented here: https://docs.opencti.io/latest/deployment/clustering/. Hope it helps.

mahmut-the-guy commented 7 months ago

Yes, I explicitly allocated exactly 1 CPU to the process for that reason; the thing is, there is no load on the system: no external connectors or workers (so it does not make any sense to scale it). Even if we activated workers and connectors, they wouldn't be able to insert any data because the platform is unresponsive.

We tried scaling the number of platform instances too; that does not help either:

CONTAINER ID   NAME                                                                           CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
5bbc85e5114e   opencti_opencti.2.qixtzy5moocvpa640sxqd09il                                    8.06%     577.6MiB / 15.61GiB   3.61%     1.68MB / 273kB    84.3MB / 168kB    16
e2c81d93f48f   opencti_opencti.4.04jlo2zidv3didecuagaqi2pt                                    82.64%    2.128GiB / 15.61GiB   13.63%    20.3MB / 219MB    637MB / 152MB     16
24d5159d676d   opencti_opencti.5.v8fcokys4yngeyjlyznkxz4y2                                    80.87%    1.194GiB / 15.61GiB   7.65%     15MB / 5.79MB     1.1GB / 4.94MB    16
e6bf994a86d3   opencti_opencti.3.39t9ukn8oc6iwje57o7r2zu6x                                    81.89%    4.366GiB / 15.61GiB   27.97%    33.9MB / 417MB    436MB / 408MB     16
f19fbd87df97   opencti_minio.1.j6ch0p5kbipqu699ri7altdr2                                      0.06%     130.3MiB / 15.61GiB   0.81%     15.2MB / 68.9kB   179MB / 16.8MB    13
74cd09b7a602   opencti_opencti.1.mh0owhlm4gayvxrob74h73ykj                                    79.98%    3.909GiB / 15.61GiB   25.04%    53.5MB / 419MB    483MB / 387MB     16
95117a1fe447   opencti_rabbitmq.1.n1p3vbgoyjyory6g4b2l9vb9v                                   1.27%     132.8MiB / 15.61GiB   0.83%     659kB / 611kB     1.43GB / 1.33MB   38
6bf6d2acbcc5   opencti_redis.1.twz1readcco6ex77pl5a27qel                                      0.53%     10.44MiB / 3GiB       0.34%     5.36MB / 11.4MB   574MB / 5.58MB    6
richard-julien commented 6 months ago

That's not an issue with internal locking. It's an issue with the CPU usage of the node. It's hard to help you with that; you need to try to isolate the problem. I know it's hard, but you have to scale down everything (no connectors, no workers) and check the platform's behavior. And there is more than just the workers doing a lot of processing in the platform. Check whether you have some background task running, check whether you have activated the rule engine and what the current state of its processing is, etc. When I talk about platform clustering, it's more about isolating the processes for ingestion / rule engine / front-end, etc., in order to scale the different behaviors of the platform.

jasgggit commented 6 months ago

Hello, we are having an issue with ingestion from our MISP platform; after some time we become unable to use the platform.

Worker stats page: 4 Connected workers | 651.74K Queued bundles | 55/s Bundles processed | 2.14K/s Read operations | 50/s Write operations | 15.92M Total number of documents

Main | Reports 494 | 395.3K Indicators

System: 96 GB RAM / 18 vCPU / 300 GB SSD, single Docker server. 6 GB is allocated to ELK; everything else uses the defaults, and the recommended ELK settings are active.

And after seeing this in the logs, I cannot access the platform:

opencti-rabbitmq-1 | 2024-03-27 22:04:45.890783+00:00 [error] <0.1970.0> closing AMQP connection <0.1970.0> (172.21.0.16:60152 -> 172.21.0.2:5672): opencti-rabbitmq-1 | 2024-03-27 22:04:45.890783+00:00 [error] <0.1970.0> missed heartbeats from client, timeout: 60s

opencti-elasticsearch-1 | {"@timestamp":"2024-03-27T21:27:47.972Z", "log.level": "INFO", "current.health":"YELLOW","message":"Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[opencti_history-000001][0], [.ds-ilm-history-5-2024.03.27-000001][0]]]).","previous.health":"RED","reason":"shards started [[opencti_history-000001][0], [.ds-ilm-history-5-2024.03.27-000001][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[50f5210de1b2][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"OXcMw0BkRdqbkuAHiR4rZQ","elasticsearch.node.id":"1b-fh_HnQa-vo825_RqrNw","elasticsearch.node.name":"50f5210de1b2","elasticsearch.cluster.name":"docker-cluster"}

opencti-opencti-1 | {"category":"APP","errors":[{"attributes":{"genre":"TECHNICAL","http_status":500},"message":"Cannot read properties of undefined (reading '_index')","name":"UNKNOWN_ERROR","stack":"UNKNOWN_ERROR: Cannot read properties of undefined (reading '_index')\n at error (/opt/opencti/build/src/config/errors.js:8:10)\n at UnknownError (/opt/opencti/build/src/config/errors.js:76:47)\n at Object._logWithError (/opt/opencti/build/src/config/conf.js:331:23)\n at Object.error (/opt/opencti/build/src/config/conf.js:341:48)\n at Object.willSendResponse (/opt/opencti/build/src/graphql/loggerPlugin.js:119:20)\n at processTicksAndRejections (node:internal/process/task_queues:95:5)\n at async Promise.all (index 1)\n at b (/opt/opencti/build/node_modules/apollo-server-core/src/requestPipeline.ts:530:5)\n at processHTTPRequest (/opt/opencti/build/node_modules/apollo-server-core/src/runHttpQuery.ts:437:24)"},{"message":"Cannot read properties of undefined (reading '_index')","name":"TypeError","stack":"TypeError: Cannot read properties of undefined (reading '_index')\n at updateExpectationsNumber (/opt/opencti/build/src/domain/work.js:231:30)\n at processTicksAndRejections (node:internal/process/task_queues:95:5)"}],"inner_relation_creation":0,"level":"error","message":"Platform unmanaged direct error","operation":"addExpectations","operation_query":"mutation addExpectations($id:ID!$expectations:Int){workEdit(id:$id){addExpectations(expectations:$expectations)}}","size":96,"time":3,"timestamp":"2024-03-27T22:01:56.506Z","type":"WRITE_ERROR","user":{"group_ids":["457a9638-db5e-4afe-8455-ce52104977f6"],"ip":"::ffff:172.21.0.12","organization_ids":[],"socket":"query","user_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","user_metadata":{}},"variables":{"expectations":19527,"id":"work_bc9ac5e1-8955-40d5-96f4-5cb5995175c5_2024-03-27T21:27:54.695Z"},"version":"6.0.8"}

opencti-redis-1 | 40:C 27 Mar 2024 22:06:53.376 Fork CoW for RDB: current 4 MB, peak 4 MB, average 2 MB opencti-redis-1 | 1:M 27 Mar 2024 22:06:53.600 Background saving terminated with success

Question: For now I just want to know how I can cluster this using docker compose, to see if it's the same issue as above.

Thanks, and sorry if this is not the correct place, but this thread was the most recent one I found that is similar to my issue.

Regards. JG

ckane commented 6 months ago

I am also experiencing similar behavior on my instance; here's an excerpt of the opencti logs around the time of the locking failure:

opencti-1  | {"category":"APP","level":"warn","locks":["{locks}:relationship--bc89973c-32c7-409e-be65-068fb89006c8","{locks}:relationship--49727584-be95-5f85-9f0a-01965d759f23","{locks}:04ce7a5a-4e53-495b-aecc-cc20c764a509","{locks}:6df56458-c917-4321-bcc7-fe07dfedec35"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:17:24.889Z","version":"6.0.9"}
opencti-1  | {"category":"APP","errors":[{"attributes":{"genre":"BUSINESS","http_status":500,"participantIds":["relationship--b3005ac8-9396-42a3-9e10-839fd7662a48","relationship--d5576cc7-fbc9-55ae-9b3f-03e635d3b3c7","02312652-b6e7-4d42-b2f1-a6e037dd6eae","6df56458-c917-4321-bcc7-fe07dfedec35"]},"message":"Execution timeout, too many concurrent call on the same entities","name":"LOCK_ERROR","stack":"LOCK_ERROR: Execution timeout, too many concurrent call on the same entities\n    at error (/opt/opencti/build/src/config/errors.js:8:10)\n    at LockTimeoutError (/opt/opencti/build/src/config/errors.js:104:51)\n    at createRelationRaw (/opt/opencti/build/src/database/middleware.js:2921:13)\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at createRelation (/opt/opencti/build/src/database/middleware.js:2929:16)\n    at addStixCoreRelationship (/opt/opencti/build/src/domain/stixCoreRelationship.js:91:19)"}],"inner_relation_creation":1,"level":"warn","message":"Execution timeout, too many concurrent call on the same entities","operation":"StixCoreRelationshipAdd","operation_query":"mutation StixCoreRelationshipAdd($input:StixCoreRelationshipAddInput!){stixCoreRelationshipAdd(input:$input){id standard_id entity_type parent_types}}","size":666,"time":27276,"timestamp":"2024-04-05T21:17:31.937Z","type":"WRITE_ERROR","user":{"applicant_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","call_retry_number":"1","group_ids":["cc7ea4f2-08f4-458a-98ef-bf4769b70f39","89ce1c73-5a8b-4aef-ba35-ef1744f9bdd4","5adea5e7-d5f3-4b71-b555-c0565fd77d09","b6e28f8d-9aaa-4c37-9e98-0e6f46f4c8d2","241d8779-43c1-4dd6-bd55-a164106ae650"],"ip":"::ffff:172.25.0.7","organization_ids":[],"socket":"query","user_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","user_metadata":{}},"version":"6.0.9"}
opencti-1  | {"category":"APP","level":"info","message":"[OPENCTI] Expiration manager will revoke 470 elements","timestamp":"2024-04-05T21:17:40.450Z","version":"6.0.9"}
opencti-1  | {"category":"APP","level":"info","message":"[OPENCTI-MODULE] Running notification manager (digest)","timestamp":"2024-04-05T21:17:54.179Z","version":"6.0.9"}
opencti-1  | {"category":"APP","errors":[{"attributes":{"genre":"BUSINESS","http_status":500,"participantIds":["relationship--4c64f961-df4e-4b64-a2fb-c3792d54aa14","relationship--3163926b-c824-58ff-b3c1-53ce19405b72","5f3a1a81-254a-4329-8e21-a761cca1a084","6df56458-c917-4321-bcc7-fe07dfedec35"]},"message":"Execution timeout, too many concurrent call on the same entities","name":"LOCK_ERROR","stack":"LOCK_ERROR: Execution timeout, too many concurrent call on the same entities\n    at error (/opt/opencti/build/src/config/errors.js:8:10)\n    at LockTimeoutError (/opt/opencti/build/src/config/errors.js:104:51)\n    at createRelationRaw (/opt/opencti/build/src/database/middleware.js:2921:13)\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at createRelation (/opt/opencti/build/src/database/middleware.js:2929:16)\n    at addStixCoreRelationship (/opt/opencti/build/src/domain/stixCoreRelationship.js:91:19)"}],"inner_relation_creation":1,"level":"warn","message":"Execution timeout, too many concurrent call on the same entities","operation":"StixCoreRelationshipAdd","operation_query":"mutation StixCoreRelationshipAdd($input:StixCoreRelationshipAddInput!){stixCoreRelationshipAdd(input:$input){id standard_id entity_type parent_types}}","size":666,"time":28083,"timestamp":"2024-04-05T21:17:58.732Z","type":"WRITE_ERROR","user":{"applicant_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","call_retry_number":"1","group_ids":["cc7ea4f2-08f4-458a-98ef-bf4769b70f39","89ce1c73-5a8b-4aef-ba35-ef1744f9bdd4","5adea5e7-d5f3-4b71-b555-c0565fd77d09","b6e28f8d-9aaa-4c37-9e98-0e6f46f4c8d2","241d8779-43c1-4dd6-bd55-a164106ae650"],"ip":"::ffff:172.25.0.13","organization_ids":[],"socket":"query","user_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","user_metadata":{}},"version":"6.0.9"}
ckane commented 6 months ago

What appears to happen is that the OpenCTI build/back.js process gets into a situation where it is running at 100% of its CPU allocation, almost as if it were stuck on a spinlock. The lock (I think) eventually times out as a failure, and then the platform moves forward with some more ingestion until it runs into the deadlocking condition once again. Since build/back.js is single-threaded, this causes everything else on the system to idle, because it isn't making any requests while deadlocked like this.
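For anyone trying to confirm that the event loop itself is pegged, Node's built-in perf_hooks module can quantify the stall. This is only a minimal diagnostic sketch (not OpenCTI code), and it has to run inside the process whose loop you want to measure:

// Minimal event-loop stall probe (diagnostic sketch, not OpenCTI code).
// A healthy loop reports delays of a few milliseconds; a process pegged at
// 100% CPU on synchronous work reports hundreds of ms or even seconds.
import { monitorEventLoopDelay } from 'node:perf_hooks';

const histogram = monitorEventLoopDelay({ resolution: 20 }); // sample every 20 ms
histogram.enable();

setInterval(() => {
  // The histogram reports nanoseconds; convert to milliseconds for readability.
  console.log(
    `event-loop delay p50=${(histogram.percentile(50) / 1e6).toFixed(1)}ms ` +
    `p99=${(histogram.percentile(99) / 1e6).toFixed(1)}ms ` +
    `max=${(histogram.max / 1e6).toFixed(1)}ms`
  );
  histogram.reset();
}, 5000);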

ckane commented 6 months ago

Also, some more logs during the failure:

opencti-1  | {"category":"APP","level":"warn","locks":["{locks}:495bd52d-fdd8-43f2-8879-a8cb2309a508","{locks}:indicator--e14c849b-9486-5b9a-8939-b2d7ebcd0bcf"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:27:18.481Z","version":"6.0.9"}
opencti-1  | {"category":"APP","level":"warn","locks":["{locks}:5a97cb52-3488-44d2-8265-ae9dbe236822","{locks}:indicator--e136d19d-b7ad-593b-a590-24479b7cb716"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:27:18.564Z","version":"6.0.9"}
opencti-1  | {"category":"APP","level":"warn","locks":["{locks}:e82a61a0-d21f-432a-8f35-284e49d4a395","{locks}:indicator--e0873c7d-90b8-5f3a-ac8e-489bc7481fed"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:27:18.786Z","version":"6.0.9"}
opencti-1  | {"category":"APP","level":"warn","locks":["{locks}:2fbbee24-723e-46f6-9974-ccbecf2198ec","{locks}:indicator--e215f92a-56ae-55bc-8108-e92396fa080a"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:27:18.828Z","version":"6.0.9"}
opencti-1  | {"category":"APP","level":"warn","locks":["{locks}:21d9f25b-4e8f-436e-83f5-598175364d00","{locks}:indicator--ddc895c0-3b1c-5204-8372-7d571a3d38be"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:27:18.904Z","version":"6.0.9"}
opencti-1  | {"category":"APP","level":"warn","locks":["{locks}:relationship--4c2252a5-d69a-47e2-96d2-0eeb462f387e","{locks}:relationship--e44caef3-bbb1-5eec-9dfc-10778d2d3246","{locks}:e56769cf-0604-449b-9db3-4936c0041089","{locks}:5ea352de-9a2d-4b36-8fab-de77190f06d0"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:27:20.851Z","version":"6.0.9"}
opencti-1  | {"category":"APP","level":"warn","locks":["{locks}:relationship--9d180686-b12b-4028-9b3c-39152bbe1568","{locks}:relationship--137d74f0-da5d-562b-8ca0-8cfa41277292","{locks}:3f960d96-e22a-4676-b4b4-d634d464e468","{locks}:6df56458-c917-4321-bcc7-fe07dfedec35"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:27:21.266Z","version":"6.0.9"}
opencti-1  | {"category":"APP","level":"warn","locks":["{locks}:file_index_manager_lock"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:27:21.375Z","version":"6.0.9"}
opencti-1  | {"category":"APP","level":"info","message":"[OPENCTI-MODULE] Running rule manager from 1712351880443-0","timestamp":"2024-04-05T21:27:26.861Z","version":"6.0.9"}
opencti-1  | {"category":"APP","level":"info","message":"[STREAM] Starting stream processor at 1712351880443-0 for Rule manager","timestamp":"2024-04-05T21:27:26.861Z","version":"6.0.9"}
opencti-1  | {"category":"APP","level":"info","message":"[REDIS] Redis 'Rule manager' client ready","timestamp":"2024-04-05T21:27:26.873Z","version":"6.0.9"}
opencti-1  | {"category":"APP","level":"warn","locks":["{locks}:expired_manager_lock"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:27:34.104Z","version":"6.0.9"}
opencti-1  | {"category":"APP","errors":[{"attributes":{"genre":"BUSINESS","http_status":500,"participantIds":["relationship--1bdba7c3-91f2-416c-9908-b3cdff061730","relationship--2a191689-e184-563b-8d22-41b8fb170723","862f409e-c77d-4f69-b413-9b7c5517acb7","6df56458-c917-4321-bcc7-fe07dfedec35"]},"message":"Execution timeout, too many concurrent call on the same entities","name":"LOCK_ERROR","stack":"LOCK_ERROR: Execution timeout, too many concurrent call on the same entities\n    at error (/opt/opencti/build/src/config/errors.js:8:10)\n    at LockTimeoutError (/opt/opencti/build/src/config/errors.js:104:51)\n    at createRelationRaw (/opt/opencti/build/src/database/middleware.js:2921:13)\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at createRelation (/opt/opencti/build/src/database/middleware.js:2929:16)\n    at addStixCoreRelationship (/opt/opencti/build/src/domain/stixCoreRelationship.js:91:19)"}],"inner_relation_creation":1,"level":"warn","message":"Execution timeout, too many concurrent call on the same entities","operation":"StixCoreRelationshipAdd","operation_query":"mutation StixCoreRelationshipAdd($input:StixCoreRelationshipAddInput!){stixCoreRelationshipAdd(input:$input){id standard_id entity_type parent_types}}","size":666,"time":29188,"timestamp":"2024-04-05T21:27:42.077Z","type":"WRITE_ERROR","user":{"applicant_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","call_retry_number":"1","group_ids":["cc7ea4f2-08f4-458a-98ef-bf4769b70f39","89ce1c73-5a8b-4aef-ba35-ef1744f9bdd4","5adea5e7-d5f3-4b71-b555-c0565fd77d09","b6e28f8d-9aaa-4c37-9e98-0e6f46f4c8d2","241d8779-43c1-4dd6-bd55-a164106ae650"],"ip":"::ffff:172.25.0.8","organization_ids":[],"socket":"query","user_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","user_metadata":{}},"version":"6.0.9"}
opencti-1  | {"category":"APP","errors":[{"attributes":{"genre":"BUSINESS","http_status":500,"participantIds":["relationship--16e47284-f878-4880-ae87-e28adddf6ac6","relationship--811b0ccb-3d2a-51f3-82a3-52631c898069","448b8d5c-127c-43cc-867a-eaa6969da0f7","6df56458-c917-4321-bcc7-fe07dfedec35"]},"message":"Execution timeout, too many concurrent call on the same entities","name":"LOCK_ERROR","stack":"LOCK_ERROR: Execution timeout, too many concurrent call on the same entities\n    at error (/opt/opencti/build/src/config/errors.js:8:10)\n    at LockTimeoutError (/opt/opencti/build/src/config/errors.js:104:51)\n    at createRelationRaw (/opt/opencti/build/src/database/middleware.js:2921:13)\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at createRelation (/opt/opencti/build/src/database/middleware.js:2929:16)\n    at addStixCoreRelationship (/opt/opencti/build/src/domain/stixCoreRelationship.js:91:19)"}],"inner_relation_creation":1,"level":"warn","message":"Execution timeout, too many concurrent call on the same entities","operation":"StixCoreRelationshipAdd","operation_query":"mutation StixCoreRelationshipAdd($input:StixCoreRelationshipAddInput!){stixCoreRelationshipAdd(input:$input){id standard_id entity_type parent_types}}","size":666,"time":27776,"timestamp":"2024-04-05T21:27:47.745Z","type":"WRITE_ERROR","user":{"applicant_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","call_retry_number":"1","group_ids":["cc7ea4f2-08f4-458a-98ef-bf4769b70f39","89ce1c73-5a8b-4aef-ba35-ef1744f9bdd4","5adea5e7-d5f3-4b71-b555-c0565fd77d09","b6e28f8d-9aaa-4c37-9e98-0e6f46f4c8d2","241d8779-43c1-4dd6-bd55-a164106ae650"],"ip":"::ffff:172.25.0.12","organization_ids":[],"socket":"query","user_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","user_metadata":{}},"version":"6.0.9"}
opencti-1  | {"category":"APP","level":"error","manager":"NOTIFICATION_MANAGER","message":"[REDIS] Failed to extend resource","timestamp":"2024-04-05T21:27:49.633Z","version":"6.0.9"}
opencti-1  | {"category":"APP","level":"warn","locks":["{locks}:notification_digest_manager_lock"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:27:49.637Z","version":"6.0.9"}

Where are these {locks}:relationship--49727584-be95-5f85-9f0a-01965d759f23 objects stored?

ckane commented 6 months ago

Where are these {locks}:relationship--49727584-be95-5f85-9f0a-01965d759f23 objects stored?

OK, it looks like they are stored in Redis too. One interesting thing I am noticing: when everything is running normally, if I run the following command I get a varying-sized list of keys, due to a lot of create/delete(?) activity occurring on the {locks} keys: docker compose exec valkey valkey-cli keys {locks}\*

For example:

337) "{locks}:indicator--35a12b80-a04a-427c-9ed6-e113cea1f57b"
338) "{locks}:indicator--38d144eb-1210-4347-a461-8e5207a9b385"
339) "{locks}:rule_engine_settings"
340) "{locks}:indicator--ccba8c15-d81d-4634-8511-1b55f9c887c3"
341) "{locks}:rule_engine_lock"
342) "{locks}:indicator--212c45ae-4b91-4223-9bc3-79cbf80c5846"
343) "{locks}:indicator--ccd215ad-31a5-4868-8bf6-678d2f0d8856"
344) "{locks}:indicator--9eb0b826-8123-4fdf-9aae-cf64c12f4faf"
345) "{locks}:indicator--3980152b-53fb-4a9a-909d-c4595e09c4a0"
346) "{locks}:indicator--f04e1ca4-45a0-4d94-ac53-f99a9111d082"
347) "{locks}:indicator--993e149b-7dc7-4b23-942e-008bda96e199"
348) "{locks}:indicator--8c6364f6-4fdd-452b-b477-74a05251b22d"
349) "{locks}:indicator--4dbb3ee9-4322-4b8b-9798-bade00edad56"
350) "{locks}:indicator--6c3bbbc4-9d96-4183-8baf-09c39849d5fd"
351) "{locks}:indicator--56fa3bf6-62a6-4f57-bbd1-0941f1fe0ffd"
352) "{locks}:indicator--50c91e64-6413-46f4-b3a4-45214867994c"
353) "{locks}:indicator--a93f5adc-9a47-4156-b25f-6dce1e84bd9a"
354) "{locks}:indicator--5ea314d3-7bf7-44ad-985b-712c74cec0da"
355) "{locks}:indicator--0cab01de-56d4-44eb-816a-b32da9bb07d9"
356) "{locks}:indicator--763cb756-5d95-4f9b-992b-ef465f7ee31d"
357) "{locks}:indicator--891b2f6f-9286-4c8e-b825-249365b6b09f"
358) "{locks}:indicator--5e6d6d87-20b9-4d62-bd33-a46956382f6c"
359) "{locks}:indicator--a181b8ad-1bc9-4013-a5c1-74f6406fe995"
360) "{locks}:indicator--9173c1d0-4544-449b-ae26-79783d06bde1"
361) "{locks}:indicator--caa099f5-83b4-43e8-9564-de1a615aeb4e"
362) "{locks}:indicator--b81a85bc-fdae-4014-a7d2-aebbc41f88f5"
363) "{locks}:indicator--20d66b5c-950e-4647-af1e-9704b6f26c77"
364) "{locks}:indicator--3baf30cb-5b6c-4a72-8106-8f4f11416752"
365) "{locks}:indicator--06f803e8-152f-4557-83a2-e78dac79cfae"
366) "{locks}:05696b01-2a02-411b-9bec-7c88c76f3e08"
367) "{locks}:indicator--fd1215d2-f184-4340-ba16-c52cc2d10335"
368) "{locks}:indicator--08cc7831-40b0-4e5b-a7da-902a4461c888"
369) "{locks}:indicator--54b00e41-9dd3-4f40-bffe-b11e38536a1c"
370) "{locks}:indicator--907f7eb7-7410-4b0a-98a7-18a1f2e959a2"
371) "{locks}:indicator--822f0251-26d4-44d9-81f8-e00e3e7e9e59"
372) "{locks}:indicator--63bdc399-5bce-455e-9f00-9f274a771757"
373) "{locks}:indicator--31101938-c8ab-44e2-bbb9-64af94a852ef"
374) "{locks}:indicator--737f5966-d1d2-4d15-839a-68543acbdb76"
375) "{locks}:indicator--49312f40-3d2d-4bb9-8653-bc9286f91917"
376) "{locks}:indicator--dd2d5f70-41b0-4635-9add-e51e35fd3032"
377) "{locks}:indicator--72e1b0a9-e77b-42e7-841f-d6f69f824d5f"
378) "{locks}:indicator--78ee5b03-4f69-479c-a730-290af1ef50a9"
379) "{locks}:indicator--bd489cba-e24d-40fb-9117-afd57c793e8f"
380) "{locks}:8e160bb6-5d5c-4e33-9228-f99c81de5e9d"
381) "{locks}:indicator--633126f7-e1e8-466d-8a27-542d049635e2"

However, when the platform deadlocks and reports Failed to unlock resource, if I run the same command while the system is idle, I get no {locks} keys:

(empty array)

Is it possible that the Failed to unlock resource errors are being created because the relevant lock is being deleted before it is getting unlocked?

Does the platform then mistakenly interpret that as lock contention (rather than as an attempt to unlock a non-locked resource, or whatever a non-existent Redis key signifies) and stall/spin for the timeout duration trying to successfully unlock using a key that has already been deleted from Redis?
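For context on why a missing key surfaces as an unlock failure: a redlock-style release is typically a compare-and-delete Lua script, so if the key has already expired or been deleted, nothing is deleted and the client treats the release as failed. A rough ioredis sketch of that generic pattern (this is not the platform's actual code; the key name and token below are made up):

import Redis from 'ioredis';

const redis = new Redis();

// Generic redlock-style release: delete the key only if it still holds the
// random token written by the lock owner (compare-and-delete in one step).
const RELEASE_SCRIPT = `
if redis.call("get", KEYS[1]) == ARGV[1] then
  return redis.call("del", KEYS[1])
else
  return 0
end
`;

async function releaseLock(key: string, ownerToken: string): Promise<boolean> {
  // eval(script, numberOfKeys, key, arg) returns 1 if the key was deleted, 0 otherwise.
  const deleted = (await redis.eval(RELEASE_SCRIPT, 1, key, ownerToken)) as number;
  // If the lock's TTL elapsed while the event loop was busy (or the key was
  // removed some other way), this returns 0, which is the situation that would
  // be logged as something like "Failed to unlock resource".
  return deleted === 1;
}

// Hypothetical usage; the key name mirrors what appears in the logs above.
releaseLock('{locks}:relationship--49727584-be95-5f85-9f0a-01965d759f23', 'owner-token')
  .then((ok) => console.log(ok ? 'released' : 'lock was already gone'))
  .finally(() => redis.disconnect());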

ckane commented 6 months ago

Of note, we are using node-redlock version 5.0.0-beta2 which was tagged in March 2022.

However, the following PRs related to "lock extension", which appear to fix lock-extension failures, have since been merged into the project's main development branch; no new beta release has been tagged since these PRs were merged:

ckane commented 6 months ago

FWIW, I have mitigated the end-user impact on the GUI front-end by setting up a second opencti/platform service that listens on unexposed port 8070 and then pointing the worker services at that one instead of the main front end.

richard-julien commented 6 months ago

Hi @ckane, there are multiple things in your message.

100% CPU usage / Failed to extend resource and unlock resource

We still have some contention points in the platform's CPU usage that we are trying to resolve. When this happens and the CPU usage is too high, the platform is not able to extend or release the lock in time, producing these errors. For me, the problem is to find and fix the CPU usage.
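To make that failure mode concrete: the lock is refreshed from a timer on the same event loop that runs the heavy work, so a long synchronous stretch pushes the refresh past the lock's TTL and the lock expires underneath its holder. A self-contained illustration in plain Node/TypeScript (the TTL and durations are arbitrary, and the in-memory "lock" only stands in for a Redis key with a TTL):

// Illustration of how a blocked event loop defeats lock extension.
// The "lock" is an in-memory expiry standing in for a Redis key with a TTL.
const LOCK_TTL_MS = 500;
let lockExpiresAt = Date.now() + LOCK_TTL_MS;

// Extension timer: scheduled on the same event loop as the work below.
const extender = setInterval(() => {
  if (Date.now() > lockExpiresAt) {
    console.log('too late: lock already expired -> "Failed to extend resource"');
    clearInterval(extender);
    return;
  }
  lockExpiresAt = Date.now() + LOCK_TTL_MS;
  console.log('lock extended');
}, LOCK_TTL_MS / 2);

// Simulated ingestion of a huge bundle: a synchronous, CPU-bound stretch.
// While this loop runs, no timer can fire, so no extension can happen.
setTimeout(() => {
  const start = Date.now();
  while (Date.now() - start < 2000) { /* event loop blocked */ }
  console.log('synchronous work finished after ~2s');
}, 100);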

Execution timeout, too many concurrent call on the same entities

This error occurs when the platform tries to ingest too much on the same element at the same time. It's a protection mechanism. We have some ideas and ongoing work to improve that part.

Remaining bugs on the library

Maybe we need to move to https://github.com/sesamecare/redlock, as the initial library now looks inactive. We will do some tests to see if everything is OK with the fork.

I have mitigated the end-user impact on the GUI front-end by setting up a second opencti

That's exactly what we recommend. Splitting the platforms responsible for ingestion from the ones responsible for the GUI is a really good approach.

ckane commented 6 months ago

Remaining bugs on the library

Maybe we need to move to https://github.com/sesamecare/redlock, as the initial library now looks inactive. We will do some tests to see if everything is OK with the fork.

Thanks. After a bit of work, I was able to get a build of platform/opencti that used redlock from the main branch of that git repository rather than the 5.0.0-beta.2 tag in the npm registry. It did not appear to fix the behavior, however.

I'll try building a new image using the library you suggested and see if it changes anything. It might also be worth testing again with the non-beta 4.x.x code, but I don't know what unique features in the beta (if any) prompted the update for us.

Still, one of the things that is interesting to me is that when the build/back.js process appears to deadlock, there are no locks present in Redis when the failures to unlock occur. Since "unlock" is supposed to be a validated DEL operation, this would suggest that the platform expected a lock to be there, but one didn't exist ... possibly for one of the following reasons (according to my understanding of the redlock algorithm):

ckane commented 6 months ago

Maybe we need to move to https://github.com/sesamecare/redlock as the initial library looks now inactive. We will do some test to see if everything is ok with the fork.

I have created a PR for this in #6622. I am currently testing the changes on a fresh local instance, and if it appears to work fine, I'll try applying these changes to my community instance.

ckane commented 6 months ago

Testing on my community instance (where the deadlocking is reproducible), the change to @sesamecare-oss/redlock@1.2.1 does still work, but the deadlocking problem remains. I recognize that the problem is likely a result of resource contention on the system... I'm wondering if we need to identify this failure and determine a more graceful recovery.

ckane commented 6 months ago

OK, I have 4 opencti/platform instances running now, and I dropped my worker count to 3 on this system, and that seems to have cleared up a lot of the above errors. However, I still encounter deadlocking (I even encounter it when there's only 1 worker), and with some of the other errors cleared, it seems like it often coincides with really large bundles from the DIGITALSIDE.IT misp-feed, resulting in error messages like the attached (which I've extracted from the logs and formatted with jq). error_out.json

ckane commented 5 months ago

Also, I am curious: why is the timeout value doubled in this line of code?

https://github.com/OpenCTI-Platform/opencti/blob/master/opencti-platform/opencti-graphql/src/database/redis.ts#L336

ckane commented 5 months ago

So, I learned a few more things. On my system, I have a misp-feed connector that is pulling from the MISP feed at https://osint.digitalside.it/Threat-Intel/digitalside-misp-feed/. This is a feed that reports IOCs associated with malware artifacts (identified by hash). When I scanned my logs after leaving it running (while it accumulated increasingly more work than it consumed), I found that the following DIGITALSIDE reports were being repeatedly injected into the work queue:

   7122 59ce0baba11893f90527fc951ac69912
    865 9b6c3518a91d23ed77504b5416bfb5b3
   4037 a73ddd6ec22462db955439f665cad4e6
     22 eec5c6c219535fba3a0492ea8118b397
     32 fbe51695e97a45dc61967dc3241a37dc

59ce0baba11893f90527fc951ac69912 corresponds to: https://osint.digitalside.it/Threat-Intel/digitalside-misp-feed/d92a4442-03b3-419b-8250-210cb30cc12f.json
a73ddd6ec22462db955439f665cad4e6 corresponds to: https://osint.digitalside.it/Threat-Intel/digitalside-misp-feed/e928e37d-597f-4c5b-bac6-b1365c337614.json
9b6c3518a91d23ed77504b5416bfb5b3 corresponds to: https://osint.digitalside.it/Threat-Intel/digitalside-misp-feed/78c421fd-0fed-4038-a6ad-ec177477b3fa.json
eec5c6c219535fba3a0492ea8118b397 corresponds to: https://osint.digitalside.it/Threat-Intel/digitalside-misp-feed/2cfee480-f6a0-4b29-a1b8-77f90fa2cbed.json

These 4 MISP JSON documents are the only ones that are one MB or more in size, which is considerably larger than the typical size of a report in that collection (a few kB, or 10-20 kB). This suggests that the problem is likely reproducible with very large report ingestions.

fbe51695e97a45dc61967dc3241a37dc corresponds to https://osint.digitalside.it/Threat-Intel/digitalside-misp-feed/90fd343c-3631-4d60-b627-1935ef7b4938.json and is 740kB in size.

I suspect that what's happening in my environment is probably something like this:

  1. The misp-feed connector injects the malware reports into the queue for the workers
  2. A worker attempts to lock the resources, which creates a very long list of locks in Redis
  3. The worker then attempts to upload the entire bundle of locked resources
  4. This takes a really, really long time, eventually failing for some reason... (maybe being killed by extending the locks past their max lifespan?)
  5. During this time, the updates to the DigitalSide MISP report aren't reflected in the platform, so the relevant connector runs another update cycle and injects another copy of these malware reports into the queue
  6. Any other workers that try to pick up these other copies of the malware reports will be blocked by the worker that acquired the locks first, resulting in a stalling condition for all workers until the work/locks fail and allow the worker holding the locks to release them (a rough simulation of this blocking pattern is sketched after this list)
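Here is a toy model of steps 2-6 (this is not the platform's actual redlock usage, which spans many keys at once; the key name, TTL, and retry counts are invented) showing how a duplicate bundle just burns its retries waiting behind the first worker and then fails:

import Redis from 'ioredis';

const redis = new Redis();

// One "worker" trying to take a single lock key with a bounded number of retries.
async function tryAcquire(worker: string, key: string, ttlMs: number, retries: number): Promise<boolean> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    // SET key value PX ttl NX -> 'OK' when the lock is taken, null when it is already held.
    const ok = await redis.set(key, worker, 'PX', ttlMs, 'NX');
    if (ok === 'OK') {
      console.log(`${worker} acquired ${key}`);
      return true;
    }
    await new Promise((resolve) => setTimeout(resolve, 250)); // back off and retry
  }
  console.log(`${worker} gave up on ${key} (analogous to the LOCK_ERROR execution timeout)`);
  return false;
}

async function main() {
  const key = '{locks}:report--example'; // hypothetical key name
  // Worker A grabs the lock and "ingests" for far longer than B is willing to wait.
  await tryAcquire('worker-A', key, 30_000, 0);
  // Worker B, handed a duplicate of the same bundle, retries a few times and fails.
  await tryAcquire('worker-B', key, 30_000, 5);
  await redis.quit();
}

main().catch(console.error);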

I have chosen to address this in my platform by stopping the misp-feed connector I had watching the above DigitalSide.IT misp folder, and the backlog is slowly clearing now. I plan on leaving it disabled until the entire backlog is drained.

Do we think there's a way to possibly use finer-grained locking?

Do we think there's a way to tell the system that, in the event of locking contention encountered, to move to the next work item rather than blocking the worker?

Is there a way for the misp-feed connector to inspect the backlog, in addition to the persistent database, and determine if it is about to inject duplicate work?

Forgive me if the above makes incorrect interpretations about how the system is functioning.

During the stalling condition, the count of "connected workers" in the Ingestion->Connectors view drops to 1, which indicates to me that the workers are stalled waiting for the locks to free up, and are no longer even reporting heartbeats to the platform.

ckane commented 5 months ago

I suppose the primary path to fixing these problems is to answer the question "Why are these 5 reports failing to ingest properly?", and to determine whether that can be reproduced on other systems (or not). I suspect it would probably work OK if I gave it more time... so maybe the answer here is to extend the locking retry counts and/or some other timeout values like that, and let the connector ingest again with the higher thresholds?

richard-julien commented 5 months ago

Are these reports using 100% CPU at ingestion?

ckane commented 5 months ago

I think so - there's 100% CPU utilization in one of the build/back.js processes while the system is stalling.

ckane commented 5 months ago

Also, I am seeing the following in my logs (when using docker compose logs to view the worker + platform logs together). I have cut the first log line down in the output below, but it actually comes from the opencti/platform container, and it occurs right before the workers start handling another report ingestion and stall again. It consists of a really, really long list of entities (I presume all of the entities consumed from the really large misp-feed report) and fails to write them to the database for some reason:

.... "indicator--430b1a34-1624-56f6-b6bc-fb99d8e009f0","ipv4-addr--7ebffc4b-c8c6-55d7-bb69-47489c85314d","indicator--ba436871-745c-53f9-aace-da4b02953a0b","url--28305391-0922-55c9-a20c-81adfb3fcad7","indicator--a5f9b024-5954-54a5-be91-1be6f9f2985f","ipv4-addr--617ffd42-a3a6-5d03-9519-f0372e16634a","indicator--3675b1fe-74a5-51d9-94b8-48b260e48220","url--ef646e34-9ac9-5cf2-90ed-612c92d217cb","indicator--050924c3-b7ee-57a7-ae2e-6c4d9b620665","ipv4-addr--ef6e238e-a441-576a-923f-5357b7f92d01","indicator--b15ab93a-907d-5a66-bafc-32b00a0c22c6","url--5560a825-b68d-564e-9827-92a026e14833","indicator--22857d15-9fb6-5797-a58e-961919c29bff","ipv4-addr--a3777c46-4fb6-5739-8236-a19975d3417d","indicator--871bfcfa-9255-55d8-9ddd-2122427edd34","url--359a51c7-5d11-58a0-83e4-cbfb9056f684","indicator--b975336b-ab17-5337-b320-a275552995f7","ipv4-addr--a0c97011-84ec-5c45-b578-552712641f81","indicator--c74e2e81-db3a-5bb8-a8fc-f580f938a138","url--34d5909b-2876-5909-8451-6e3c9188dc86","indicator--9423489e-08d2-56c1-bb4c-06b3534798d2","ipv4-addr--d788d306-bd53-5598-b1c2-b6308477ad82","indicator--3409c7fd-db3c-5054-9138-6410b780a58f","url--eab4b451-9b99-55ec-b3eb-fd27674aadd6","indicator--973c4043-46fa-51ae-b4f0-a3be6f25300d","ipv4-addr--248b9d49-8544-5d4e-8b2b-778014d71ec8","indicator--553feb71-1d05-5195-b9b1-e2e23630f54f","url--2bf6475e-5a82-5890-b1d4-cd457f407f22","indicator--849f1504-aaf5-5cf8-8dbb-9a8a551631c8","ipv4-addr--a9098db4-67c8-5795-af3b-75453f751ce5","indicator--f41df02f-102d-537d-80e8-f65f8713266a","url--a67be361-2c72-56b9-af8e-954f9fa76a86","indicator--10e6a00c-0efd-5430-9125-fc49bcee7e40","ipv4-addr--95e3720b-4ab7-52d4-a3e4-8e9c39248039","indicator--3f52eba7-0d09-5417-b6fd-cc6b673abe46","url--3c234ddf-1630-5ca6-8432-1ffc8bad8d24","indicator--e6274748-9819-5790-8480-7f01e16e4b24","ipv4-addr--f471e5b6-3275-5343-bb70-14ef71993032","indicator--f65ee92a-b94b-5094-b9ee-03fc69930dd4","url--7032c335-8954-5903-b5c3-11ab0cd0892c","indicator--9ea03abf-4969-5ba8-9b03-922ffe3bff42","ipv4-addr--f32670ca-111d-5005-ba29-81b469575e77","indicator--b7561b99-a6e3-5d96-90b5-8bcbdafe5188","url--aa2d0d58-da1e-57dd-839f-141e923caed8","indicator--ebb4470b-4de6-55d8-9762-d9c1c8a44a0d","ipv4-addr--97ea2588-bcb9-5703-9075-e751602d33f0","indicator--2deff448-c766-56d5-8a73-300f7eb15fe4","url--19718aa6-e59c-5872-9549-9e5cf74029e7","indicator--0017a894-e4bf-5317-8e30-bd663d926ea7","ipv4-addr--97783a10-7b08-5f66-b819-f31341411f0f","indicator--a0fc8de3-9ebf-5115-b5ee-ff3f346158db","url--39394df5-1e72-5c5f-8d83-16ae1999e1d3","indicator--6524353a-69cb-5b93-8af2-f029b6fa30c4","ipv4-addr--a4c77de9-6bf1-529e-857e-aab8c9feb7dd","indicator--5e14ab82-4321-5e6a-aa9d-4268660582f7","url--0b424b74-4540-50d1-856d-b1fa5b0164d6","indicator--f57e36bc-7ca3-5214-be24-af3acd569a35","ipv4-addr--27a39c75-d256-5032-9ff3-48116ee7c646","indicator--21d4c561-99c4-566d-989d-465f38fe1f38","url--5e91b52f-1ac9-5f07-872b-d27f1b379ccc","indicator--b1760b21-ba6b-52ef-890c-dcd1310481d6","ipv4-addr--0e79572b-94dd-59fc-8cfe-ef3e42e209e8","indicator--ddc70bac-f75d-55f2-8bb4-ecc17d65e4ff","url--64f85b1c-e4e4-5ce9-9271-2498d71c3436","indicator--a4e31391-482c-591d-916c-61a7dc8559e9","ipv4-addr--dc5e7136-49b3-5eb4-9b6b-cbd13777410f","indicator--f975a978-358b-5332-897c-00b7f9368959","url--4ef45f54-6cae-56cc-b671-24501e3137a8","indicator--21ec7e56-f719-5f0f-8d82-9a2b4e227f7e","ipv4-addr--c161e84b-b5de-5987-9b2b-be19b3424683","indicator--a3591e70-9aa1-5b15-9cd2-4b3a57f79f2c","url--c2ebe54c-0930-5975-a7a9-66aea93fcb9e","indicator--a6ad73cc-6464-5f6c-b040-ae9efae90169","ipv4-ad
dr--41ef62e2-9617-5dad-84a2-c247a55c6e24","indicator--d0a3f088-dc47-53d9-a963-31d9a9ff269a","url--0920b3e3-fbf7-5ce6-b423-a3272e44ddd4","indicator--f6c74f53-6cbb-5bcd-851d-472e2b47f19a","ipv4-addr--5c58b2c9-04ef-5481-ae2d-fea8190314ae","indicator--6a192b60-d952-5cd2-8bc9-5d2876ec7964","url--f14fa5b0-b52b-51c7-a7ae-73103c85fb18","indicator--b8e89838-103a-5b88-8d06-f003901c4af9","ipv4-addr--1e6f2f1b-7778-5ae4-8f40-d0233c7c0069","indicator--63f47d34-c680-5f60-a048-f872ca172ace","url--5e0b640b-6cbc-5361-84d3-ab3381fc1c96","indicator--f6ea5839-1c0b-5bd1-a9c6-4f4bcb4c22b3","ipv4-addr--93d5c645-65e5-5828-8f6f-16f16b48eb26","indicator--aef4236d-013a-580c-9603-778e9328af82","url--d80ef5fc-27d4-5cb0-b91a-c502933a77e7","indicator--4b688dbc-6978-5964-89c0-a07a6c35ec1f","ipv4-addr--bf1413d9-1079-523d-a1e9-97c02daa0621","indicator--65339f3d-d029-5c69-adbf-6f95b19ae7ad","url--04d1be52-c362-59e5-9002-705ff7a5ae7a","indicator--a2fcdea7-fcaf-56d2-b905-639cc679a5b6","ipv4-addr--c4508f4e-6619-56b5-baec-2cc36accb1d7","indicator--e5cd3d1b-2e58-539a-a2a0-5c2a57bf1718","url--c415cd4c-4e9b-58ef-9283-f0844ce1eb6d","indicator--e29a24d8-4818-54cc-b9c2-488fcdeee5ac","ipv4-addr--f0325f38-fbd6-5028-94b4-6e54aa1abf01","indicator--8d88864a-927d-5a34-8cb7-221c19c2cba7","url--fca0ccf7-e3bb-5598-a6c4-f66f18df5438","indicator--e2f3d118-1135-51f7-bc35-cb8fcbe7901d","ipv4-addr--475ef6c5-3bda-5150-8f7f-07c2c6aaae95","indicator--ad338569-eefe-5037-9917-636ba397597c","url--26e56cfe-0c21-5efc-b56b-8af3581f2000","indicator--19756c22-3a43-5ede-84f2-8b731ab5606d","ipv4-addr--8f6da827-eb85-50e3-ab7f-55464817e131","indicator--dcf8f3e1-b8cd-5374-9171-b8b7e085d523","url--00ffedff-f6ae-5bd6-96a4-0cc9c9640738","indicator--6798880a-f26d-565f-bcae-b0b128ddbced","ipv4-addr--ab0aff98-08c1-5ddd-927e-ff9c14ad9706","indicator--6a7679eb-f3cd-5427-a324-2a35d70c6cca","url--d70b518d-451c-5f9a-bd44-13e4571ea828","indicator--e6815325-2040-59cd-b178-a9ff164c0e9f","ipv4-addr--500c857b-0920-5c02-a63a-2242612c10bc","indicator--2671821c-3788-52b7-9c7f-6cd85c45c08e","url--5f8bb273-bd16-5b38-acda-08e6c98123b8","indicator--549f75c9-2b1f-5ead-a20d-9a9660b81565","ipv4-addr--1884c1cc-6f89-5e29-b7e1-162d9bae8e71","indicator--07d415ec-9618-530f-a7dc-3a10574e1dd9","url--1cdd3bf3-7630-55a8-98df-ee6d066870ab","indicator--d404eeef-acc7-5bc8-9d69-27ab8070c765","ipv4-addr--8a70ac3b-403d-5da5-99df-7883c0a5a1ae","indicator--28d73ac1-c160-5a38-aced-4f3c76b7775a","url--91205748-dcc3-5bf6-999d-31b28b93f3c3","indicator--e879a159-a996-5302-ad9e-2fbd6f910de8","ipv4-addr--258b3959-70cd-5702-8a11-88b23adaa045","indicator--22ae5c26-aceb-5b13-8c89-016182b1e00a","url--294664fd-bd00-5452-befe-d477bffd6b71","indicator--481d9d10-6c01-5182-b9dc-8a8f2184ef6b","ipv4-addr--9c39e097-a31a-587e-a325-4ef9327b5afc","indicator--7e9fe6be-98c5-56ba-be6d-fc55cc0d663f","url--f0d5f38f-ab10-5188-ba4b-fd48654a18f7","indicator--2e2571b8-93fc-520b-a449-d674d5759d4d","ipv4-addr--5d047228-bb75-5618-bd58-30f6975a6473","indicator--8268f156-a1d4-5139-9e05-1117b1550141","url--10fa84ec-c17d-5354-b721-0ebc815bad14","indicator--74fc49c2-6fe1-5500-9b07-5260162581b3","url--f0c543cd-e5c0-5bc3-84d5-e53fbca25ecb","text--da8a6c59-7184-5235-8b03-0033385b02ff"],"published":"2023-06-24T00:00:00.000Z","report_types":["misp-event"],"revoked":null,"stix_id":"report--879fc70e-60d3-559b-bab9-bc55d1d4aab2","update":false,"x_opencti_reliability":null,"x_opencti_stix_ids":null,"x_opencti_workflow_id":null},"unresolvedIds":["indicator--846d17a6-6cc9-50ab-9e81-fe6a5ff40116","indicator--d43d5d42-bf36-55c6-a635-6a8d50cfa04f","indicator-
-1ea9c640-1f8a-5735-be2c-02b0d98aa74d"]},"message":"Element(s) not found","name":"MISSING_REFERENCE_ERROR","stack":"MISSING_REFERENCE_ERROR: Element(s) not found\n    at error (/opt/opencti/build/src/config/errors.js:8:10)\n    at MissingReferenceError (/opt/opencti/build/src/config/errors.js:117:48)\n    at inputResolveRefsFn (/opt/opencti/build/src/database/middleware.js:746:13)\n    at createEntityRaw (/opt/opencti/build/src/database/middleware.js:3151:25)\n    at createEntity (/opt/opencti/build/src/database/middleware.js:3279:16)\n    at addReport (/opt/opencti/build/src/domain/report.js:122:19)"}],"inner_relation_creation":1,"level":"warn","message":"Element(s) not found","operation":"ReportAdd","operation_query":"mutation ReportAdd($input:ReportAddInput!){reportAdd(input:$input){id standard_id entity_type parent_types}}","size":1121675,"time":73028,"timestamp":"2024-04-08T14:47:57.163Z","type":"WRITE_ERROR","user":{"applicant_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","call_retry_number":"2","group_ids":["cc7ea4f2-08f4-458a-98ef-bf4769b70f39","89ce1c73-5a8b-4aef-ba35-ef1744f9bdd4","5adea5e7-d5f3-4b71-b555-c0565fd77d09","b6e28f8d-9aaa-4c37-9e98-0e6f46f4c8d2","241d8779-43c1-4dd6-bd55-a164106ae650"],"ip":"::ffff:172.25.0.31","organization_ids":[],"socket":"query","user_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","user_metadata":{}},"version":"6.0.9"}
worker-3       | {"timestamp": "2024-04-08T14:47:59.420133Z", "level": "INFO", "name": "worker", "message": "Message reprocess", "taskName": null, "attributes": {"tag": 39443, "count": 2}}
worker-3       | {"timestamp": "2024-04-08T14:47:59.442888Z", "level": "INFO", "name": "api", "message": "Importing an object", "taskName": null, "attributes": {"type": "report", "id": "report--879fc70e-60d3-559b-bab9-bc55d1d4aab2"}}worker-3       | {"timestamp": "2024-04-08T14:47:59.443200Z", "level": "INFO", "name": "api", "message": "Creating Report", "taskName": null, "attributes": {"name": "DigitalSide Malware report: MD5: 9b6c3518a91d23ed77504b5416bfb5b3"}}
worker-4       | {"timestamp": "2024-04-08T14:48:01.454648Z", "level": "INFO", "name": "api", "message": "Report expectation", "taskName": null, "attributes": {"work_id": "work_acef2b25-306a-4bc5-b79d-a1e845e47b95_2024-04-08T14:47:27.695Z"}}worker-4       | {"timestamp": "2024-04-08T14:48:02.281154Z", "level": "INFO", "name": "worker", "message": "Message processed, thread terminated", "taskName": null}worker-4       | {"timestamp": "2024-04-08T14:48:02.281372Z", "level": "INFO", "name": "worker", "message": "Message acknowledged", "taskName": null, "attributes": {"tag": 7303}}
ckane commented 5 months ago

Also, I actually think that the platform is consuming incoming data from all of the other connectors fairly promptly, so after I made the changes to cluster the front-end (so that the workers no longer talk to the user-visible process), this problem is now limited to these few really large reports from DigitalSide failing to be ingested.

I'm going to update my fork of the docker project to deploy in this manner by default, because it is a much more resilient and responsive approach than using a single platform process.

jasgggit commented 5 months ago

@ckane can you kindly share that with us?

As we have a large MISP dataset ... and my question was indeed how I can 'cluster' this. Many thanks for all of your (team's) insights. Regards. JG

ckane commented 5 months ago

@ckane can you kindly share that with us?

As we have a large MISP dataset ... and my question was indeed how I can 'cluster' this. Many thanks for all of your (team's) insights. Regards. JG

I maintain a public fork of the docker repository where my docker-compose.yml is populated with a bunch of OSINT connectors and some additional tweaks that I've found useful. I'll follow up later tonight when I update it, but here's a rough idea for you:

...
  opencti-int:
    image: opencti/platform:${OPENCTI_VERSION}
    environment:
      - NODE_OPTIONS=--max-old-space-size=8096
      - APP__PORT=8070
      - APP__BASE_URL=${OPENCTI_BASE_URL}
      - APP__ADMIN__EMAIL=${OPENCTI_ADMIN_EMAIL}
      - APP__ADMIN__PASSWORD=${OPENCTI_ADMIN_PASSWORD}
      - APP__ADMIN__TOKEN=${OPENCTI_ADMIN_TOKEN}
      - APP__APP_LOGS__LOGS_LEVEL=info
      - REDIS__HOSTNAME=dragonfly
      - REDIS__PORT=6379
      - REDIS__TRIMMING=100000
      - ELASTICSEARCH__URL=http://elasticsearch:9200
      - ELASTICSEARCH__MAX_CONCURRENCY=50
      - MINIO__ENDPOINT=${S3_ENDPOINT}
      - MINIO__PORT=443
      - MINIO__USE_SSL=true
      - MINIO__USE_AWS_ROLE=true
      - MINIO__ACCESS_KEY=
      - MINIO__SECRET_KEY=
      - MINIO__BUCKET_NAME=${S3_BUCKET}
      - RABBITMQ__HOSTNAME=rabbitmq
      - RABBITMQ__PORT=5672
      - RABBITMQ__PORT_MANAGEMENT=15672
      - RABBITMQ__MANAGEMENT_SSL=false
      - RABBITMQ__USERNAME=${RABBITMQ_DEFAULT_USER}
      - RABBITMQ__PASSWORD=${RABBITMQ_DEFAULT_PASS}
      - SMTP__HOSTNAME=${SMTP_HOSTNAME}
      - SMTP__PORT=25
      - PROVIDERS__LOCAL__STRATEGY=LocalStrategy
    depends_on:
      - dragonfly
      - elasticsearch
      - rabbitmq
    restart: always
    deploy:
      mode: replicated
      replicas: 3
      endpoint_mode: dnsrr
  opencti:
    image: opencti/platform:${OPENCTI_VERSION}
    environment:
      - NODE_OPTIONS=--max-old-space-size=8096
      - APP__PORT=8080
      - APP__BASE_URL=${OPENCTI_BASE_URL}
      - APP__ADMIN__EMAIL=${OPENCTI_ADMIN_EMAIL}
      - APP__ADMIN__PASSWORD=${OPENCTI_ADMIN_PASSWORD}
      - APP__ADMIN__TOKEN=${OPENCTI_ADMIN_TOKEN}
      - APP__APP_LOGS__LOGS_LEVEL=info
      - REDIS__HOSTNAME=dragonfly
      - REDIS__PORT=6379
      - REDIS__TRIMMING=100000
      - ELASTICSEARCH__URL=http://elasticsearch:9200
      - ELASTICSEARCH__MAX_CONCURRENCY=50
      - MINIO__ENDPOINT=${S3_ENDPOINT}
      - MINIO__PORT=443
      - MINIO__USE_SSL=true
      - MINIO__USE_AWS_ROLE=true
      - MINIO__ACCESS_KEY=
      - MINIO__SECRET_KEY=
      - MINIO__BUCKET_NAME=${S3_BUCKET}
      - RABBITMQ__HOSTNAME=rabbitmq
      - RABBITMQ__PORT=5672
      - RABBITMQ__PORT_MANAGEMENT=15672
      - RABBITMQ__MANAGEMENT_SSL=false
      - RABBITMQ__USERNAME=${RABBITMQ_DEFAULT_USER}
      - RABBITMQ__PASSWORD=${RABBITMQ_DEFAULT_PASS}
      - SMTP__HOSTNAME=${SMTP_HOSTNAME}
      - SMTP__PORT=25
      - PROVIDERS__LOCAL__STRATEGY=LocalStrategy
    ports:
      - "8080:8080"
    depends_on:
      - dragonfly
      - elasticsearch
      - rabbitmq
    restart: always
  worker:
    image: opencti/worker:${OPENCTI_VERSION}
    environment:
      - OPENCTI_URL=http://opencti-int.:8070
      - OPENCTI_TOKEN=${OPENCTI_ADMIN_TOKEN}
      - WORKER_LOG_LEVEL=info
    depends_on:
      - opencti
    deploy:
      mode: replicated
      replicas: 6
    restart: always
...

Note the following:

If you wanted to, you can also cluster the opencti front-end (for a lot of end-user traffic); you just (I think) need to use vip instead of dnsrr and then have the service listen on ports that are different from what is exposed, because the port exposed on the host side (the one the proxy/CloudFront will connect to) will conflict with all of the internal services.

But, as I said, I'll work up a modification to my config to incorporate both of these changes, as well as an update to the redis service in light of that project's restrictive licensing changes (I've found that both valkey and dragonflydb work really well, the latter having some performance benefits over valkey/redis, when tuned appropriately).

jasgggit commented 5 months ago

@ckane Thank you so much for your insights and help. I understood (I think) how I can manage our deployment, since we use it on an internal server. I'll look for your updates on your fork too.

Regards. JG

richard-julien commented 5 months ago

Thanks for all the information @ckane, I will investigate all of this ASAP.

ckane commented 5 months ago

@ckane Thank you so much for your insights and help. I understood (I think) how I can manage our deployment, since we use it on an internal server. I'll look for your updates on your fork too.

Regards. JG

Hi @jasgggit - I was mistaken about how to scale the UI service (opencti) horizontally using docker compose: It turns out that the only way I could make it work was to make it an internal-only service, and then expose a new nginx load balancer that forwards the traffic (and uses DNS round-robin, like the opencti-int container).

Here is my fork of https://github.com/OpenCTI-Platform/docker, which handles scaling for both services via the .env variables OPENCTI_FRONT_END_REPLICAS and OPENCTI_BACK_END_REPLICAS, which default to 1 and 3, respectively:

Note that it requires the nginx.conf file to be in the current working directory, and also that this configuration presumes using AWS S3, so if you want the MinIO server, it'll need to be added and both opencti instances will need to be reconfigured.

ckane commented 5 months ago

@jasgggit - I made a small change to my setup again (the repository branch is updated): when the worker nodes cause the opencti-int nodes to stall, those nodes become unavailable to any connectors. In order to fix this, I created a separate (third) opencti/platform cluster just for the connectors to talk to, with a default replica count of 1. This way, the connectors always have an available platform node regardless of what the worker nodes are doing.

In my setup, the unavailability of the opencti-int nodes was causing the connectors to go red on the Ingestion->Connectors page, which often prevented them from injecting more work. This change fixes that issue.

jasgggit commented 5 months ago

@ckane thank you for that, I ll look into it, Regards. JG

richard-julien commented 5 months ago

@ckane, CPU problem reproduced; it is tracked in https://github.com/OpenCTI-Platform/opencti/issues/6638