mahmut-the-guy opened 7 months ago
Saturating the CPU, and with it the Node.js event loop, is the cause of the locking failure and the Redis connection reset. Depending on your data volume / processing, you have to size the CPU differently to prevent this kind of situation.
What I meant was that it was utilizing 1 CPU at all times; my machine has 8 CPUs, so I don't think it has anything to do with CPU limits (i.e., not enough processing power). Notice that during this time no data ingestion is happening at all. I disconnected all workers and connectors as well.
Node.js is not a multithreaded process, so it will use only 1 CPU, not all 8. Depending on your usage and the CPU capability, this can lead to contention. To resolve it, we recommend moving to a clustered deployment as documented here: https://docs.opencti.io/latest/deployment/clustering/ Hope it helps.
Yes, I explicitly allocated exactly 1 CPU to the process for that reason; the thing is, there is no load on the system: no external connectors or workers (so it does not make any sense to scale it). Even if we activated workers and connectors, they wouldn't be able to insert any data due to the unresponsive platform.
We tried scaling the number of platforms too; that does not help either:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
5bbc85e5114e opencti_opencti.2.qixtzy5moocvpa640sxqd09il 8.06% 577.6MiB / 15.61GiB 3.61% 1.68MB / 273kB 84.3MB / 168kB 16
e2c81d93f48f opencti_opencti.4.04jlo2zidv3didecuagaqi2pt 82.64% 2.128GiB / 15.61GiB 13.63% 20.3MB / 219MB 637MB / 152MB 16
24d5159d676d opencti_opencti.5.v8fcokys4yngeyjlyznkxz4y2 80.87% 1.194GiB / 15.61GiB 7.65% 15MB / 5.79MB 1.1GB / 4.94MB 16
e6bf994a86d3 opencti_opencti.3.39t9ukn8oc6iwje57o7r2zu6x 81.89% 4.366GiB / 15.61GiB 27.97% 33.9MB / 417MB 436MB / 408MB 16
f19fbd87df97 opencti_minio.1.j6ch0p5kbipqu699ri7altdr2 0.06% 130.3MiB / 15.61GiB 0.81% 15.2MB / 68.9kB 179MB / 16.8MB 13
74cd09b7a602 opencti_opencti.1.mh0owhlm4gayvxrob74h73ykj 79.98% 3.909GiB / 15.61GiB 25.04% 53.5MB / 419MB 483MB / 387MB 16
95117a1fe447 opencti_rabbitmq.1.n1p3vbgoyjyory6g4b2l9vb9v 1.27% 132.8MiB / 15.61GiB 0.83% 659kB / 611kB 1.43GB / 1.33MB 38
6bf6d2acbcc5 opencti_redis.1.twz1readcco6ex77pl5a27qel 0.53% 10.44MiB / 3GiB 0.34% 5.36MB / 11.4MB 574MB / 5.58MB 6
That's not an issue with internal locking; it's an issue with the CPU usage of the Node.js process. It's hard to help you with that: you need to try to isolate the problem. I know it's hard, but you have to scale down everything (no connectors, no workers) and check the platform's behavior. And there is more than just the workers doing a lot of processing in the platform: check whether you have background tasks running, check whether you have activated the rule engine and the current state of its processing, etc. When I talk about platform clustering, it's more about isolating the ingestion / rule engine / frontend processes in order to scale the different behaviors of the platform.
Hello, we are having an issue with ingestion from our MISP platform; after some time we become unable to use the platform.
Main | Reports 494 | 395.3K Indicators
System: 96 GB RAM / 18 vCPU / 300 GB SSD, Docker on a single server. 6 GB allocated to ELK; everything else is at defaults, and the recommended ELK settings are active.
Question: for now I just want to know how I can cluster this using Docker Compose, to see if it's the same issue as above.
Thanks, and sorry if this is not the correct place, but this thread was the most recent one I found similar to my issue.
Regards. JG
I am also experiencing similar behavior on my instance; here's an excerpt of the opencti logs around the time of the locking failure:
opencti-1 | {"category":"APP","level":"warn","locks":["{locks}:relationship--bc89973c-32c7-409e-be65-068fb89006c8","{locks}:relationship--49727584-be95-5f85-9f0a-01965d759f23","{locks}:04ce7a5a-4e53-495b-aecc-cc20c764a509","{locks}:6df56458-c917-4321-bcc7-fe07dfedec35"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:17:24.889Z","version":"6.0.9"}
opencti-1 | {"category":"APP","errors":[{"attributes":{"genre":"BUSINESS","http_status":500,"participantIds":["relationship--b3005ac8-9396-42a3-9e10-839fd7662a48","relationship--d5576cc7-fbc9-55ae-9b3f-03e635d3b3c7","02312652-b6e7-4d42-b2f1-a6e037dd6eae","6df56458-c917-4321-bcc7-fe07dfedec35"]},"message":"Execution timeout, too many concurrent call on the same entities","name":"LOCK_ERROR","stack":"LOCK_ERROR: Execution timeout, too many concurrent call on the same entities\n at error (/opt/opencti/build/src/config/errors.js:8:10)\n at LockTimeoutError (/opt/opencti/build/src/config/errors.js:104:51)\n at createRelationRaw (/opt/opencti/build/src/database/middleware.js:2921:13)\n at processTicksAndRejections (node:internal/process/task_queues:95:5)\n at createRelation (/opt/opencti/build/src/database/middleware.js:2929:16)\n at addStixCoreRelationship (/opt/opencti/build/src/domain/stixCoreRelationship.js:91:19)"}],"inner_relation_creation":1,"level":"warn","message":"Execution timeout, too many concurrent call on the same entities","operation":"StixCoreRelationshipAdd","operation_query":"mutation StixCoreRelationshipAdd($input:StixCoreRelationshipAddInput!){stixCoreRelationshipAdd(input:$input){id standard_id entity_type parent_types}}","size":666,"time":27276,"timestamp":"2024-04-05T21:17:31.937Z","type":"WRITE_ERROR","user":{"applicant_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","call_retry_number":"1","group_ids":["cc7ea4f2-08f4-458a-98ef-bf4769b70f39","89ce1c73-5a8b-4aef-ba35-ef1744f9bdd4","5adea5e7-d5f3-4b71-b555-c0565fd77d09","b6e28f8d-9aaa-4c37-9e98-0e6f46f4c8d2","241d8779-43c1-4dd6-bd55-a164106ae650"],"ip":"::ffff:172.25.0.7","organization_ids":[],"socket":"query","user_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","user_metadata":{}},"version":"6.0.9"}
opencti-1 | {"category":"APP","level":"info","message":"[OPENCTI] Expiration manager will revoke 470 elements","timestamp":"2024-04-05T21:17:40.450Z","version":"6.0.9"}
opencti-1 | {"category":"APP","level":"info","message":"[OPENCTI-MODULE] Running notification manager (digest)","timestamp":"2024-04-05T21:17:54.179Z","version":"6.0.9"}
opencti-1 | {"category":"APP","errors":[{"attributes":{"genre":"BUSINESS","http_status":500,"participantIds":["relationship--4c64f961-df4e-4b64-a2fb-c3792d54aa14","relationship--3163926b-c824-58ff-b3c1-53ce19405b72","5f3a1a81-254a-4329-8e21-a761cca1a084","6df56458-c917-4321-bcc7-fe07dfedec35"]},"message":"Execution timeout, too many concurrent call on the same entities","name":"LOCK_ERROR","stack":"LOCK_ERROR: Execution timeout, too many concurrent call on the same entities\n at error (/opt/opencti/build/src/config/errors.js:8:10)\n at LockTimeoutError (/opt/opencti/build/src/config/errors.js:104:51)\n at createRelationRaw (/opt/opencti/build/src/database/middleware.js:2921:13)\n at processTicksAndRejections (node:internal/process/task_queues:95:5)\n at createRelation (/opt/opencti/build/src/database/middleware.js:2929:16)\n at addStixCoreRelationship (/opt/opencti/build/src/domain/stixCoreRelationship.js:91:19)"}],"inner_relation_creation":1,"level":"warn","message":"Execution timeout, too many concurrent call on the same entities","operation":"StixCoreRelationshipAdd","operation_query":"mutation StixCoreRelationshipAdd($input:StixCoreRelationshipAddInput!){stixCoreRelationshipAdd(input:$input){id standard_id entity_type parent_types}}","size":666,"time":28083,"timestamp":"2024-04-05T21:17:58.732Z","type":"WRITE_ERROR","user":{"applicant_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","call_retry_number":"1","group_ids":["cc7ea4f2-08f4-458a-98ef-bf4769b70f39","89ce1c73-5a8b-4aef-ba35-ef1744f9bdd4","5adea5e7-d5f3-4b71-b555-c0565fd77d09","b6e28f8d-9aaa-4c37-9e98-0e6f46f4c8d2","241d8779-43c1-4dd6-bd55-a164106ae650"],"ip":"::ffff:172.25.0.13","organization_ids":[],"socket":"query","user_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","user_metadata":{}},"version":"6.0.9"}
It appears that the OpenCTI build/back.js process gets into a situation where it is running at 100% of its CPU allocation, almost as if it is stuck on a spinlock. The lock (I think) eventually times out as a failure, and the platform then moves forward with some more ingestion until it runs into the deadlocking condition once again. Since build/back.js is single-threaded, this causes everything else on the system to idle, because nothing is making requests while it is deadlocked like this.
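Since build/back.js is just a Node.js program, a generic way to confirm that the event loop itself is being starved (rather than the process merely waiting on I/O) is to measure event-loop lag: schedule a timer and see how late it actually fires. This is a general-purpose diagnostic sketch, not something OpenCTI ships; Node's built-in `perf_hooks.monitorEventLoopDelay()` provides the same measurement more precisely.

```javascript
// Generic Node.js event-loop lag probe: if a timer scheduled every
// `interval` ms fires much later than expected, something was hogging
// the (single) event-loop thread in the meantime.
function measureEventLoopLag(interval, callback) {
  let last = process.hrtime.bigint();
  const timer = setInterval(() => {
    const now = process.hrtime.bigint();
    const lagMs = Number(now - last) / 1e6 - interval;
    last = now;
    callback(Math.max(0, lagMs));
  }, interval);
  timer.unref(); // don't keep the process alive just for the probe
  return timer;
}

// Example: warn whenever the loop stalls for more than 100 ms.
measureEventLoopLag(50, (lag) => {
  if (lag > 100) console.warn(`event loop blocked for ~${lag.toFixed(0)} ms`);
});
```

If the lag readings jump into the hundreds of milliseconds (or seconds) during the stall, that would be consistent with the spinlock-like behavior described above.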
Also some more logs during failure:
opencti-1 | {"category":"APP","level":"warn","locks":["{locks}:495bd52d-fdd8-43f2-8879-a8cb2309a508","{locks}:indicator--e14c849b-9486-5b9a-8939-b2d7ebcd0bcf"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:27:18.481Z","version":"6.0.9"}
opencti-1 | {"category":"APP","level":"warn","locks":["{locks}:5a97cb52-3488-44d2-8265-ae9dbe236822","{locks}:indicator--e136d19d-b7ad-593b-a590-24479b7cb716"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:27:18.564Z","version":"6.0.9"}
opencti-1 | {"category":"APP","level":"warn","locks":["{locks}:e82a61a0-d21f-432a-8f35-284e49d4a395","{locks}:indicator--e0873c7d-90b8-5f3a-ac8e-489bc7481fed"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:27:18.786Z","version":"6.0.9"}
opencti-1 | {"category":"APP","level":"warn","locks":["{locks}:2fbbee24-723e-46f6-9974-ccbecf2198ec","{locks}:indicator--e215f92a-56ae-55bc-8108-e92396fa080a"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:27:18.828Z","version":"6.0.9"}
opencti-1 | {"category":"APP","level":"warn","locks":["{locks}:21d9f25b-4e8f-436e-83f5-598175364d00","{locks}:indicator--ddc895c0-3b1c-5204-8372-7d571a3d38be"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:27:18.904Z","version":"6.0.9"}
opencti-1 | {"category":"APP","level":"warn","locks":["{locks}:relationship--4c2252a5-d69a-47e2-96d2-0eeb462f387e","{locks}:relationship--e44caef3-bbb1-5eec-9dfc-10778d2d3246","{locks}:e56769cf-0604-449b-9db3-4936c0041089","{locks}:5ea352de-9a2d-4b36-8fab-de77190f06d0"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:27:20.851Z","version":"6.0.9"}
opencti-1 | {"category":"APP","level":"warn","locks":["{locks}:relationship--9d180686-b12b-4028-9b3c-39152bbe1568","{locks}:relationship--137d74f0-da5d-562b-8ca0-8cfa41277292","{locks}:3f960d96-e22a-4676-b4b4-d634d464e468","{locks}:6df56458-c917-4321-bcc7-fe07dfedec35"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:27:21.266Z","version":"6.0.9"}
opencti-1 | {"category":"APP","level":"warn","locks":["{locks}:file_index_manager_lock"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:27:21.375Z","version":"6.0.9"}
opencti-1 | {"category":"APP","level":"info","message":"[OPENCTI-MODULE] Running rule manager from 1712351880443-0","timestamp":"2024-04-05T21:27:26.861Z","version":"6.0.9"}
opencti-1 | {"category":"APP","level":"info","message":"[STREAM] Starting stream processor at 1712351880443-0 for Rule manager","timestamp":"2024-04-05T21:27:26.861Z","version":"6.0.9"}
opencti-1 | {"category":"APP","level":"info","message":"[REDIS] Redis 'Rule manager' client ready","timestamp":"2024-04-05T21:27:26.873Z","version":"6.0.9"}
opencti-1 | {"category":"APP","level":"warn","locks":["{locks}:expired_manager_lock"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:27:34.104Z","version":"6.0.9"}
opencti-1 | {"category":"APP","errors":[{"attributes":{"genre":"BUSINESS","http_status":500,"participantIds":["relationship--1bdba7c3-91f2-416c-9908-b3cdff061730","relationship--2a191689-e184-563b-8d22-41b8fb170723","862f409e-c77d-4f69-b413-9b7c5517acb7","6df56458-c917-4321-bcc7-fe07dfedec35"]},"message":"Execution timeout, too many concurrent call on the same entities","name":"LOCK_ERROR","stack":"LOCK_ERROR: Execution timeout, too many concurrent call on the same entities\n at error (/opt/opencti/build/src/config/errors.js:8:10)\n at LockTimeoutError (/opt/opencti/build/src/config/errors.js:104:51)\n at createRelationRaw (/opt/opencti/build/src/database/middleware.js:2921:13)\n at processTicksAndRejections (node:internal/process/task_queues:95:5)\n at createRelation (/opt/opencti/build/src/database/middleware.js:2929:16)\n at addStixCoreRelationship (/opt/opencti/build/src/domain/stixCoreRelationship.js:91:19)"}],"inner_relation_creation":1,"level":"warn","message":"Execution timeout, too many concurrent call on the same entities","operation":"StixCoreRelationshipAdd","operation_query":"mutation StixCoreRelationshipAdd($input:StixCoreRelationshipAddInput!){stixCoreRelationshipAdd(input:$input){id standard_id entity_type parent_types}}","size":666,"time":29188,"timestamp":"2024-04-05T21:27:42.077Z","type":"WRITE_ERROR","user":{"applicant_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","call_retry_number":"1","group_ids":["cc7ea4f2-08f4-458a-98ef-bf4769b70f39","89ce1c73-5a8b-4aef-ba35-ef1744f9bdd4","5adea5e7-d5f3-4b71-b555-c0565fd77d09","b6e28f8d-9aaa-4c37-9e98-0e6f46f4c8d2","241d8779-43c1-4dd6-bd55-a164106ae650"],"ip":"::ffff:172.25.0.8","organization_ids":[],"socket":"query","user_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","user_metadata":{}},"version":"6.0.9"}
opencti-1 | {"category":"APP","errors":[{"attributes":{"genre":"BUSINESS","http_status":500,"participantIds":["relationship--16e47284-f878-4880-ae87-e28adddf6ac6","relationship--811b0ccb-3d2a-51f3-82a3-52631c898069","448b8d5c-127c-43cc-867a-eaa6969da0f7","6df56458-c917-4321-bcc7-fe07dfedec35"]},"message":"Execution timeout, too many concurrent call on the same entities","name":"LOCK_ERROR","stack":"LOCK_ERROR: Execution timeout, too many concurrent call on the same entities\n at error (/opt/opencti/build/src/config/errors.js:8:10)\n at LockTimeoutError (/opt/opencti/build/src/config/errors.js:104:51)\n at createRelationRaw (/opt/opencti/build/src/database/middleware.js:2921:13)\n at processTicksAndRejections (node:internal/process/task_queues:95:5)\n at createRelation (/opt/opencti/build/src/database/middleware.js:2929:16)\n at addStixCoreRelationship (/opt/opencti/build/src/domain/stixCoreRelationship.js:91:19)"}],"inner_relation_creation":1,"level":"warn","message":"Execution timeout, too many concurrent call on the same entities","operation":"StixCoreRelationshipAdd","operation_query":"mutation StixCoreRelationshipAdd($input:StixCoreRelationshipAddInput!){stixCoreRelationshipAdd(input:$input){id standard_id entity_type parent_types}}","size":666,"time":27776,"timestamp":"2024-04-05T21:27:47.745Z","type":"WRITE_ERROR","user":{"applicant_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","call_retry_number":"1","group_ids":["cc7ea4f2-08f4-458a-98ef-bf4769b70f39","89ce1c73-5a8b-4aef-ba35-ef1744f9bdd4","5adea5e7-d5f3-4b71-b555-c0565fd77d09","b6e28f8d-9aaa-4c37-9e98-0e6f46f4c8d2","241d8779-43c1-4dd6-bd55-a164106ae650"],"ip":"::ffff:172.25.0.12","organization_ids":[],"socket":"query","user_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","user_metadata":{}},"version":"6.0.9"}
opencti-1 | {"category":"APP","level":"error","manager":"NOTIFICATION_MANAGER","message":"[REDIS] Failed to extend resource","timestamp":"2024-04-05T21:27:49.633Z","version":"6.0.9"}
opencti-1 | {"category":"APP","level":"warn","locks":["{locks}:notification_digest_manager_lock"],"message":"Failed to unlock resource","timestamp":"2024-04-05T21:27:49.637Z","version":"6.0.9"}
Where are these {locks}:relationship--49727584-be95-5f85-9f0a-01965d759f23 objects stored?
Ok, it looks like they are stored in Redis too. So, an interesting thing I am noticing: when everything is running normally, if I run the following command I get a varying-sized list of keys, due to a lot of create/delete(?) activity occurring for {locks} keys:
docker compose exec valkey valkey-cli keys {locks}\*
For example:
337) "{locks}:indicator--35a12b80-a04a-427c-9ed6-e113cea1f57b"
338) "{locks}:indicator--38d144eb-1210-4347-a461-8e5207a9b385"
339) "{locks}:rule_engine_settings"
340) "{locks}:indicator--ccba8c15-d81d-4634-8511-1b55f9c887c3"
341) "{locks}:rule_engine_lock"
342) "{locks}:indicator--212c45ae-4b91-4223-9bc3-79cbf80c5846"
343) "{locks}:indicator--ccd215ad-31a5-4868-8bf6-678d2f0d8856"
344) "{locks}:indicator--9eb0b826-8123-4fdf-9aae-cf64c12f4faf"
345) "{locks}:indicator--3980152b-53fb-4a9a-909d-c4595e09c4a0"
346) "{locks}:indicator--f04e1ca4-45a0-4d94-ac53-f99a9111d082"
347) "{locks}:indicator--993e149b-7dc7-4b23-942e-008bda96e199"
348) "{locks}:indicator--8c6364f6-4fdd-452b-b477-74a05251b22d"
349) "{locks}:indicator--4dbb3ee9-4322-4b8b-9798-bade00edad56"
350) "{locks}:indicator--6c3bbbc4-9d96-4183-8baf-09c39849d5fd"
351) "{locks}:indicator--56fa3bf6-62a6-4f57-bbd1-0941f1fe0ffd"
352) "{locks}:indicator--50c91e64-6413-46f4-b3a4-45214867994c"
353) "{locks}:indicator--a93f5adc-9a47-4156-b25f-6dce1e84bd9a"
354) "{locks}:indicator--5ea314d3-7bf7-44ad-985b-712c74cec0da"
355) "{locks}:indicator--0cab01de-56d4-44eb-816a-b32da9bb07d9"
356) "{locks}:indicator--763cb756-5d95-4f9b-992b-ef465f7ee31d"
357) "{locks}:indicator--891b2f6f-9286-4c8e-b825-249365b6b09f"
358) "{locks}:indicator--5e6d6d87-20b9-4d62-bd33-a46956382f6c"
359) "{locks}:indicator--a181b8ad-1bc9-4013-a5c1-74f6406fe995"
360) "{locks}:indicator--9173c1d0-4544-449b-ae26-79783d06bde1"
361) "{locks}:indicator--caa099f5-83b4-43e8-9564-de1a615aeb4e"
362) "{locks}:indicator--b81a85bc-fdae-4014-a7d2-aebbc41f88f5"
363) "{locks}:indicator--20d66b5c-950e-4647-af1e-9704b6f26c77"
364) "{locks}:indicator--3baf30cb-5b6c-4a72-8106-8f4f11416752"
365) "{locks}:indicator--06f803e8-152f-4557-83a2-e78dac79cfae"
366) "{locks}:05696b01-2a02-411b-9bec-7c88c76f3e08"
367) "{locks}:indicator--fd1215d2-f184-4340-ba16-c52cc2d10335"
368) "{locks}:indicator--08cc7831-40b0-4e5b-a7da-902a4461c888"
369) "{locks}:indicator--54b00e41-9dd3-4f40-bffe-b11e38536a1c"
370) "{locks}:indicator--907f7eb7-7410-4b0a-98a7-18a1f2e959a2"
371) "{locks}:indicator--822f0251-26d4-44d9-81f8-e00e3e7e9e59"
372) "{locks}:indicator--63bdc399-5bce-455e-9f00-9f274a771757"
373) "{locks}:indicator--31101938-c8ab-44e2-bbb9-64af94a852ef"
374) "{locks}:indicator--737f5966-d1d2-4d15-839a-68543acbdb76"
375) "{locks}:indicator--49312f40-3d2d-4bb9-8653-bc9286f91917"
376) "{locks}:indicator--dd2d5f70-41b0-4635-9add-e51e35fd3032"
377) "{locks}:indicator--72e1b0a9-e77b-42e7-841f-d6f69f824d5f"
378) "{locks}:indicator--78ee5b03-4f69-479c-a730-290af1ef50a9"
379) "{locks}:indicator--bd489cba-e24d-40fb-9117-afd57c793e8f"
380) "{locks}:8e160bb6-5d5c-4e33-9228-f99c81de5e9d"
381) "{locks}:indicator--633126f7-e1e8-466d-8a27-542d049635e2"
However, when the platform deadlocks and reports Failed to unlock resource, if I run the same command while the system is idle, I get no {locks} keys:
(empty array)
Is it possible that the Failed to unlock resource errors are being generated because the relevant lock is being deleted before it gets unlocked?
Does the platform then mistakenly interpret that as lock contention (rather than as an attempt to unlock a non-locked resource, or whatever a non-existent Redis key signifies) and stall/spin for a timeout duration trying to successfully unlock using a key that has already been deleted from Redis?
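For context on what a failed unlock can mean: in the Redlock algorithm (as I understand it), unlock is a compare-and-delete: the key is deleted only if it still holds the random token written when the lock was taken. A minimal in-memory sketch of those semantics (the `MiniRedis` class and token names here are hypothetical, purely for illustration; real redlock performs this check atomically via a Lua script against a live Redis/Valkey server):

```javascript
// Hypothetical in-memory stand-in for the Redis commands a redlock-style
// library relies on. Real redlock runs the compare-and-delete atomically
// in a Lua script.
class MiniRedis {
  constructor() { this.store = new Map(); }
  // SET key token NX: succeed only if the key does not already exist
  setNX(key, token) {
    if (this.store.has(key)) return false;
    this.store.set(key, token);
    return true;
  }
  // Unlock: delete only if the key still holds *our* token
  compareAndDelete(key, token) {
    if (this.store.get(key) !== token) return false; // missing or clobbered
    this.store.delete(key);
    return true;
  }
  expire(key) { this.store.delete(key); } // simulate the TTL firing
}

const redis = new MiniRedis();
redis.setNX('{locks}:indicator--example', 'token-1');
redis.expire('{locks}:indicator--example'); // key expires before unlock
console.log(redis.compareAndDelete('{locks}:indicator--example', 'token-1'));
// -> false: exactly a "Failed to unlock resource" situation
```

Under these semantics a missing (or clobbered) key makes the unlock report failure, but the failure itself is a single fast check, so it would not by itself explain a spin at 100% CPU.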
Of note, we are using node-redlock version 5.0.0-beta2, which was tagged in March 2022. However, the following PRs related to "lock extension" have since been merged into the project's main development branch and appear to fix lock-extension failures, but no new beta release has been tagged since these PRs were merged:
FWIW, I have mitigated the end-user impact on the GUI front end by setting up a second opencti/platform service that listens on unexposed port 8070, and then pointing the worker services at that one instead of the main front end.
Hi @ckane, multiple things in your message.
100% CPU Usage
Failed to extend resource and Failed to unlock resource: We still have some contention points in the platform's CPU usage that we are trying to resolve. When this event occurs and the CPU usage is too high, the platform is not able to extend or release the lock within the correct timing, producing this error. For me, the problem is to find and fix the CPU usage.
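The timing dependency described here can be illustrated with a simulated clock: redlock-style locks carry a TTL and must be re-extended before it elapses, so a process blocked past the expiry finds nothing left to extend. This is a hypothetical sketch of that timing rule, not OpenCTI code; the class and values are invented for illustration.

```javascript
// Simulated TTL lock: extend() must be called before the expiry deadline.
// If the event loop was blocked past the deadline, the extension fails,
// mirroring the "Failed to extend resource" error above.
class SimulatedLock {
  constructor(ttlMs, now) {
    this.ttlMs = ttlMs;
    this.expiresAt = now + ttlMs;
  }
  extend(now) {
    if (now >= this.expiresAt) return false; // already expired: too late
    this.expiresAt = now + this.ttlMs;       // push the deadline forward
    return true;
  }
}

const lock = new SimulatedLock(30000, 0); // 30 s TTL taken at t=0
console.log(lock.extend(15000)); // true: re-extended in time (new deadline t=45000)
console.log(lock.extend(50000)); // false: the process was blocked past the deadline
```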
Execution timeout, too many concurrent call on the same entities
This error occurs when the platform tries to ingest into the same element too often. It's a protection mechanism. We have some ideas and ongoing work to improve that part.
Remaining bugs on the library
Maybe we need to move to https://github.com/sesamecare/redlock, as the initial library now looks inactive. We will run some tests to see if everything is OK with the fork.
I have mitigated the end-user impact on the GUI front-end by setting up a second opencti
That's exactly what we recommend. Splitting the platforms responsible for ingestion from the ones responsible for the GUI is a really good approach.
Thanks. After a bit of work, I was able to get a build of platform/opencti that used redlock from the main branch of that git repository, rather than the 5.0.0-beta.2 tag in the npm registry. It did not appear to fix the behavior, however.
I'll try building a new image using the library you suggested and see if it changes anything. It might also be worth testing again with the non-beta 4.x.x code, but I don't know what unique features in the beta (if any) prompted the update for us.
Still, one of the things that is interesting to me is that when the build/back.js process appears to deadlock, there are no locks present in Redis at the time the failures to unlock occur. Since "unlock" is supposed to be a validated DEL operation, this suggests that the platform expected there to be a lock, but one didn't exist ... possibly for one of the following reasons (according to my understanding of the Redlock algorithm):
1. SET never completed successfully (or never executed at all) (unreported lock failure)
2. SET never completed successfully (or never executed at all), and then another SET for the same lock key-name with a different value occurred and was later unlocked successfully, while the initial lock was assumed to still be in place (basically, a later lock clobbers an earlier lock)
3. SET completed successfully, but then the key expired before the unlock could occur (the lock vanishes before unlock)
4. SET completed successfully and was then successfully unlocked, but the platform still thinks it is locked and is trying to unlock it (double-unlock)
Maybe we need to move to https://github.com/sesamecare/redlock as the initial library looks now inactive. We will do some test to see if everything is ok with the fork.
I have created a PR for this in #6622. I am currently testing the changes on a fresh local instance, and if it appears to work fine, I'll try applying these changes to my community instance.
Testing on my community instance (where the deadlocking is reproducible), the change to @sesamecare-oss/redlock@1.2.1 does still work, but the deadlocking problem remains. I recognize that the problem is likely a result of resource contention on the system... wondering if we need to identify this failure and determine a more graceful recovery.
Ok, I have 4 opencti/platform instances running now, and I dropped my worker count to 3 on this system, and that seems to have cleared up a lot of the above errors. However, I still encounter deadlocking (I even encounter it with only 1 worker), and with some of the other errors cleared, it seems like it often coincides with really large bundles from the DIGITALSIDE.IT misp-feed, resulting in error messages like the attached (which I've extracted from the logs and formatted with jq).
error_out.json
Also, I am curious: why the doubling of the timeout value in this line of code?
So, I learned a few more things. On my system, I have a misp-feed connector that pulls from the MISP feed at https://osint.digitalside.it/Threat-Intel/digitalside-misp-feed/. This feed reports IOCs associated with malware artifacts (identified by hash). When I scanned my logs after leaving it running (and it was accumulating work faster than it was consumed), I found the following DIGITALSIDE reports were being repeatedly injected into the work queue:
7122 59ce0baba11893f90527fc951ac69912
865 9b6c3518a91d23ed77504b5416bfb5b3
4037 a73ddd6ec22462db955439f665cad4e6
22 eec5c6c219535fba3a0492ea8118b397
32 fbe51695e97a45dc61967dc3241a37dc
59ce0baba11893f90527fc951ac69912 corresponds to: https://osint.digitalside.it/Threat-Intel/digitalside-misp-feed/d92a4442-03b3-419b-8250-210cb30cc12f.json
a73ddd6ec22462db955439f665cad4e6 corresponds to: https://osint.digitalside.it/Threat-Intel/digitalside-misp-feed/e928e37d-597f-4c5b-bac6-b1365c337614.json
9b6c3518a91d23ed77504b5416bfb5b3 corresponds to: https://osint.digitalside.it/Threat-Intel/digitalside-misp-feed/78c421fd-0fed-4038-a6ad-ec177477b3fa.json
eec5c6c219535fba3a0492ea8118b397 corresponds to: https://osint.digitalside.it/Threat-Intel/digitalside-misp-feed/2cfee480-f6a0-4b29-a1b8-77f90fa2cbed.json
These four MISP JSON documents are the only ones that are one or more MB in size, considerably larger than the typical report in that collection (a few kB, or 10-20 kB). This suggests the problem is likely reproducible with very large report ingestions.
fbe51695e97a45dc61967dc3241a37dc corresponds to https://osint.digitalside.it/Threat-Intel/digitalside-misp-feed/90fd343c-3631-4d60-b627-1935ef7b4938.json and is 740 kB in size.
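As an aside, if report size really is the trigger, one conceivable mitigation (purely hypothetical; not, to my knowledge, an existing option in the misp-feed connector or the platform) would be to split a large STIX bundle into smaller bundles before queueing it. The helper below is an invented sketch of that idea; note that references between chunks would then surface as missing-reference retries, so it is not a drop-in fix.

```javascript
// Hypothetical helper: split a STIX bundle's objects array into several
// smaller bundles of at most `maxObjects` objects each. The id scheme is
// invented for illustration.
function splitBundle(bundle, maxObjects) {
  const chunks = [];
  for (let i = 0; i < bundle.objects.length; i += maxObjects) {
    chunks.push({
      type: 'bundle',
      id: `${bundle.id}-chunk-${chunks.length}`,
      objects: bundle.objects.slice(i, i + maxObjects),
    });
  }
  return chunks;
}

// A 2500-object bundle split at 1000 objects per chunk yields 3 bundles.
const big = {
  type: 'bundle',
  id: 'bundle--example',
  objects: Array.from({ length: 2500 }, (_, i) => ({ id: `indicator--${i}` })),
};
console.log(splitBundle(big, 1000).length); // 3
```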
I suspect that what's happening in my environment is probably something like this: the misp-feed connector injects the malware reports into the queue for the workers.
I have chosen to address this in my platform by stopping the misp-feed connector I had watching the above DigitalSide.IT MISP folder, and the backlog is slowly clearing now. I plan to leave it disabled until the entire backlog is drained.
Do we think there's a way to possibly use finer-grained locking?
Do we think there's a way to tell the system that, in the event of locking contention encountered, to move to the next work item rather than blocking the worker?
Is there a way for the misp-feed connector to inspect the backlog, in addition to the persistent database, and determine if it is about to inject duplicate work?
Forgive me if the above makes incorrect interpretations about how the system is functioning.
During the stalling condition, the count of "connected workers" in the Ingestion->Connectors view drops to 1, which indicates to me that the workers are stalled, blocked waiting for the locks to free up, and are no longer even reporting heartbeats to the platform.
I suppose a primary step toward resolving these problems is answering the question "Why are these 5 reports failing to ingest properly?", and determining whether that is reproducible on other systems (or not). I suspect it would probably work OK if I gave it more time... so maybe the answer here is to extend the locking retry counts and/or some other timeout values like that, and let the connector ingest again with the higher thresholds?
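Extending the retry thresholds could be prototyped with a generic retry-with-exponential-backoff wrapper around the contended operation. This is a generic sketch of the idea, not the platform's actual retry code; the function name and parameters are invented.

```javascript
// Generic retry helper with exponential backoff: retry `fn` up to
// `retries` times, doubling the delay between attempts, and rethrow the
// last error once the thresholds are exhausted.
async function retryWithBackoff(fn, { retries = 5, baseDelayMs = 100 } = {}) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await fn(attempt);
    } catch (err) {
      if (attempt >= retries) throw err; // thresholds exhausted: give up
      const delayMs = baseDelayMs * 2 ** attempt; // 100, 200, 400, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Example: an operation that only succeeds on its third attempt.
let attempts = 0;
retryWithBackoff(async () => {
  attempts += 1;
  if (attempts < 3) throw new Error('lock contended');
  return 'ingested';
}, { baseDelayMs: 1 }).then((result) => console.log(result)); // prints "ingested"
```

The trade-off is the one already visible in the logs above: longer waits per contended entity mean slower overall drain of the queue, but fewer hard LOCK_ERROR failures.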
Are these reports using 100% CPU at ingestion?
I think so; there's 100% CPU utilization in one of the build/back.js processes while the system is stalling.
Also, I am seeing the following in my logs (when using docker compose logs to view the worker + platform logs together). I have cut the first log line down in the output below, but it actually comes from the opencti/platform service, and it occurs right before the workers start handling another report ingestion and stall again. It consists of a really, really long list of entities (I presume all of the entities consumed from the really large misp-feed report) and fails to write to the database for some reason:
.... "indicator--430b1a34-1624-56f6-b6bc-fb99d8e009f0","ipv4-addr--7ebffc4b-c8c6-55d7-bb69-47489c85314d","indicator--ba436871-745c-53f9-aace-da4b02953a0b","url--28305391-0922-55c9-a20c-81adfb3fcad7","indicator--a5f9b024-5954-54a5-be91-1be6f9f2985f","ipv4-addr--617ffd42-a3a6-5d03-9519-f0372e16634a","indicator--3675b1fe-74a5-51d9-94b8-48b260e48220","url--ef646e34-9ac9-5cf2-90ed-612c92d217cb","indicator--050924c3-b7ee-57a7-ae2e-6c4d9b620665","ipv4-addr--ef6e238e-a441-576a-923f-5357b7f92d01","indicator--b15ab93a-907d-5a66-bafc-32b00a0c22c6","url--5560a825-b68d-564e-9827-92a026e14833","indicator--22857d15-9fb6-5797-a58e-961919c29bff","ipv4-addr--a3777c46-4fb6-5739-8236-a19975d3417d","indicator--871bfcfa-9255-55d8-9ddd-2122427edd34","url--359a51c7-5d11-58a0-83e4-cbfb9056f684","indicator--b975336b-ab17-5337-b320-a275552995f7","ipv4-addr--a0c97011-84ec-5c45-b578-552712641f81","indicator--c74e2e81-db3a-5bb8-a8fc-f580f938a138","url--34d5909b-2876-5909-8451-6e3c9188dc86","indicator--9423489e-08d2-56c1-bb4c-06b3534798d2","ipv4-addr--d788d306-bd53-5598-b1c2-b6308477ad82","indicator--3409c7fd-db3c-5054-9138-6410b780a58f","url--eab4b451-9b99-55ec-b3eb-fd27674aadd6","indicator--973c4043-46fa-51ae-b4f0-a3be6f25300d","ipv4-addr--248b9d49-8544-5d4e-8b2b-778014d71ec8","indicator--553feb71-1d05-5195-b9b1-e2e23630f54f","url--2bf6475e-5a82-5890-b1d4-cd457f407f22","indicator--849f1504-aaf5-5cf8-8dbb-9a8a551631c8","ipv4-addr--a9098db4-67c8-5795-af3b-75453f751ce5","indicator--f41df02f-102d-537d-80e8-f65f8713266a","url--a67be361-2c72-56b9-af8e-954f9fa76a86","indicator--10e6a00c-0efd-5430-9125-fc49bcee7e40","ipv4-addr--95e3720b-4ab7-52d4-a3e4-8e9c39248039","indicator--3f52eba7-0d09-5417-b6fd-cc6b673abe46","url--3c234ddf-1630-5ca6-8432-1ffc8bad8d24","indicator--e6274748-9819-5790-8480-7f01e16e4b24","ipv4-addr--f471e5b6-3275-5343-bb70-14ef71993032","indicator--f65ee92a-b94b-5094-b9ee-03fc69930dd4","url--7032c335-8954-5903-b5c3-11ab0cd0892c","indicator--9ea03abf-4969-5ba8-9b03-922ffe3bff42","ipv4
-addr--f32670ca-111d-5005-ba29-81b469575e77","indicator--b7561b99-a6e3-5d96-90b5-8bcbdafe5188","url--aa2d0d58-da1e-57dd-839f-141e923caed8","indicator--ebb4470b-4de6-55d8-9762-d9c1c8a44a0d","ipv4-addr--97ea2588-bcb9-5703-9075-e751602d33f0","indicator--2deff448-c766-56d5-8a73-300f7eb15fe4","url--19718aa6-e59c-5872-9549-9e5cf74029e7","indicator--0017a894-e4bf-5317-8e30-bd663d926ea7","ipv4-addr--97783a10-7b08-5f66-b819-f31341411f0f","indicator--a0fc8de3-9ebf-5115-b5ee-ff3f346158db","url--39394df5-1e72-5c5f-8d83-16ae1999e1d3","indicator--6524353a-69cb-5b93-8af2-f029b6fa30c4","ipv4-addr--a4c77de9-6bf1-529e-857e-aab8c9feb7dd","indicator--5e14ab82-4321-5e6a-aa9d-4268660582f7","url--0b424b74-4540-50d1-856d-b1fa5b0164d6","indicator--f57e36bc-7ca3-5214-be24-af3acd569a35","ipv4-addr--27a39c75-d256-5032-9ff3-48116ee7c646","indicator--21d4c561-99c4-566d-989d-465f38fe1f38","url--5e91b52f-1ac9-5f07-872b-d27f1b379ccc","indicator--b1760b21-ba6b-52ef-890c-dcd1310481d6","ipv4-addr--0e79572b-94dd-59fc-8cfe-ef3e42e209e8","indicator--ddc70bac-f75d-55f2-8bb4-ecc17d65e4ff","url--64f85b1c-e4e4-5ce9-9271-2498d71c3436","indicator--a4e31391-482c-591d-916c-61a7dc8559e9","ipv4-addr--dc5e7136-49b3-5eb4-9b6b-cbd13777410f","indicator--f975a978-358b-5332-897c-00b7f9368959","url--4ef45f54-6cae-56cc-b671-24501e3137a8","indicator--21ec7e56-f719-5f0f-8d82-9a2b4e227f7e","ipv4-addr--c161e84b-b5de-5987-9b2b-be19b3424683","indicator--a3591e70-9aa1-5b15-9cd2-4b3a57f79f2c","url--c2ebe54c-0930-5975-a7a9-66aea93fcb9e","indicator--a6ad73cc-6464-5f6c-b040-ae9efae90169","ipv4-addr--41ef62e2-9617-5dad-84a2-c247a55c6e24","indicator--d0a3f088-dc47-53d9-a963-31d9a9ff269a","url--0920b3e3-fbf7-5ce6-b423-a3272e44ddd4","indicator--f6c74f53-6cbb-5bcd-851d-472e2b47f19a","ipv4-addr--5c58b2c9-04ef-5481-ae2d-fea8190314ae","indicator--6a192b60-d952-5cd2-8bc9-5d2876ec7964","url--f14fa5b0-b52b-51c7-a7ae-73103c85fb18","indicator--b8e89838-103a-5b88-8d06-f003901c4af9","ipv4-addr--1e6f2f1b-7778-5ae4-8f40-d0233c7c0069","indicator--63f
47d34-c680-5f60-a048-f872ca172ace","url--5e0b640b-6cbc-5361-84d3-ab3381fc1c96","indicator--f6ea5839-1c0b-5bd1-a9c6-4f4bcb4c22b3","ipv4-addr--93d5c645-65e5-5828-8f6f-16f16b48eb26","indicator--aef4236d-013a-580c-9603-778e9328af82","url--d80ef5fc-27d4-5cb0-b91a-c502933a77e7","indicator--4b688dbc-6978-5964-89c0-a07a6c35ec1f","ipv4-addr--bf1413d9-1079-523d-a1e9-97c02daa0621","indicator--65339f3d-d029-5c69-adbf-6f95b19ae7ad","url--04d1be52-c362-59e5-9002-705ff7a5ae7a","indicator--a2fcdea7-fcaf-56d2-b905-639cc679a5b6","ipv4-addr--c4508f4e-6619-56b5-baec-2cc36accb1d7","indicator--e5cd3d1b-2e58-539a-a2a0-5c2a57bf1718","url--c415cd4c-4e9b-58ef-9283-f0844ce1eb6d","indicator--e29a24d8-4818-54cc-b9c2-488fcdeee5ac","ipv4-addr--f0325f38-fbd6-5028-94b4-6e54aa1abf01","indicator--8d88864a-927d-5a34-8cb7-221c19c2cba7","url--fca0ccf7-e3bb-5598-a6c4-f66f18df5438","indicator--e2f3d118-1135-51f7-bc35-cb8fcbe7901d","ipv4-addr--475ef6c5-3bda-5150-8f7f-07c2c6aaae95","indicator--ad338569-eefe-5037-9917-636ba397597c","url--26e56cfe-0c21-5efc-b56b-8af3581f2000","indicator--19756c22-3a43-5ede-84f2-8b731ab5606d","ipv4-addr--8f6da827-eb85-50e3-ab7f-55464817e131","indicator--dcf8f3e1-b8cd-5374-9171-b8b7e085d523","url--00ffedff-f6ae-5bd6-96a4-0cc9c9640738","indicator--6798880a-f26d-565f-bcae-b0b128ddbced","ipv4-addr--ab0aff98-08c1-5ddd-927e-ff9c14ad9706","indicator--6a7679eb-f3cd-5427-a324-2a35d70c6cca","url--d70b518d-451c-5f9a-bd44-13e4571ea828","indicator--e6815325-2040-59cd-b178-a9ff164c0e9f","ipv4-addr--500c857b-0920-5c02-a63a-2242612c10bc","indicator--2671821c-3788-52b7-9c7f-6cd85c45c08e","url--5f8bb273-bd16-5b38-acda-08e6c98123b8","indicator--549f75c9-2b1f-5ead-a20d-9a9660b81565","ipv4-addr--1884c1cc-6f89-5e29-b7e1-162d9bae8e71","indicator--07d415ec-9618-530f-a7dc-3a10574e1dd9","url--1cdd3bf3-7630-55a8-98df-ee6d066870ab","indicator--d404eeef-acc7-5bc8-9d69-27ab8070c765","ipv4-addr--8a70ac3b-403d-5da5-99df-7883c0a5a1ae","indicator--28d73ac1-c160-5a38-aced-4f3c76b7775a","url--91205748-dcc3-5bf6-
999d-31b28b93f3c3","indicator--e879a159-a996-5302-ad9e-2fbd6f910de8","ipv4-addr--258b3959-70cd-5702-8a11-88b23adaa045","indicator--22ae5c26-aceb-5b13-8c89-016182b1e00a","url--294664fd-bd00-5452-befe-d477bffd6b71","indicator--481d9d10-6c01-5182-b9dc-8a8f2184ef6b","ipv4-addr--9c39e097-a31a-587e-a325-4ef9327b5afc","indicator--7e9fe6be-98c5-56ba-be6d-fc55cc0d663f","url--f0d5f38f-ab10-5188-ba4b-fd48654a18f7","indicator--2e2571b8-93fc-520b-a449-d674d5759d4d","ipv4-addr--5d047228-bb75-5618-bd58-30f6975a6473","indicator--8268f156-a1d4-5139-9e05-1117b1550141","url--10fa84ec-c17d-5354-b721-0ebc815bad14","indicator--74fc49c2-6fe1-5500-9b07-5260162581b3","url--f0c543cd-e5c0-5bc3-84d5-e53fbca25ecb","text--da8a6c59-7184-5235-8b03-0033385b02ff"],"published":"2023-06-24T00:00:00.000Z","report_types":["misp-event"],"revoked":null,"stix_id":"report--879fc70e-60d3-559b-bab9-bc55d1d4aab2","update":false,"x_opencti_reliability":null,"x_opencti_stix_ids":null,"x_opencti_workflow_id":null},"unresolvedIds":["indicator--846d17a6-6cc9-50ab-9e81-fe6a5ff40116","indicator--d43d5d42-bf36-55c6-a635-6a8d50cfa04f","indicator--1ea9c640-1f8a-5735-be2c-02b0d98aa74d"]},"message":"Element(s) not found","name":"MISSING_REFERENCE_ERROR","stack":"MISSING_REFERENCE_ERROR: Element(s) not found\n at error (/opt/opencti/build/src/config/errors.js:8:10)\n at MissingReferenceError (/opt/opencti/build/src/config/errors.js:117:48)\n at inputResolveRefsFn (/opt/opencti/build/src/database/middleware.js:746:13)\n at createEntityRaw (/opt/opencti/build/src/database/middleware.js:3151:25)\n at createEntity (/opt/opencti/build/src/database/middleware.js:3279:16)\n at addReport (/opt/opencti/build/src/domain/report.js:122:19)"}],"inner_relation_creation":1,"level":"warn","message":"Element(s) not found","operation":"ReportAdd","operation_query":"mutation ReportAdd($input:ReportAddInput!){reportAdd(input:$input){id standard_id entity_type 
parent_types}}","size":1121675,"time":73028,"timestamp":"2024-04-08T14:47:57.163Z","type":"WRITE_ERROR","user":{"applicant_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","call_retry_number":"2","group_ids":["cc7ea4f2-08f4-458a-98ef-bf4769b70f39","89ce1c73-5a8b-4aef-ba35-ef1744f9bdd4","5adea5e7-d5f3-4b71-b555-c0565fd77d09","b6e28f8d-9aaa-4c37-9e98-0e6f46f4c8d2","241d8779-43c1-4dd6-bd55-a164106ae650"],"ip":"::ffff:172.25.0.31","organization_ids":[],"socket":"query","user_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","user_metadata":{}},"version":"6.0.9"}
worker-3 | {"timestamp": "2024-04-08T14:47:59.420133Z", "level": "INFO", "name": "worker", "message": "Message reprocess", "taskName": null, "attributes": {"tag": 39443, "count": 2}}
worker-3 | {"timestamp": "2024-04-08T14:47:59.442888Z", "level": "INFO", "name": "api", "message": "Importing an object", "taskName": null, "attributes": {"type": "report", "id": "report--879fc70e-60d3-559b-bab9-bc55d1d4aab2"}}
worker-3 | {"timestamp": "2024-04-08T14:47:59.443200Z", "level": "INFO", "name": "api", "message": "Creating Report", "taskName": null, "attributes": {"name": "DigitalSide Malware report: MD5: 9b6c3518a91d23ed77504b5416bfb5b3"}}
worker-4 | {"timestamp": "2024-04-08T14:48:01.454648Z", "level": "INFO", "name": "api", "message": "Report expectation", "taskName": null, "attributes": {"work_id": "work_acef2b25-306a-4bc5-b79d-a1e845e47b95_2024-04-08T14:47:27.695Z"}}
worker-4 | {"timestamp": "2024-04-08T14:48:02.281154Z", "level": "INFO", "name": "worker", "message": "Message processed, thread terminated", "taskName": null}
worker-4 | {"timestamp": "2024-04-08T14:48:02.281372Z", "level": "INFO", "name": "worker", "message": "Message acknowledged", "taskName": null, "attributes": {"tag": 7303}}
Also, the platform is actually consuming incoming data from all of the other connectors fairly promptly. After I made the changes to cluster the front-end, so that the workers no longer talk to the user-visible process, this problem is now limited to a few really large reports from DigitalSide failing to be ingested.
I'm going to update my fork of the docker project to deploy in this manner by default, because it is a much more resilient and responsive approach than using a single platform process.
@ckane can you kindly share that with us?
As we have a large MISP dataset ... my question was indeed how I can 'cluster' this. Many thanks for all of your (and the team's) insights. Regards. JG
I maintain a public fork of the docker repository where my `docker-compose.yml` is populated with a bunch of OSINT connectors and some additional tweaks that I've found useful. I'll follow up later tonight when I update it, but here's a rough idea for you:
```yaml
...
  opencti-int:
    image: opencti/platform:${OPENCTI_VERSION}
    environment:
      - NODE_OPTIONS=--max-old-space-size=8096
      - APP__PORT=8070
      - APP__BASE_URL=${OPENCTI_BASE_URL}
      - APP__ADMIN__EMAIL=${OPENCTI_ADMIN_EMAIL}
      - APP__ADMIN__PASSWORD=${OPENCTI_ADMIN_PASSWORD}
      - APP__ADMIN__TOKEN=${OPENCTI_ADMIN_TOKEN}
      - APP__APP_LOGS__LOGS_LEVEL=info
      - REDIS__HOSTNAME=dragonfly
      - REDIS__PORT=6379
      - REDIS__TRIMMING=100000
      - ELASTICSEARCH__URL=http://elasticsearch:9200
      - ELASTICSEARCH__MAX_CONCURRENCY=50
      - MINIO__ENDPOINT=${S3_ENDPOINT}
      - MINIO__PORT=443
      - MINIO__USE_SSL=true
      - MINIO__USE_AWS_ROLE=true
      - MINIO__ACCESS_KEY=
      - MINIO__SECRET_KEY=
      - MINIO__BUCKET_NAME=${S3_BUCKET}
      - RABBITMQ__HOSTNAME=rabbitmq
      - RABBITMQ__PORT=5672
      - RABBITMQ__PORT_MANAGEMENT=15672
      - RABBITMQ__MANAGEMENT_SSL=false
      - RABBITMQ__USERNAME=${RABBITMQ_DEFAULT_USER}
      - RABBITMQ__PASSWORD=${RABBITMQ_DEFAULT_PASS}
      - SMTP__HOSTNAME=${SMTP_HOSTNAME}
      - SMTP__PORT=25
      - PROVIDERS__LOCAL__STRATEGY=LocalStrategy
    depends_on:
      - dragonfly
      - elasticsearch
      - rabbitmq
    restart: always
    deploy:
      mode: replicated
      replicas: 3
      endpoint_mode: dnsrr
  opencti:
    image: opencti/platform:${OPENCTI_VERSION}
    environment:
      - NODE_OPTIONS=--max-old-space-size=8096
      - APP__PORT=8080
      - APP__BASE_URL=${OPENCTI_BASE_URL}
      - APP__ADMIN__EMAIL=${OPENCTI_ADMIN_EMAIL}
      - APP__ADMIN__PASSWORD=${OPENCTI_ADMIN_PASSWORD}
      - APP__ADMIN__TOKEN=${OPENCTI_ADMIN_TOKEN}
      - APP__APP_LOGS__LOGS_LEVEL=info
      - REDIS__HOSTNAME=dragonfly
      - REDIS__PORT=6379
      - REDIS__TRIMMING=100000
      - ELASTICSEARCH__URL=http://elasticsearch:9200
      - ELASTICSEARCH__MAX_CONCURRENCY=50
      - MINIO__ENDPOINT=${S3_ENDPOINT}
      - MINIO__PORT=443
      - MINIO__USE_SSL=true
      - MINIO__USE_AWS_ROLE=true
      - MINIO__ACCESS_KEY=
      - MINIO__SECRET_KEY=
      - MINIO__BUCKET_NAME=${S3_BUCKET}
      - RABBITMQ__HOSTNAME=rabbitmq
      - RABBITMQ__PORT=5672
      - RABBITMQ__PORT_MANAGEMENT=15672
      - RABBITMQ__MANAGEMENT_SSL=false
      - RABBITMQ__USERNAME=${RABBITMQ_DEFAULT_USER}
      - RABBITMQ__PASSWORD=${RABBITMQ_DEFAULT_PASS}
      - SMTP__HOSTNAME=${SMTP_HOSTNAME}
      - SMTP__PORT=25
      - PROVIDERS__LOCAL__STRATEGY=LocalStrategy
    ports:
      - "8080:8080"
    depends_on:
      - dragonfly
      - elasticsearch
      - rabbitmq
    restart: always
  worker:
    image: opencti/worker:${OPENCTI_VERSION}
    environment:
      - OPENCTI_URL=http://opencti-int.:8070
      - OPENCTI_TOKEN=${OPENCTI_ADMIN_TOKEN}
      - WORKER_LOG_LEVEL=info
    depends_on:
      - opencti
    deploy:
      mode: replicated
      replicas: 6
    restart: always
...
```
Note the following:
- `opencti-int` is the "background service" for workers/connectors to use
- `opencti-int` listens on port `8070`, and does not expose it
- Via the `deploy` configuration on the `opencti-int` replicas, it is possible to tell it to use `dnsrr` to randomly report DNS for the `opencti-int.` domain name (internally) when looked up, which causes workers to connect randomly to all 3 of the configured replicas of `opencti-int`
- The `worker` and `connector-*` services are configured to talk to port `8070`, not port `8080`
- `opencti` is the service dedicated to interactive usage
- `opencti` listens on port `8080` and exposes it too

If you wanted to, you could also cluster the `opencti` front-end (for a lot of end-user traffic); you just (I think) need to use `vip` instead of `dnsrr`, and then have the service listen on ports that are different from what is exposed, because the port exposed on the host side (the one the proxy/CloudFront will connect to) would conflict with all of the internal services.
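The `dnsrr` behavior above can be illustrated with a short Python sketch (hypothetical hostnames; inside the compose network the name would be `opencti-int.` on port `8070`): resolving the service name yields one A record per replica, and each worker simply connects to whichever address it picks.

```python
import random
import socket

def resolve_all(hostname: str, port: int) -> list[str]:
    """Return every IPv4 address DNS publishes for a name.

    Under Docker's dnsrr endpoint mode, a replicated service name
    resolves to one address per running replica.
    """
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # De-duplicate while preserving order.
    return list(dict.fromkeys(info[4][0] for info in infos))

def pick_replica(hostname: str, port: int) -> str:
    """Mimic a worker choosing a backend: any returned address is valid."""
    return random.choice(resolve_all(hostname, port))

# Example with a name that resolves everywhere; in the compose network
# you would pass "opencti-int." and 8070 instead.
print(resolve_all("localhost", 8070))
```

Because resolution happens on each lookup, workers spread themselves across the replicas without needing a load balancer in front of the internal service.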
But, as I said, I'll work up a modification to my config to incorporate both of these changes, as well as an update to the `redis` service in light of that project's restrictive licensing changes (I've found that both `valkey` and `dragonflydb` work really well, the latter having some performance benefits over valkey/redis when tuned appropriately).
@ckane Thank you so much for your insights and help. I think I understood how I can manage our deployment, since we use it on an internal server. I'll look for the updates on your fork too.
Regards. JG
Thanks for all the information @ckane , i will investigate all of this asap
Hi @jasgggit - I was mistaken about how to scale the UI service (`opencti`) horizontally using `docker compose`: it turns out that the only way I could make it work was to make it an internal-only service, and then expose a new `nginx` load balancer that forwards the traffic to it (and uses DNS round-robin, like the `opencti-int` container).
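A minimal `nginx.conf` for such a load balancer might look like the following. This is a sketch under assumptions, not the actual file from the fork: the service name `opencti` and port `8080` come from the compose example earlier in the thread, and `127.0.0.11` is Docker's embedded DNS resolver.

```nginx
events {}

http {
    # Docker's embedded DNS; short TTL so new/restarted replicas are picked up.
    resolver 127.0.0.11 valid=10s;

    server {
        listen 8080;

        location / {
            # Using a variable forces nginx to re-resolve the name at request
            # time, which spreads connections across the dnsrr replicas.
            set $backend "http://opencti.:8080";
            proxy_pass $backend;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }
}
```

The variable-plus-`resolver` trick matters here: a literal hostname in `proxy_pass` is resolved only once at startup, which would pin all traffic to a single replica.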
Here is my fork of https://github.com/OpenCTI-Platform/docker, which handles scaling for both services via the `.env` variables `OPENCTI_FRONT_END_REPLICAS` and `OPENCTI_BACK_END_REPLICAS`, which default to `1` and `3`, respectively.
Note that it requires the `nginx.conf` file to be in the current working directory, and also that this configuration presumes using AWS S3, so if you want the MinIO server, it'll need to be added and both opencti instances will need to be reconfigured.
@jasgggit - I made a small change to my setup again (the repository branch is updated): when the `worker` nodes cause the `opencti-int` nodes to stall, those nodes become unavailable for any connectors. To fix this, I created a separate (third) `opencti/platform` cluster just for the connectors to talk to, with a default replica count of `1`. This way, the connectors will always have an available platform node regardless of what the `worker` nodes are doing.
In my setup, the unavailability of the `opencti-int` nodes was causing the connectors to go red on the Ingestion->Connectors page, and thus often be unable to inject more work. This change fixes that issue.
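In compose terms, that third cluster could look like the sketch below. The service name `opencti-conn`, the port `8071`, and the `connector-example` service are my own placeholders for illustration, not necessarily the names the fork uses; the idea is simply that it mirrors `opencti-int` on its own port, and the connectors point their `OPENCTI_URL` at it instead.

```yaml
services:
  opencti-conn:
    image: opencti/platform:${OPENCTI_VERSION}
    environment:
      # ...same environment as opencti-int, but on a dedicated port...
      - APP__PORT=8071
    deploy:
      mode: replicated
      replicas: 1
      endpoint_mode: dnsrr

  connector-example:   # hypothetical connector service
    environment:
      - OPENCTI_URL=http://opencti-conn.:8071
```

Since workers never talk to this node, a stall caused by heavy worker writes cannot take the connectors' platform endpoint down with it.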
@ckane thank you for that, I'll look into it. Regards. JG
@ckane , CPU problem reproduced, tracked in https://github.com/OpenCTI-Platform/opencti/issues/6638
Description
I'm using the provided docker compose setup. The platform node occasionally hangs, with memory usage rising indefinitely and CPU hovering around 100%. Note that no data ingestion is happening during this.
`strace` gives the following output repeatedly, which makes me believe this is a locking issue:
I found this issue https://github.com/nodejs/node/issues/42277, which seems to be very similar to this case.
The platform logs complain about not being able to acquire Redis locks, along with some `ECONNRESET`'s to RabbitMQ:
I reset the RabbitMQ and Redis states multiple times to no avail.
Environment