MaxwellDPS opened this issue 1 week ago
Do you have any monitoring dashboard on the nodejs process? Memory, CPU, event loop lag, active handles? Thanks
For example, here is the profile of an internal instance with a very large dataset / number of connectors running.
Hey @richard-julien, I do not have this monitoring for node. I have the Prometheus metrics on the API and workers and that is it
NVM, looks like I have them. Can I get the JSON for that dashboard?
Hey Julien, here are the node metrics from the last 3 hours.
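For reference, a dashboard covering those signals usually builds on the standard prom-client default metrics. A minimal sketch of the PromQL panels, assuming the OpenCTI API exposes those defaults and is scraped under a job label such as opencti (both are assumptions, and metric names can vary by prom-client version):

# CPU usage of the node process
rate(process_cpu_seconds_total{job="opencti"}[5m])
# Resident memory of the process
process_resident_memory_bytes{job="opencti"}
# Event loop lag, 99th percentile
nodejs_eventloop_lag_p99_seconds{job="opencti"}
# Active libuv handles by type
sum by (type) (nodejs_active_handles{job="opencti"})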
New symptoms
Containers:
opencti-web:
Ports: 8080/TCP, 14269/TCP
Host Ports: 0/TCP, 0/TCP
State: Running
Started: Mon, 24 Jun 2024 09:15:06 -0700
Last State: Terminated
Reason: Error
Exit Code: 134
Started: Mon, 24 Jun 2024 07:59:41 -0700
Finished: Mon, 24 Jun 2024 09:15:05 -0700
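Exit code 134 is 128 + SIGABRT, which is what Node/V8 raises when it aborts with "JavaScript heap out of memory", so this container status is consistent with the heap OOM described in the issue. As a stopgap while the root cause is isolated, the old-space limit on the web pods can be raised via NODE_OPTIONS; a sketch for the container spec, where the 8192 MiB value is purely illustrative and should stay below the pod memory limit:

env:
  - name: NODE_OPTIONS
    value: "--max-old-space-size=8192"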
Thanks for the dashboard. It looks like there are CPU spikes > 100% along with memory spikes > 2 GB. This kind of behavior can create various kinds of problems. We have been working a lot lately to fix these situations, but it looks like we still have some work to do. It would be helpful to have an idea of the kind of data being ingested during this spike, e.g. a report with millions of object_refs?
For the sync queue: if the queue is not processing, there may be an error that prevents the last message from being processed, so the same message is continuously retried. If you can pause your connectors so that only this queue is alive, that would help isolate the errors.
So on the import side, this is happening when the platform is idle (not just during import). There is no connector that triggers it; just opening the WebUI can trigger it. (The web pods aren't even processing this data; all connectors and workers point to another set of pods.)
We also don't have any reports with >500k refs. There is no activity occurring during these events; I can have a queue of 0 and it still happens. It also happens so often that it is inhibiting our ability to use the platform.
For the sync queue, there are ~1k messages and the count is not climbing, so it doesn't seem to be a connector (or it must be TAXII).
Can you isolate a screen that generates this situation, check the browser's network tab to identify the different calls made by that screen, and then try to isolate the query responsible? Thanks.
Julien, it's happening when nothing is being opened. I'm not sure how else to word this: it's crashing at a dead idle.
The last crash was ~40 min ago. We weren't doing anything on it, no connectors, no workers; it just died.
The only activity would be a single TAXII feed poll with ~4k SCOs at a 10 min interval.
Maybe it comes from a manager. You can try disabling all the managers and reactivating them one by one to isolate the one that produces the CPU / memory spike.
All managers are disabled; this is the current config:
CONNECTOR_MANAGER__ENABLED: "false"
EXPIRATION_SCHEDULER__ENABLED: "false"
HISTORY_MANAGER__ENABLED: "false"
IMPORT_CSV_CONNECTOR__ENABLED: "true"
IMPORT_CSV_CONNECTOR__VALIDATE_BEFORE_IMPORT: "true"
INDICATOR_DECAY_MANAGER__ENABLED: "false"
INGESTION_MANAGER__ENABLED: "false"
ACTIVITY_MANAGER__ENABLED: "false"
NOTIFICATION_MANAGER__ENABLED: "false"
PLAYBOOK_MANAGER__ENABLED: "false"
PUBLISHER_MANAGER__ENABLED: "false"
RETENTION_MANAGER__ENABLED: "false"
RULE_ENGINE__ENABLED: "false"
SYNC_MANAGER__ENABLED: "false"
TASK_SCHEDULER__ENABLED: "false"
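Since every manager is off and the crash still occurs at idle, one way to see what is actually allocating would be to have Node write a heap snapshot just before it hits the limit. A hedged sketch, assuming the image runs a Node version that supports the flag and that the working directory is writable (snapshots can be large):

env:
  - name: NODE_OPTIONS
    value: "--heapsnapshot-near-heap-limit=1"

The resulting .heapsnapshot file can then be copied out with kubectl cp and inspected in the Chrome DevTools Memory tab.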
It just died.
Here is the current resource usage, which looked comparable at the time of the crash. (The opencti-opencti-web pods are the ones doing nothing.)
NAME CPU(cores) MEMORY(bytes)
opencti-elastic-es-leaders-0 799m 3621Mi
opencti-elastic-es-leaders-1 146m 3652Mi
opencti-elastic-es-leaders-2 414m 3805Mi
opencti-elastic-es-data-0 6885m 13868Mi
opencti-elastic-es-data-1 6308m 13807Mi
opencti-elastic-es-data-2 6545m 13699Mi
opencti-minio-5f64757877-dlzgn 2m 516Mi
opencti-opencti-api-7fbbc6b4cd-2lkgg 133m 561Mi
opencti-opencti-api-7fbbc6b4cd-fqbqf 11m 481Mi
opencti-opencti-api-7fbbc6b4cd-trdtv 563m 1265Mi
...
opencti-opencti-web-8559898bf5-8b6g4 398m 736Mi
opencti-opencti-web-8559898bf5-b9mqz 442m 757Mi
opencti-opencti-web-8559898bf5-ct2d7 348m 758Mi
opencti-opencti-worker-c9b55c8df-dcrx8 9m 49Mi
opencti-opencti-worker-c9b55c8df-ntbsl 13m 49Mi
opencti-opencti-worker-c9b55c8df-qtgp2 11m 50Mi
opencti-opencti-worker-c9b55c8df-srfms 7m 49Mi
opencti-opencti-worker-c9b55c8df-vr8k7 12m 49Mi
opencti-rabbitmq-server-0 9m 554Mi
opencti-rabbitmq-server-1 13m 433Mi
opencti-rabbitmq-server-2 19m 508Mi
opencti-redis-master-0 56m 3637Mi
Can you set the log level to DEBUG on the instance that should be purely idle and send the logs to me? Thanks.
Will do, I just changed the level; I'll email the logs when they are available.
Current status of the push_sync queue:
push_sync (vhost: /, node: rabbit@opencti-rabbitmq-server-1.opencti-rabbitmq-nodes +2, type: quorum, features: D Args, state: running)
Consumers: 5 | Consumer capacity: 100%
Messages: Ready 2,025 | Unacked 5 | In memory 0 | Persistent 2,030 | Total 2,030
Message bytes: Ready 1.5 GiB | Persistent 1.5 GiB | Total 1.5 GiB
Message rates: incoming 0.00/s | deliver / get 0.00/s | redelivered 0.00/s | ack 0.00/s
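Given 5 consumers at 100% capacity but 0.00/s deliver and ack rates, it may help to confirm from the broker side that the consumers are actually attached and to see their prefetch/ack details. A sketch using standard rabbitmqctl commands, with the pod name and vhost taken from the output above:

kubectl exec -it opencti-rabbitmq-server-0 -- rabbitmqctl list_queues -p / name messages messages_ready messages_unacknowledged consumers
kubectl exec -it opencti-rabbitmq-server-0 -- rabbitmqctl list_consumers -p /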
Not sure if it's related, but we're seeing a "socket hang up" error on RMQ:
{
"category": "APP",
"errors": [
{
"attributes": {
"genre": "TECHNICAL",
"http_status": 500
},
"message": "socket hang up",
"name": "UNKNOWN_ERROR",
"stack": "UNKNOWN_ERROR: socket hang up\n at error (/opt/opencti/build/src/config/errors.js:8:10)\n at UnknownError (/opt/opencti/build/src/config/errors.js:82:47)\n at Object._logWithError (/opt/opencti/build/src/config/conf.js:235:17)\n at Object.error (/opt/opencti/build/src/config/conf.js:244:48)\n at Object.willSendResponse (/opt/opencti/build/src/graphql/loggerPlugin.js:153:20)\n at processTicksAndRejections (node:internal/process/task_queues:95:5)\n at async Promise.all (index 1)\n at b (/opt/opencti/build/node_modules/apollo-server-core/src/requestPipeline.ts:530:5)\n at processHTTPRequest (/opt/opencti/build/node_modules/apollo-server-core/src/runHttpQuery.ts:437:24)"
},
{
"message": "socket hang up",
"name": "Error",
"stack": "Error: socket hang up\n at Function.Pce.from (/opt/opencti/build/node_modules/axios/lib/core/AxiosError.js:89:14)\n at dx.handleRequestError (/opt/opencti/build/node_modules/axios/lib/adapters/http.js:610:25)\n at dx.emit (node:events:519:28)\n at ClientRequest.lyn.<computed> (/opt/opencti/build/node_modules/follow-redirects/index.js:38:24)\n at ClientRequest.emit (node:events:519:28)\n at Socket.socketOnEnd (node:_http_client:524:9)\n at Socket.emit (node:events:531:35)\n at endReadableNT (node:internal/streams/readable:1696:12)\n at processTicksAndRejections (node:internal/process/task_queues:82:21)\n at xyn.request (/opt/opencti/build/node_modules/axios/lib/core/Axios.js:45:41)\n at processTicksAndRejections (node:internal/process/task_queues:95:5)\n at metricApi (/opt/opencti/build/src/database/rabbitmq.js:116:22)\n at getMetrics (/opt/opencti/build/src/domain/rabbitmqMetrics.js:7:17)"
}
],
"inner_relation_creation": 0,
"level": "error",
"message": "socket hang up",
"operation": "Unspecified",
"query_attributes": [
[
{
"arguments": [],
"name": "rabbitMQMetrics"
}
]
],
"size": 2,
"source": "backend",
"time": 47,
"timestamp": "2024-06-24T20:38:26.295Z",
"type": "READ_ERROR",
"user": {
"group_ids": [
"UUID",
"UUID"
],
"ip": "192.168.6.164",
"organization_ids": [
"828b5f70-eda2-4eb6-9879-ed5d0b5afe42"
],
"referer": "https://<REM>/dashboard/data/ingestion/connectors",
"socket": "query",
"user_id": UUID,
"user_metadata": {}
},
"version": "6.1.6"
}
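That error is the rabbitMQMetrics GraphQL resolver (triggered by the ingestion/connectors screen shown in the referer) failing when the platform's axios call to the RabbitMQ management HTTP API has its connection dropped, so it concerns the management API rather than the AMQP consumers. It may be worth checking that the management API responds promptly from a web pod; a hedged sketch, where the service name, port, and credential variables are assumptions for illustration:

curl -sf -u "$RABBITMQ_USER:$RABBITMQ_PASSWORD" http://opencti-rabbitmq:15672/api/overview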
So testing: the CSV import is hosed, it will not do anything. All messages in the push_sync queue are stalled; none are moving, even after a full clear and restart.
Is there a specific log you are looking for here, @richard-julien?
I'm looking for what the activity could be on this instance that should be idle (the kinds of queries, to find the service that generates the activity).
Yeah, so the only thing running would be a query to a TAXII feed every 10 min. The other thing I'm noticing is that the CSV import comes and goes; it doesn't seem to be running reliably.
It also looks like none of the consumers on push_sync are acking messages.
Description
JavaScript heap out of memory OOMs are plaguing the uptime and usability of 6.1.X. This seems linked to the import of data.
Environment
Reproducible Steps
Steps to create the smallest reproducible scenario:
None known. It has been continuous since the 6.1.X upgrade and seems to be far worse when data is being imported.
RabbitMQ is showing a backlog of ~1k messages in the push_sync queue that are going nowhere. (Connector queues are dropping)
Redis memory is spiking at the times this is happening
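To quantify the Redis spike and see whether it comes from a few large keys or from the event stream growing, the standard redis-cli introspection commands can be run against the master pod; a sketch, where the pod name matches the listing above and the stream key name is an assumption:

kubectl exec -it opencti-redis-master-0 -- redis-cli info memory
kubectl exec -it opencti-redis-master-0 -- redis-cli --bigkeys
kubectl exec -it opencti-redis-master-0 -- redis-cli xlen stream.opencti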
Expected Output
No platform crashes.
Actual Output
Additional information
Logs from startup up to the crash