OpenCTI-Platform / connectors


ESET connector causing runaway Redis memory consumption despite REDIS__TRIMMING=1000000 #1834

Closed - ckane closed this issue 4 months ago

ckane commented 7 months ago

Description

When enabling the external-import/eset connector with defaults (from external-import/eset/docker-compose.yml), it causes Redis to rapidly consume all available memory. I have REDIS__TRIMMING=1000000 set in my docker-compose.yml and it doesn't seem to make a difference for this particular connector, although it appears to be working for everything else.

The connector-eset section of my docker-compose.yml is lifted almost verbatim from the example provided in the connector's directory:

  connector-eset:
    image: opencti/connector-eset:5.12.32
    environment:
      - OPENCTI_URL=http://opencti-url
      - OPENCTI_TOKEN=${OPENCTI_ADMIN_TOKEN}
      - CONNECTOR_ID=${ESET_CONNECTOR_ID}
      - "CONNECTOR_NAME=ESET Cyber Threat Intelligence"
      - CONNECTOR_SCOPE=eset
      - CONNECTOR_CONFIDENCE_LEVEL=30 # From 0 (Unknown) to 100 (Fully trusted)
      - CONNECTOR_LOG_LEVEL=error
      - ESET_API_URL=eti.eset.com
      - ESET_USERNAME=${ESET_CONNECTOR_USER}
      - ESET_PASSWORD=${ESET_CONNECTOR_PASSWORD}
      - ESET_COLLECTIONS=ei.misp,ei.cc,ei.urls
      - ESET_IMPORT_APT_REPORTS=true
      - ESET_IMPORT_START_DATE=2022-04-01
      - ESET_CREATE_OBSERVABLES=true
      - ESET_INTERVAL=60 # Required, in minutes
    restart: always

Environment

  1. OS (where OpenCTI server runs): Amazon Linux 2023
  2. OpenCTI version: 5.12.32
  3. OpenCTI client: connector
  4. Other environment details: Using Docker CE and docker-compose

Reproducible Steps

Steps to create the smallest reproducible scenario:

  1. Add connector-eset in my docker-compose.yml
  2. Update .env with proper credentials for access
  3. docker-compose up -d
  4. Wait a while; eventually Redis grows to consume all RAM (in my case, it got to around 12GB of usage)

Expected Output

Redis shouldn't consume all memory, and REDIS__TRIMMING=.... should keep it within a reasonably predictable ceiling. With trimming set to 1000000, Redis resident memory otherwise seems to stay under 2GB.

Actual Output

Redis memory consumption grows without restraint until the system runs out of memory and the OOM killer has to kill something.

Additional information

Here is a docker-compose.yml that seems to be working well for me with REDIS__TRIMMING=2000000: https://github.com/ckane/opencti-docker/blob/tf-main/docker-compose.yml

In the deployment I am trying to use it in, I reduced this to 1000000 to see if it would fix the problem, but it doesn't seem to have any impact at all.

In this configuration, Redis memory consumption stays under control, but if I add connector-eset to it, then once the ESET connector starts ingesting intel from their feed, Redis consumption grows rapidly until it exhausts all RAM in the system.
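
For reference, this is roughly how I've been checking whether trimming is actually being applied and how much memory Redis is using (a rough sketch using the redis-py client; the host/port are assumptions for a local test):

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# The stream length should hover around the configured REDIS__TRIMMING value
print("stream.opencti entries:", r.xlen("stream.opencti"))

# Overall memory picture from INFO MEMORY
mem = r.info("memory")
print("used_memory_human:", mem["used_memory_human"])
print("used_memory_peak_human:", mem["used_memory_peak_human"])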

nino-filigran commented 6 months ago

@Megafredo & @helene-nguyen could you maybe have a look at this?

Megafredo commented 6 months ago

Hi @ckane, can you give me more details about what happens when you use this connector: error logs, or simply whether you have works in progress or completed in your queues? Have you tried changing the import start date in the config, for example import_start_date: '2024-01-01'?

ckane commented 6 months ago

Thanks. When it occurs, there don't seem to be any errors registering in the platform - I just see the ESET connector working normally while Redis usage keeps climbing until it consumes all available RAM, a behavior I don't encounter with any of the other connectors when the trimming config is set.

I haven't tested recently since 6.0.0 was released...I'll update our dev to 6.0.3 and re-enable the eset connector and see if any recent changes cause it to behave differently, or if I can get more data points for you.

I will also try setting a more recent cut-off date to see how that changes things, but that solution isn't ideal: I don't need to do that for any other connectors, and I would like to be able to populate our database with more historical data from the vendor.

Megafredo commented 6 months ago

So from what you have observed it may not be an error, but maybe we can get more information in "debug" mode about what is happening with the connector. Can you change the configuration to set CONNECTOR_LOG_LEVEL=debug and provide me with the logs?

ckane commented 6 months ago

So, I am able to reproduce the issue on 6.0.5 and was able to catch the system in the middle of the runaway memory usage. It looks like the ESET connector has finished collecting a large number of bundles from upstream, as its logging now reports Iterating with collection=.../Poll_Request/Poll_Response. The bundles are all being processed by the workers, and while they are being processed the memory consumption of redis-server increases at a rate of about 100MB every 5 minutes or so; the workers appear to be doing all the work at this point.

In the worker logs I am seeing the following error happen a lot, but I don't know if it is related to the root cause or coincidental:

Traceback (most recent call last):
  File "/opt/opencti-worker/worker.py", line 268, in data_handler
    self.api.stix2.import_bundle_from_json(
  File "/usr/local/lib/python3.12/site-packages/pycti/utils/opencti_stix2.py", line 215, in import_bundle_from_json
    return self.import_bundle(
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pycti/utils/opencti_stix2.py", line 2357, in import_bundle
    self.import_relationship(item, update, types)
  File "/usr/local/lib/python3.12/site-packages/pycti/utils/opencti_stix2.py", line 1223, in import_relationship
    stix_relation_result = self.opencti.stix_core_relationship.import_from_stix2(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pycti/entities/opencti_stix_core_relationship.py", line 1132, in import_from_stix2
    return self.create(
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pycti/entities/opencti_stix_core_relationship.py", line 611, in create
    result = self.opencti.query(
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pycti/api/opencti_api_client.py", line 351, in query
    raise ValueError(
ValueError: {'name': 'MISSING_REFERENCE_ERROR', 'message': 'Element(s) not found'}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/opencti-worker/worker.py", line 268, in data_handler
    self.api.stix2.import_bundle_from_json(
  File "/usr/local/lib/python3.12/site-packages/pycti/utils/opencti_stix2.py", line 215, in import_bundle_from_json
    return self.import_bundle(
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pycti/utils/opencti_stix2.py", line 2357, in import_bundle
    self.import_relationship(item, update, types)
  File "/usr/local/lib/python3.12/site-packages/pycti/utils/opencti_stix2.py", line 1223, in import_relationship
    stix_relation_result = self.opencti.stix_core_relationship.import_from_stix2(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pycti/entities/opencti_stix_core_relationship.py", line 1132, in import_from_stix2
    return self.create(
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pycti/entities/opencti_stix_core_relationship.py", line 611, in create
    result = self.opencti.query(
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pycti/api/opencti_api_client.py", line 351, in query
    raise ValueError(
ValueError: {'name': 'MISSING_REFERENCE_ERROR', 'message': 'Element(s) not found'}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/opencti-worker/worker.py", line 268, in data_handler
    self.api.stix2.import_bundle_from_json(
  File "/usr/local/lib/python3.12/site-packages/pycti/utils/opencti_stix2.py", line 215, in import_bundle_from_json
    return self.import_bundle(
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pycti/utils/opencti_stix2.py", line 2357, in import_bundle
    self.import_relationship(item, update, types)
  File "/usr/local/lib/python3.12/site-packages/pycti/utils/opencti_stix2.py", line 1223, in import_relationship
    stix_relation_result = self.opencti.stix_core_relationship.import_from_stix2(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pycti/entities/opencti_stix_core_relationship.py", line 1132, in import_from_stix2
    return self.create(
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pycti/entities/opencti_stix_core_relationship.py", line 611, in create
    result = self.opencti.query(
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pycti/api/opencti_api_client.py", line 351, in query
    raise ValueError(
ValueError: {'name': 'MISSING_REFERENCE_ERROR', 'message': 'Element(s) not found'}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/opencti-worker/worker.py", line 268, in data_handler
    self.api.stix2.import_bundle_from_json(
  File "/usr/local/lib/python3.12/site-packages/pycti/utils/opencti_stix2.py", line 215, in import_bundle_from_json
    return self.import_bundle(
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pycti/utils/opencti_stix2.py", line 2357, in import_bundle
    self.import_relationship(item, update, types)
  File "/usr/local/lib/python3.12/site-packages/pycti/utils/opencti_stix2.py", line 1223, in import_relationship
    stix_relation_result = self.opencti.stix_core_relationship.import_from_stix2(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pycti/entities/opencti_stix_core_relationship.py", line 1132, in import_from_stix2
    return self.create(
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pycti/entities/opencti_stix_core_relationship.py", line 611, in create
    result = self.opencti.query(
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pycti/api/opencti_api_client.py", line 351, in query
    raise ValueError(
ValueError: {'name': 'MISSING_REFERENCE_ERROR', 'message': 'Element(s) not found'}

ckane commented 6 months ago

These seem to occur after the worker says it is forging a new stix_core_relationship - at least in this section, it is trying to create a new related-to relationship between an Identity and an Indicator.

ckane commented 6 months ago

I was also able to verify that the memory consumption in Redis continues to increase even after I have stopped the connector-eset container. Clearing the works using the UI doesn't seem to change anything, and the connector still says it is active with 330k+ bundles pending.
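
To cross-check the pending-bundle count shown in the UI, something like this against the RabbitMQ management API works (rough sketch; the management URL and guest/guest credentials are assumptions and will differ per deployment):

import requests

# List every queue with its backlog, largest first
resp = requests.get("http://localhost:15672/api/queues", auth=("guest", "guest"), timeout=30)
resp.raise_for_status()
for q in sorted(resp.json(), key=lambda q: q.get("messages", 0), reverse=True):
    print(f'{q.get("messages", 0):>10}  {q["name"]}')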

ckane commented 6 months ago

After stopping the connector and waiting for it to turn "red", I was able to click the "clear connector" button, which did halt the growth in memory consumption (thus confirming my suspicion about the culprit), but clicking that button doesn't appear to clear out whatever has accumulated in Redis.

ckane commented 6 months ago

I also see the Message reprocess message in the logs, ahead of the aforementioned error, suggesting this block of code around line 355 in worker.py is being hit a lot:

             elif (
                 "MISSING_REFERENCE_ERROR" in error_msg
                 and self.processing_count < PROCESSING_COUNT
             ):
                 bundles_missing_reference_error_counter.add(1)
                 # In case of missing reference, wait & retry
                 sleep_jitter = round(random.uniform(1, 3), 2)
                 time.sleep(sleep_jitter)
                 self.worker_logger.info(
                     "Message reprocess",
                     {"tag": delivery_tag, "count": self.processing_count},
                 )
                 self.data_handler(connection, channel, delivery_tag, data)
             elif "MISSING_REFERENCE_ERROR" in error_msg:

I did save the logs, so I should be able to do some text parsing to determine how many retries are being attempted on each entity pair in the relationship, and how long these missing references are outstanding for.
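
Something along these lines should do for the tallying (rough sketch; worker.log is a hypothetical file captured from docker logs, and the exact log format may differ):

import re
from collections import Counter

# Count "Message reprocess" events, bucketed by the retry "count" field
retries = Counter()
with open("worker.log") as fh:
    for line in fh:
        if "Message reprocess" not in line:
            continue
        m = re.search(r'"count":\s*(\d+)', line)
        if m:
            retries[int(m.group(1))] += 1

print("total reprocess events:", sum(retries.values()))
for count, occurrences in sorted(retries.items()):
    print(f"retry #{count}: {occurrences} messages")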

ckane commented 6 months ago

Curious whether these exceptions may be causing OpenCTI to skip the trimming step on the Redis database, or something like that.

ckane commented 6 months ago

Also came across the following error showing up in the connector-eset logs:

Traceback (most recent call last):
  File "/opt/opencti-connector-eset/eset.py", line 221, in _import_collection
    self.helper.send_stix2_bundle(
  File "/usr/local/lib/python3.11/site-packages/pycti/connector/opencti_connector_helper.py", line 1140, in send_stix2_bundle
    bundles = stix2_splitter.split_bundle(bundle, True, event_version)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pycti/utils/opencti_stix2_splitter.py", line 90, in split_bundle
    self.enlist_element(item["id"], raw_data)
  File "/usr/local/lib/python3.11/site-packages/pycti/utils/opencti_stix2_splitter.py", line 40, in enlist_element
    nb_deps += self.enlist_element(element_ref, raw_data)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pycti/utils/opencti_stix2_splitter.py", line 54, in enlist_element
    nb_deps += self.enlist_element(value, raw_data)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pycti/utils/opencti_stix2_splitter.py", line 40, in enlist_element
    nb_deps += self.enlist_element(element_ref, raw_data)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pycti/utils/opencti_stix2_splitter.py", line 54, in enlist_element
    nb_deps += self.enlist_element(value, raw_data)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

..... the frames at lines 40 & 54 repeat a whole lot .....

  File "/usr/local/lib/python3.11/site-packages/pycti/utils/opencti_stix2_splitter.py", line 54, in enlist_element
    nb_deps += self.enlist_element(value, raw_data)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pycti/utils/opencti_stix2_splitter.py", line 40, in enlist_element
    nb_deps += self.enlist_element(element_ref, raw_data)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pycti/utils/opencti_stix2_splitter.py", line 43, in enlist_element
    elif key.endswith("_ref") and self.is_ref_key_supported(key):
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pycti/utils/opencti_stix2_splitter.py", line 21, in is_ref_key_supported
    if pattern.match(key):
       ^^^^^^^^^^^^^^^^^^
RecursionError: maximum recursion depth exceeded while calling a Python object

It looks like this happens exactly once. Otherwise, there are only 4 more errors in my connector's log, and all of them have to do with connectivity issues (likely just temporary network or service interruption).

ckane commented 6 months ago

@Megafredo I am now running into the redis memory consumption behavior on another instance using entirely OSINT connectors, where I am able to collect the data more easily to help determine root causes, etc...

Memory usage for stream.opencti seems like it stays around 5-6GB:

redis-cli memory usage stream.opencti
(integer) 5319923328

And, it does look like trimming is working correctly on it (keeping it at 2000000 entries)

Sampled 464565 keys in the keyspace!
Total key length in bytes is 29935559 (avg len 64.44)

Biggest   hash found '"work_Hatching_Triage_Sandbox_Auto_2024-02-15T05:56:56.212Z"' has 3 fields
Biggest string found '"sess:XP5B5ZvQhttlL7ox8X2OD52Gwxa6If5e"' has 33259 bytes
Biggest stream found '"stream.opencti"' has 2000000 entries
Biggest   zset found '"platform_sessions"' has 343 members

Dumping all keys says that there are 464567 keys in my redis instance:

redis-cli keys \* | wc -l
464567

I am currently iterating through them 1 by 1 to get each individual size, and I'll try to get a summary of this data over to you later
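
The loop is essentially this (a sketch with redis-py rather than redis-cli; the host/port and output file name are whatever fits your deployment):

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# One "size key" line per key, suitable for sort -n / awk post-processing
with open("redis_keys.txt", "w") as out:
    for key in r.scan_iter(count=1000):
        size = r.memory_usage(key)     # bytes, including per-key overhead
        if size is not None:           # key may disappear between SCAN and MEMORY USAGE
            out.write(f"{size} {key}\n")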

ckane commented 6 months ago

So, I did some experiments where I used the MEMORY USAGE <key> on each of the keys in my redis instance to produce a table such as the following, with the size (in bytes) followed by the key name (we will call this table redis_keys.txt):

...
216 work_Hatching_Triage_Sandbox_Auto_2024-02-15T02:40:02.830Z
216 work_Hatching_Triage_Sandbox_Auto_2024-02-15T02:26:07.294Z
232 work_843bded1-b064-4269-99fb-b48b8e61c4bf_2024-02-17T12:53:21.313Z
232 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-28T07:35:53.427Z
232 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-03-05T01:03:52.765Z
232 work_b0d88b1a-9b5e-4400-83a0-239cc2cc6dbd_2024-02-28T01:32:43.860Z
232 work_b0d88b1a-9b5e-4400-83a0-239cc2cc6dbd_2024-02-24T21:21:33.871Z
232 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-15T03:10:28.923Z
232 work_b0d88b1a-9b5e-4400-83a0-239cc2cc6dbd_2024-02-17T21:43:33.292Z
232 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-21T23:10:41.025Z
...

This yielded the following when I piped it to sort -n (showing just the tail of it here):

24664 sess:wxhlVhWmR3gsoXhriLsAuUjRvdcOaIyS
24664 sess:xCvrBGj76182cBrpwrVMB21g2SO9wxe8
24664 sess:xkZdJzPOWeQUp21V67Fy2ccfxLknJc9w
24664 sess:xqcan--juh6j5vZ1WzL_fvt1J2triAma
24664 sess:xtBqtgMJU44mriU8ggAIskgeDwCmOmt5
24664 sess:yDKxZVJ_je4hrNRRhZR0Km_qumGhg4Ws
24664 sess:yGSjcIZe6QMVbMz9nqEUMJ4g0HU9Ej--
24664 sess:yO-1xBSZaW7eCOeCoDjgzkOZ_5XoTJX4
24664 sess:yVqd7YeS_GrolN3juJRNhRkMR1w7eNoD
24664 sess:yZij54dVUrCBRWdCIjUbA9ZAI9KY18MJ
24664 sess:yb4uQtSvYApZ4du5cWZbx_ko8Wm_ywG_
24664 sess:yy01MNbLh1hmbkGQTAKo8i9DNH7GuKhN
24664 sess:z2Jh1iJ4sxoKzj8yCWaPekJFfe_vTtIG
24664 sess:zEy_vH3q706GC3sMSxNyfYiYejeoyRov
24664 sess:zHB2yYgk-L6G_60QtLBEDDF1n8zQiUvh
24664 sess:zkcIOp_J_N6yl0C5sxyq7IkkVGwPYpAX
41048 sess:XP5B5ZvQhttlL7ox8X2OD52Gwxa6If5e
44417 platform_sessions
20222660 stream.activity
5747354268 stream.opencti

In the above you can see that there are a bunch of sessions of ~24KB in size, but the biggest individual consumers are stream.activity at ~20MB of usage, and stream.opencti at ~5.7GB of usage.

However, if I do a MEMORY STATS in redis-cli, it gives the following usage report:

...
25) "keys.count"
26) (integer) 464903
27) "keys.bytes-per-key"
28) (integer) 29048
29) "dataset.bytes"
30) (integer) 13481932644
...

The above indicates that 464903 keys are consuming ~13.5GB of memory.

Curious, I piped the redis_keys.txt table through wc -l to get the count of keys I had generated earlier (note the slight discrepancy is largely due to some time passing between running the commands):

wc -l < redis_keys.txt
464589

Also wanting to measure the sum of data in each of the keys, I did cat output.txt | cut -d' ' -f 1 | awk '{s+=$1} END {printf "%.0f\n", s }':

5882285305

This indicates that the total data contained within the keys is roughly 5.9GB. I definitely anticipate there would be some amount of per-key overhead, but this suggests something like a ~16KB average per-key overhead, which seems high to me.

ckane commented 6 months ago

To get summary counts of the work* keys (disregarding the timestamps), I ran cut -d' ' -f 2 redis_keys.txt | grep work | sed 's/_[^_]*$//' | sort -n | uniq -c | sort -n, which produces the following:

      1 work:027aa3b2-7329-45bc-b61d-c2fd7040a805
      1 work:25fedf57-2a77-4e14-b9de-bc3775e4ac7c
      1 work:33002c41-4b4b-4642-b8f7-2006f2ac2979
      1 work:37a2a506-17ca-47dc-96f0-33764f4754db
      1 work:3e1112af-06ec-4d9d-87c3-8e5664df55b8
      1 work:4cff7afc-90ba-420f-b8ee-8c4baffeae7e
      1 work:6133a1fd-bd5c-4726-85bb-b17e7007d093
      1 work:6feff0e9-86de-45cc-a0a3-082c3ef23cc1
      1 work:72327164-0b35-482b-b5d6-a5a3f76b845f
      1 work:746a3cfd-e891-4fe6-9b6f-6d4a94c3620f
      1 work:843bded1-b064-4269-99fb-b48b8e61c4bf
      1 work:8623972e-459d-47e3-84c7-ef269ed89d75
      1 work:883fbe6c-4040-48c0-bc3c-398092dbf6b5
      1 work:8ada702d-a621-4da4-905a-2f5f7f62f907
      1 work:Hatching
      1 work:acef2b25-306a-4bc5-b79d-a1e845e47b95
      1 work:b0d88b1a-9b5e-4400-83a0-239cc2cc6dbd
      1 work:cfcec2c7-7ea6-427c-b4a3-281683e9fb10
      1 work:d246bd1e-8d1b-4739-9187-8e05fa36e79c
      1 work:ed49e47d-a06a-4d4d-aa94-06dcf584f071
      1 work:f7052268-95f0-49f2-9e70-42c8a87f2fd2
      1 work:fc0eb101-1eef-4348-8953-a6bcda869c95
      1 work:fe54a9c7-749e-499f-9298-3f977a7d3327
      1 work_37a2a506-17ca-47dc-96f0-33764f4754db
      1 work_6feff0e9-86de-45cc-a0a3-082c3ef23cc1
      1 work_d246bd1e-8d1b-4739-9187-8e05fa36e79c
      1 work_ed49e47d-a06a-4d4d-aa94-06dcf584f071
      2 work_027aa3b2-7329-45bc-b61d-c2fd7040a805
      3 work_6133a1fd-bd5c-4726-85bb-b17e7007d093
      3 work_8623972e-459d-47e3-84c7-ef269ed89d75
      4 work_4cff7afc-90ba-420f-b8ee-8c4baffeae7e
      7 work_cfcec2c7-7ea6-427c-b4a3-281683e9fb10
      8 work_883fbe6c-4040-48c0-bc3c-398092dbf6b5
     10 work_72327164-0b35-482b-b5d6-a5a3f76b845f
     58 work_fc0eb101-1eef-4348-8953-a6bcda869c95
    807 work_acef2b25-306a-4bc5-b79d-a1e845e47b95
   3059 work_f7052268-95f0-49f2-9e70-42c8a87f2fd2
   4053 work_843bded1-b064-4269-99fb-b48b8e61c4bf
  18500 work_25fedf57-2a77-4e14-b9de-bc3775e4ac7c
  89189 work_Hatching_Triage_Sandbox_Auto
  90471 work_b0d88b1a-9b5e-4400-83a0-239cc2cc6dbd
 257942 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f

The 257942 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f keys represent the "Google DNS" connector, which I have auto-enriching in this instance, but the "Connectors" view shows a 0 in the "MESSAGES" column for that particular connector. This instance did just recently pull in a whole bunch of misp-feed data from DIGITALSIDE.IT.

In the other instance that's consuming from ESET, I do not have any of these auto-enrichments turned on.

ckane commented 6 months ago

I decided to narrow in on the work listed above with the 257942 keys, and build a summary count table that includes the date from the timestamp, but not the time:

cat output.txt | grep work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f | cut -d' ' -f 2 | sort | cut -dT -f 1 | uniq -c | sort -n
...
    130 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-11-02
    133 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-12-11
    136 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-09-28
    139 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-01-18
    148 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-01-27
    154 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-11-21
    160 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-05-11
    161 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-05-08
    162 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-11-10
    166 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-01-17
    167 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-01-31
    168 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-12-12
    170 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-10
    176 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-11-07
    180 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-01-26
    186 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-01-29
    187 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-10-26
    189 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-09-11
    189 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-01-21
    205 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-01-16
    207 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-03-04
    226 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-09
    228 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-11-23
    235 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-12-08
    239 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-03-03
    251 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-05-09
    265 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-12-21
    272 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-05-10
    300 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-12-20
    304 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-11-16
    341 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-01
    381 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-07
    389 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-12-01
    392 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-03
    415 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-12-16
    430 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-01-22
    460 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-05
    502 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-13
    537 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-03-06
    545 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-07-25
    573 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-05-07
    585 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-01-25
    630 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-06
    684 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-12-05
    703 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-08
    731 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-07-26
    732 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-03-07
    791 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-01-23
    880 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-03-05
    948 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-03-01
   1219 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-04
   1437 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-03-02
   1538 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-03-08
   1587 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-01-30
   1608 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-05-06
   2017 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-14
   2489 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-26
   3334 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-02
   3719 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-11-12
   4334 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-01-19
   4997 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-20
   5108 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-23
   5112 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-22
   6251 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-21
   6428 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-25
   6572 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-11-11
   8399 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-29
  10062 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-27
  12296 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-16
  15521 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-19
  16708 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-24
  22944 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-15
  25492 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-17
  33717 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-18
  34154 work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2024-02-28

Many of these are fairly recent, but there also appear to be a considerable number that go back many months, with the earliest example I have in my system being work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_2023-05-06. I'm wondering if certain failure states are maybe leaving these keys abandoned in redis never to get cleaned up.

Since this is merely a testing instance, I'm going to snapshot and try clearing out some older ones, and see if that changes the redis memory utilization.
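
For the cleanup, I'm planning something along these lines (rough sketch; the key pattern and cutoff date are just examples, and ISO timestamps compare correctly as plain strings):

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

CUTOFF = "2024-01-01"   # hypothetical cutoff: drop work keys older than this

deleted = 0
for key in r.scan_iter(match="work_746a3cfd-e891-4fe6-9b6f-6d4a94c3620f_*", count=1000):
    timestamp = key.rsplit("_", 1)[-1]   # e.g. 2023-05-06T12:34:56.789Z
    if timestamp < CUTOFF:
        deleted += r.unlink(key)         # non-blocking delete

print("deleted", deleted, "keys")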

Jipegien commented 6 months ago

@Megafredo Do you have enough information to troubleshoot? You may want to reach out to the platform team for some help regarding Redis.

Megafredo commented 6 months ago

First of all, I wanted to thank you @ckane for your research! Since the connector is in the support: partners category, I don't really have all the necessary access to reproduce it on my end, so any feedback is welcome!

Yes, one of the best methods for a "clean" deletion of the connector is to stop it, wait for it to turn red, then delete it; this should clean everything in RabbitMQ that is associated with that connector (you can find all the information about what is waiting for the connector in RabbitMQ Management, in the "Queues and Streams" section).

"but clicking this button to clear the connector doesn't seem to clear whatever has accumulated in redis", is this still the case?


'MISSING_REFERENCE_ERROR', 'message': 'Element(s) not found'

This error occurs when the platform cannot establish the relationship between an entity A and an entity B, which can happen when either entity A or entity B is incorrect or simply does not exist. In our case, I don't think there is any effect on memory.


Do you have this error? "REDIS: Failed to extend resource"

If so, the logic behind "failed to extend resource" is that the event loop of the ingest or backend Node.js process is too slow (likely overloaded by one of its tasks) to re-extend a lock in time. The OpenCTI platform and resource managers use locks in Redis with a TTL of 30s; if the process holding the lock does not extend it within that 30-second lifespan and makes another request afterwards, the "failed to extend resource" error appears because the lock is no longer valid.

If the problem comes from this, we could try increasing the pause time in the connector between each ingestion, creating more workers to increase ingestion speed (less recommended), or the test I mentioned with a more recent start date in the config, to see whether memory still grows when the data set is smaller.


"RecursionError: maximum recursion depth exceeded when calling a Python object"

The error you have raised may indeed come from the connector; it will have to be reworked to replace the recursion with simple loops, but further investigation is needed in this direction.

ckane commented 6 months ago

"but clicking this button to clear the connector doesn't seem to clear whatever has accumulated in redis", is this still the case?

Yes, after the connector was cleared, Redis still seems to keep all of the memory allocated and doesn't release it. This allocation appears to get committed to disk in dump.rdb as well, since stopping and restarting Redis doesn't fix it.

ckane commented 6 months ago

Since the "redis grows uncontrollably" problem has been raised a few times by others that I've talked to (particularly in Slack), I'm curious if there's any interest in looking into any alternatives to it.

There is one project, Dragonfly DB, that offers a Redis-compatible interface (though it lacks some newer features from Redis 6/7 and isn't compatible with newer dump.rdb files for storage migration).

While researching, I have been able to successfully stand up a new OpenCTI instance replacing the redis section with the following (and adjusting opencti configuration appropriately):

  dragonfly:
    image: docker.dragonflydb.io/dragonflydb/dragonfly:v1.15.0
    restart: always
    command:  --dbfilename dump --port 6379 --logtostderr --keys_output_limit 1000000
    ulimits:
      memlock: -1
    volumes:
      - dfdata:/data

Largely, it seems to be working for me on this new test instance. DragonflyDB claims to be higher-performance and to have a lower memory footprint, as well as to do a better job of returning memory to the OS and managing its memory.

The challenge is that any migration from Redis to Dragonfly has to be done manually as a "live" migration using some specialized tools. I have a Python script, cloned from someone else's repository and then modified, which successfully migrated everything except for stream.* keys; I think those really just need the stream entries migrated in chunks, rather than the binary migration that the rest of the key types support.

I would be interested, if you happen to have automated testing & data population available, in having you test replacing the redis service with the dragonfly service above, to see if it results in any improvements or regressions.
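
For the stream.* keys, the chunked approach I have in mind looks roughly like this (a sketch only - the host names are placeholders, it assumes Redis >= 6.2 for the exclusive "(" range syntax, and I haven't validated Dragonfly's behavior with explicit XADD IDs at scale):

import redis

src = redis.Redis(host="redis", port=6379, decode_responses=True)
dst = redis.Redis(host="dragonfly", port=6379, decode_responses=True)

STREAM = "stream.opencti"
CHUNK = 1000

last_id = "-"    # "-" = start from the beginning of the stream
copied = 0
while True:
    # First page is inclusive; later pages exclude the last ID already copied
    lo = last_id if last_id == "-" else "(" + last_id
    entries = src.xrange(STREAM, min=lo, max="+", count=CHUNK)
    if not entries:
        break
    pipe = dst.pipeline(transaction=False)
    for entry_id, fields in entries:
        # Re-add with the original ID so existing consumers' cursors stay valid
        pipe.xadd(STREAM, fields, id=entry_id)
    pipe.execute()
    copied += len(entries)
    last_id = entries[-1][0]

print("copied", copied, "entries")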

ckane commented 6 months ago

FWIW I am still running regular redis 7.2.4 on my system that was pulling from ESET, for now. Next week, I may consider trying to re-deploy with this change as well, after trying to rewrite my migration script to support migrating streams by chunking them into smaller pieces, and then enable ESET again and see how that works out.

Megafredo commented 6 months ago

This is outside my area of expertise; I will bring in the platform team for the rest. @pierremahot, @sbocahu!

ckane commented 6 months ago

Ty for all the help @Megafredo!

ckane commented 6 months ago

Here is another project, KeyDB, that is backed by Snap Inc., is open source, is a drop-in replacement for Redis, and offers an eviction policy that can evict to disk rather than purging data from the database entirely: https://docs.keydb.dev/docs/flash (The disk-backed eviction feature seems to have another dependency and is advertised as "BETA" right now, but it seems worth looking into since we're trying to solve a "redis must be persistent" problem here.)

ckane commented 6 months ago

I'll try migrating one of my instances to this sometime over the next week or so and report back with my experience - it might help solve the limitation that prevents us from using maxmemory on redis today.

ckane commented 6 months ago

Seems like DragonflyDB is a decent compromise, along with setting REDIS__TRIMMING=500000. DragonflyDB returns RAM to the OS more readily, whereas Redis, once the data has been written out to dump.rdb, seems to hold onto all the RAM needed for both the on-disk snapshot and the run-time overhead.

pierremahot commented 5 months ago

Hello @ckane, just to explain a little bit and try to narrow the issue down.

REDIS__TRIMMING defines the number of messages kept in the stream. The more messages you allow, the more history you have, but also the more RAM you consume. The ESET connector may create big STIX bundles, which end up in the stream and consume a lot of memory in Redis.

The stream history may be needed by other processes to react to creations, deletions or updates, and the size of the history helps recovery from a big load, but if you don't want Redis memory to grow without bound you can reduce REDIS__TRIMMING to 100000.
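
A quick way to estimate what each unit of REDIS__TRIMMING costs in RAM is to divide the stream's memory footprint by its entry count, for example (sketch, local instance assumed):

import redis

r = redis.Redis(host="localhost", port=6379)

entries = r.xlen("stream.opencti")
total_bytes = r.memory_usage("stream.opencti")
print(f"{entries} entries, ~{total_bytes / entries / 1024:.1f} KiB per entry on average")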

ckane commented 5 months ago

Thanks @pierremahot - I am also suspecting that the issue is a small number of large-size bundles that are ending up in the redis history cache, and stick around for a long time (until being pushed out by newer items). Also noticed that it occurred on some misp-feed connector instances as well, later on, so I don't think it is a problem specific to this connector, but rather an issue resulting from how Redis is employed for this purpose.

I suspect a more resilient alternative for higher availability would be using AWS ElastiCache w/ serverless or auto-scaling deployments.

ckane commented 5 months ago

Hi @pierremahot - I did some more digging on this tonight, and it appears that what is happening is that it keeps adding new entries to the stream.opencti stream key with increasingly long lists of STIX malware entities. I wrote a simple bash loop to extract every entry in stream.opencti and store it in a file named after its id. The data in the directory added up to 13GB, which closely matches the Redis memory usage.

I am curious whether the newer entry should be overwriting the older one, rather than adding a brand-new entry that's slightly longer, because the newer items appear to contain a superset of the modifications that the older ones contain.

Here are some examples of what I am seeing when I get the message field from each really big item from the stream:

jq '.[7]' 1713040918333-0.json
"\"adds `malware--9d11ad68-46aa-4bc7-bf41-c80f9c078b9e, malware--a19dc0a8-1446-4d65-8b4e-b83f39f66113, malware--33495792-40e1-42a4-8b90-b1b58afe2bc7` in `STIX IDs` and 17767 more items\""
jq '.[7]' 1713040945081-0.json
"\"adds `malware--9d11ad68-46aa-4bc7-bf41-c80f9c078b9e, malware--a19dc0a8-1446-4d65-8b4e-b83f39f66113, malware--33495792-40e1-42a4-8b90-b1b58afe2bc7` in `STIX IDs` and 17779 more items | replaces `Detected malware tried to contact the URL.` in `Description`\""
jq '.[7]' 1713040946388-0.json
"\"adds `malware--9d11ad68-46aa-4bc7-bf41-c80f9c078b9e, malware--a19dc0a8-1446-4d65-8b4e-b83f39f66113, malware--33495792-40e1-42a4-8b90-b1b58afe2bc7` in `STIX IDs` and 17781 more items | replaces `Detected malware tried to contact the URL.` in `Description`\""

By the time the system runs out of memory, there are thousands of these in the stream, gradually growing in size from oldest (smallest) to newest (largest). Whatever is generating this repeating pattern of copying the prior stream entry with additional malware in the list - that's what is causing the memory consumption to go out of control.

ckane commented 5 months ago

Bash script I used to iterate across all entries in the stream.opencti in redis and write them to individual files on disk for analysis:

export h="0-0"
while true; do
  docker compose exec redis redis-cli --json xread count 1 streams stream.opencti "$h" > cur.json
  export h=$(jq -r '.["stream.opencti"][][0]' cur.json)
  jq -r '.["stream.opencti"][][1]' < cur.json > "$h".json
done

ckane commented 5 months ago

Many of these entries are hundreds of kB each, with the largest being 871kB (before the process crashed from running out of memory), and all contain largely redundant information that basically just extends a list with additional entities; that list is then also included in its entirety in the data field - this is what causes the exploding memory consumption.

ckane commented 5 months ago

I extracted the data field from one of the entries with the really long list (over 17,000 entries) - here is its reformatted JSON. Note that I have removed most of the STIX ids from the list for brevity, but they are all present in the original stream entry:

{
  "id": "malware--4d6fe57a-e362-55f1-a812-e7aced9ec899",
  "spec_version": "2.1",
  "type": "malware",
  "extensions": {
    "extension-definition--ea279b3e-5c71-4632-ac08-831c66a786ba": {
      "extension_type": "property-extension",
      "id": "8d6de12a-de56-4739-ad05-2d6c9ad72c5b",
      "type": "Malware",
      "created_at": "2024-04-13T10:24:37.006Z",
      "updated_at": "2024-04-13T20:42:25.403Z",
      "stix_ids": [
        "malware--9d11ad68-46aa-4bc7-bf41-c80f9c078b9e",
        "malware--a19dc0a8-1446-4d65-8b4e-b83f39f66113",
        "malware--33495792-40e1-42a4-8b90-b1b58afe2bc7",

        // .... over 17,000 ids in this list ...

        "malware--641886c5-1137-4c96-bf28-878734c0f310"
      ],
      "is_inferred": false,
      "creator_ids": [
        "88ec0c6a-13ce-5e39-b486-354fe4a7084f"
      ],
      "labels_ids": [
        "d134d29e-814b-4fa5-9b76-5780aa73034b"
      ]
    }
  },
  "created": "2024-03-29T01:05:15.000Z",
  "modified": "2024-04-13T20:42:25.403Z",
  "revoked": false,
  "confidence": 100,
  "lang": "en",
  "labels": [
    "trojan"
  ],
  "name": "PDF/Phishing.D.Gen trojan",
  "description": "Detected malware tried to contact the URL.",
  "is_family": false
}

ckane commented 5 months ago

Looking at this, it seems like it might be an effect of merging malware entities? They're coming in as unique STIX ids from the source (maybe? or maybe auto-created in the connector?), and the de-duplication feature is identifying that all of these entities match on malware name, so it folds them into a single malware entity while adding the stix_ids to the logged event. I'm testing commenting out the malware type in the connector to see if/how that changes the behavior, but these may be a result of malware relationships in other entities in the platform.

Might be worth stripping the stix_ids from what's sent to redis and seeing if that breaks anything.

ckane commented 4 months ago

OK, I think I have a better handle on what's going on here. I suspect it has to do with random STIX UUIDs being auto-generated for each occurrence of a malware with a given name, for instance "PDF/Phishing.D.Gen trojan", showing up in the ESET TAXII feed - likely because this is MISP-backed data where these would be handled as tags/galaxies/labels or something like that. Anyhow, the behavior populates a long sequence of increasingly growing lists as more random STIX ids for malware (not artifacts, but malware) get added (as aliases?) to a particular malware entity.

I'm messing with some mitigations that limit the length of the stix_ids list where they show up in the property-extension type above. I'll submit a separate PR for that after I get something working well, and continue the discussion there.

ckane commented 4 months ago

@pierremahot @Megafredo I am fairly certain that the changes in my PR #2071 will fix this bug. I still also recommend considering the changes in https://github.com/OpenCTI-Platform/opencti/pull/6774, or some alternative fix that addresses the same data issue more elegantly, as a resource-consumption optimization for how we utilize redis.

Megafredo commented 4 months ago

Hi @ckane, thanks again for your feedback and contributions! We will discuss this with the teams involved!