fox-it / flow.record

Recordization library
GNU Affero General Public License v3.0

Force the output of key-value pairs for dicts, lists and tuples in the JSON adapter #119

Open Zawadidone opened 4 months ago

Zawadidone commented 4 months ago

This issue is related to https://github.com/fox-it/dissect.target/pull/681.

The field type digest is a dict in Python, which becomes an object in JSON, as shown below.

```
target-query -t <TARGET> -f mft --hash --limit 1 | rdump -J | jq
[...]
  "path_resolved": "c:\\$MFT",
  "path_digest": {
    "md5": "5dd5bd6f342c2bceb93dc67783f3991a",
    "sha1": "de9713a272acb20fa9b34296e33e3a409675a3c7",
    "sha256": "be58856974ed849e5edcc4752d716209b80fe1914a20d8528316f8732a33697c"
  }
}
```

Depending on the search platform used, e.g. Elasticsearch (huntandhackett/ir-automation@a005f5d/logstash-dissect.conf#L54-L62), it is a hassle to store records in a structured data format without introducing blind spots. Note that we use the JSON adapter in combination with Logstash to save records in Elasticsearch, not the Elasticsearch adapter.

Because of that, I want to force dicts (dictlist), lists (typedlist) and tuples to be output as key-value pairs when using the JSON adapter:

```
target-query -t <TARGET> -f mft --hash --limit 1 | rdump -J --<SPECIFIC-FLAG> | jq
[...]
  "path_resolved": "c:\\$MFT",
  "path_digest_md5": "5dd5bd6f342c2bceb93dc67783f3991a",
  "path_digest_sha1": "de9713a272acb20fa9b34296e33e3a409675a3c7",
  "path_digest_sha256": "be58856974ed849e5edcc4752d716209b80fe1914a20d8528316f8732a33697c"
}
```

Suggested implementation:

| Field type | Python | JSON |
| ---------- | ------ | ---- |
| dict | `path_digest = {"md5": [...]}` | `{"path_digest_md5": [...]}` |
| tuple | `fieldname = (1, 2)` | `{"fieldname_0": 1, "fieldname_1": 2}` |
| list | `fieldname = [1, 2]` | `{"fieldname_0": 1, "fieldname_1": 2}` |
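
A minimal sketch of these rules in plain Python (the `flatten_field` helper and the `_` separator are illustrations, not existing flow.record API):

```python
# Hypothetical sketch of the suggested flattening rules; not flow.record API.
def flatten_field(name: str, value) -> dict:
    """Flatten a single field into key-value pairs per the table above."""
    if isinstance(value, dict):
        # dict: path_digest = {"md5": ...} -> {"path_digest_md5": ...}
        return {f"{name}_{key}": val for key, val in value.items()}
    if isinstance(value, (list, tuple)):
        # list/tuple: fieldname = [1, 2] -> {"fieldname_0": 1, "fieldname_1": 2}
        return {f"{name}_{i}": val for i, val in enumerate(value)}
    # Any other value passes through unchanged.
    return {name: value}

print(flatten_field("path_digest", {"md5": "5dd5bd6f342c2bceb93dc67783f3991a"}))
# {'path_digest_md5': '5dd5bd6f342c2bceb93dc67783f3991a'}
```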



I don't know how this would apply to the field type command or other field types that I haven't mentioned.

Zawadidone commented 4 months ago

@yunzheng do you have a suggestion on how this issue can be resolved?

JSCU-CNI commented 4 months ago

If you don't mind, I have some suggestions for your elastic setup @Zawadidone. You could probably fix this without changing flow.record. Instead, have you tried using a different processor?

You could use an ingest node pipeline to edit every document before it is ingested by elasticsearch. Kibana has a nice UI for this as well. You could also use the Logstash json filter plugin or the Filebeat decode_json_fields plugin.
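
For reference, a sketch of such an ingest pipeline created with the Python Elasticsearch client; the pipeline id, the hard-coded path_digest field and the Painless script are assumptions for illustration, not something from this thread:

```python
# Sketch: flatten one known nested field with an ingest pipeline (assumed setup).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.ingest.put_pipeline(
    id="flatten-path-digest",
    description="Flatten path_digest into top-level keys before indexing",
    processors=[
        {
            "script": {
                "lang": "painless",
                "source": """
                    if (ctx.path_digest != null) {
                        for (entry in ctx.path_digest.entrySet()) {
                            ctx['path_digest_' + entry.getKey()] = entry.getValue();
                        }
                        ctx.remove('path_digest');
                    }
                """,
            }
        }
    ],
)
```

Documents indexed with `?pipeline=flatten-path-digest` would then carry the flattened keys.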

We are thinking about open sourcing our elastic index mapping for dissect records. Is that something you would be interested in?

Zawadidone commented 4 months ago

@JSCU-CNI thanks for the suggestion.

I am aware of the solutions that Logstash and Elasticsearch provide, but to solve this issue I am looking for a solution whereby the Logstash and/or Elasticsearch configuration doesn't need to be adjusted after every dissect.target update that adds new records and field types.

Yes, I am very interested in the Elastic index mapping for Dissect records. We currently use our own Logstash configuration to ingest records into Elasticsearch and our own fork of Timesketch to perform the analysis.

We have explored using a Dissect Elasticsearch index template, or a dynamic index template that can use the same fields for different data types, but we haven't made a decision on that yet.

yunzheng commented 4 months ago

> @yunzheng do you have a suggestion on how this issue can be resolved?

I think the way you suggest is one of the better options: doing it at the field type level makes it predictable and testable.

Another "easier" way would just be to flatten the JSON dictionary in the JsonfileWriter adapter, something like https://pypi.org/project/flatten-json/ . I would probably then just drop outputting the record descriptors as they are not in sync anymore.

More difficult would be to do this generically at the record level itself (so all adapters could benefit from a --flatten flag); however, every flattened field that results in a new field would require updating the RecordDescriptor, which could be a performance issue.
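
To illustrate the descriptor problem: flattening a digest field turns one field into several, so a new RecordDescriptor would have to be built (and ideally cached) per flattened shape. A sketch with made-up descriptor names:

```python
# Sketch: a flattened record needs a different RecordDescriptor than the
# original; names below are illustrative, not real dissect.target descriptors.
from flow.record import RecordDescriptor

original = RecordDescriptor("example/digest", [
    ("digest", "path_digest"),
])

flattened = RecordDescriptor("example/digest/flattened", [
    ("string", "path_digest_md5"),
    ("string", "path_digest_sha1"),
    ("string", "path_digest_sha256"),
])
```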

Zawadidone commented 4 months ago

@yunzheng thanks for the suggestion, I will start working on a solution that flattens the JSON objects.

@JSCU-CNI we currently use the following index template, which fails if a record uses the data type object for the field example and a second record then uses the data type text for the same field, because an Elasticsearch field can't have the data types object and text at the same time. For example, once the mapping has seen "example": {"md5": [...]} (object), indexing a later document with "example": "foo" (text) fails.

Zawadidone commented 2 months ago

Relates to https://github.com/fox-it/dissect.target/issues/723

pyrco commented 2 months ago

@Zawadidone that would be a nice option to have for JSON output! Make sure to make it configurable though, as multiple adapters (currently splunk, jsonfile and elastic) use JsonRecordPacker, and not everybody expects the JSON to be flattened.