fox-it / flow.record

Recordization library

rdump splitted JSON files leads to key error #46

Closed Zawadidone closed 1 year ago

Zawadidone commented 1 year ago

Create split JSON files

target-query -f mft -t MSEDGEWIN10.tar --limit 10 -j > mft.json

# Split the single file into two files
split -dl 6 mft.json
mv x00 x00.json
mv x01 x01.json

Dump the records. The first output file can be read with rdump, but every file after that results in a JSON key error.

 rdump x00.json 
<filesystem/ntfs/mft/std hostname='MSEDGEWIN10' domain=None creation_time=2019-03-19 21:52:25.169411 last_modification_time=2019-03-19 21:52:25.169411 last_change_time=2019-03-19 21:52:25.169411 last_access_time=2019-03-19 21:52:25.169411 segment=0 path='c:/$MFT' owner='S-1-5-18' filesize=0.12 GB resident=False inuse=True volume_uuid=None>
[...]

rdump x01.json 
2023-01-23 15:31:32,042 WARNING Exception in <flow.record.adapter.jsonfile.JsonfileReader object at 0x10a0dec50> for 'x01.json': KeyError(('filesystem/ntfs/mft/std', 623539933)) -- skipping to next reader
Zawadidone commented 1 year ago

The variable self.descriptors is empty because the JSON file x01.json does not contain a record descriptor, so line 83 raises a KeyError: https://github.com/fox-it/flow.record/blob/60f9a731681eb0550c404003c4f9151d439dc9b1/flow/record/jsonpacker.py#L83
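
To illustrate the pattern (a simplified, standalone sketch with made-up field names, not the actual flow.record format or API): the reader only learns descriptors from the descriptor entry at the top of a full dump and looks them up by key for every record entry, so a chunk that starts mid-stream never fills the map and the lookup fails.

import json

# Hypothetical sketch of the failing pattern; the keys and class are made up
# for illustration and are not the real flow.record jsonfile format.
class SketchReader:
    def __init__(self):
        self.descriptors = {}  # (name, version) -> list of field names

    def read_line(self, line):
        entry = json.loads(line)
        if entry["kind"] == "descriptor":
            # Only emitted once, at the start of a full dump.
            self.descriptors[(entry["name"], entry["version"])] = entry["fields"]
            return None
        # A record entry only references its descriptor by key, so a file that
        # starts mid-stream (like x01.json) never saw the descriptor entry.
        fields = self.descriptors[(entry["name"], entry["version"])]
        return dict(zip(fields, entry["values"]))

full_dump = [
    '{"kind": "descriptor", "name": "filesystem/ntfs/mft/std", "version": 1, "fields": ["path", "segment"]}',
    '{"kind": "record", "name": "filesystem/ntfs/mft/std", "version": 1, "values": ["c:/$MFT", 0]}',
]

reader = SketchReader()
for line in full_dump:
    reader.read_line(line)        # fine: the descriptor line comes first

tail_only = SketchReader()
try:
    tail_only.read_line(full_dump[1])
except KeyError as exc:
    print("KeyError:", exc)       # same failure mode as rdump on x01.json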

yunzheng commented 1 year ago

Yes, the descriptors are only dumped at the beginning of the JSON file (as a single entry/line). I think the best way to support this properly is to add a --split option to rdump.

Zawadidone commented 1 year ago

So that feature would do the following:

yunzheng commented 1 year ago

No, instead of handling it on the read side, I mean you would dump the records already split, e.g.: rdump --split 6 -w mft.json or rdump --split 100 -w my.records.gz.

It would then dump (the postfix numbering is up for discussion):

Then you can just read the record files normally, as each file would contain the correct record descriptors. Maybe you already meant this, btw.
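
Roughly this idea, as a standalone Python sketch (hypothetical and illustrative only; the class, rotation logic and suffix naming here are made up, not a final implementation): rotate to a new output file every N records and repeat the descriptors seen so far at the top of each new part, so every part can be read on its own.

import json

# Hypothetical sketch of a split writer: rotate the output every `split`
# records and repeat the known descriptors at the top of each new part so
# every part is independently readable. Names and suffix scheme are made up.
class SketchSplitWriter:
    def __init__(self, path, split, suffix_length=2):
        self.path = path
        self.split = split
        self.suffix_length = suffix_length
        self.descriptors = []   # descriptor entries seen so far
        self.count = 0
        self.part = 0
        self.fp = None

    def _rotate(self):
        if self.fp:
            self.fp.close()
        suffix = str(self.part).zfill(self.suffix_length)
        self.fp = open(f"{self.path}.{suffix}", "w")
        self.part += 1
        self.count = 0
        for descriptor in self.descriptors:
            # Each part starts with all descriptors, like a full dump does.
            self.fp.write(json.dumps(descriptor) + "\n")

    def write_descriptor(self, descriptor):
        self.descriptors.append(descriptor)
        if self.fp:
            self.fp.write(json.dumps(descriptor) + "\n")

    def write_record(self, record):
        if self.fp is None or self.count >= self.split:
            self._rotate()
        self.fp.write(json.dumps(record) + "\n")
        self.count += 1

    def close(self):
        if self.fp:
            self.fp.close()

With --split 6 on the mft.json example above that would give something like mft.json.00 and mft.json.01, each starting with the descriptor line (the actual postfix numbering is up for discussion, as noted).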

yunzheng commented 1 year ago

@Zawadidone I've implemented a SplitWriter which allows rdump to do what was discussed in this issue, please give https://github.com/fox-it/flow.record/pull/47 a test :)

Zawadidone commented 1 year ago

Thank you for this feature, it speeds up the process a lot! Below is a quick test on my laptop with Python 3.

mkdir -p export/plugins

# Without --split it takes more than 75 minutes
target-query --list | grep ' -' | grep 'output: records' | grep -vE 'walkfs|yara|remoteaccess\.remoteaccess|browser\.history|get_all_records|example' | awk '{print $1}' > FUNCTIONS
cat FUNCTIONS | xargs -I {} -P 50 sh -c 'target-query -q -t MSEDGEWIN10.tar -f {} --no-cache --children -j >export/plugins/{}.json 2>>/dev/null'
find export/plugins -type f -print0 | xargs -r0I {} -P 50 sh -c 'rdump {} --multi-timestamp -w jsonfile://export/$(basename {} .json).jsonl?descriptors=True'

# With --split it takes 45 minutes
target-query --list | grep ' -' | grep 'output: records' | grep -vE 'walkfs|yara|remoteaccess\.remoteaccess|browser\.history|get_all_records|example' | awk '{print $1}' > FUNCTIONS
cat FUNCTIONS | xargs -I {} -P 50 sh -c 'target-query -q -t MSEDGEWIN10.tar -f {} --no-cache --children 2>>/dev/null | rdump --split 10000 -w export/plugins/{}.json --suffix-length 10 2>>/dev/null'
find export/plugins -type f -print0 | xargs -r0I {} -P 50 sh -c 'rdump {} --multi-timestamp -w jsonfile://export/$(basename {} .json).jsonl?descriptors=True'
yunzheng commented 1 year ago

Nice benchmark! And nice to see those speed improvements due to parallelisation. Are you already using PyPy, btw?

I also see you use jsonfile://*.jsonl. I will add native extension type support for .jsonl so that it works the same as .json and you can directly use rdump -w filename.jsonl without specifying the scheme.

Zawadidone commented 1 year ago

Not yet, but that improvement is on our backlog. Are there other improvements you use to speed up target-query and rdump? We use them in a processing pipeline where every Acquire or Velociraptor "package" is processed by a single temporary VM and uploaded to Elasticsearch/Timesketch using Logstash.

Ah nice feature, thanks.

yunzheng commented 1 year ago

Currently we just do the processing on one big Linux machine with beefy specs. It processes records per host/plugin just fine atm, but the parallelisation that you are doing with split files might come in handy for us in the future as well.

We use rdump's splunk:// writer to directly output records into Splunk.

Other things we do outside dissect / host forensics are writing records directly in .avro format to a Google Cloud Storage bucket, ingesting them with BigQuery and querying them there. That way we can query really fast, but still have the original data that we can work with using rdump.

Zawadidone commented 1 year ago

With PyPy 3.9 on the machine type c2d-highcpu-16 (16 CPUs and 32 GB memory) it takes 22 minutes using the acquire container MSEDGEWIN10_20220708124036.tar.

target-query --list | grep ' -' | grep 'output: records' | grep -vE 'yara|remoteaccess\.remoteaccess|browser\.history|get_all_records|example' | awk '{print $1}' > FUNCTIONS
mkdir export/plugins

# Old method takes 90 minutes
# 5 minutes
cat FUNCTIONS | xargs -I {} -P 14 sh -c 'target-query -q -t targets/* -f {} --no-cache --children -j 2>>/dev/null >export/plugins/{}.jsonl'
# 85 minutes
find export/plugins -type f -print0 | xargs -r0I {} -P 14 sh -c 'rdump {} --multi-timestamp -w jsonfile://export/$(basename {} .jsonl).jsonl?descriptors=True'

# New method takes 22 minutes
# 5 minutes
cat FUNCTIONS | xargs -I {} -P 14 sh -c 'target-query -q -t targets/* -f {} --no-cache --children 2>>/dev/null | rdump --split 10000 -w export/plugins/{}.jsonl --suffix-length 10 2>>/dev/null'
# 17 minutes
find export/plugins -type f -print0 | xargs -r0I {} -P 14 sh -c 'rdump {} --multi-timestamp -w jsonfile://export/$(basename {} .jsonl).jsonl?descriptors=True'

I still have to do some testing with --suffix-length, the maximum number of xargs processes, and the machine type c2-standard-30.

yunzheng commented 1 year ago

@Zawadidone --suffix-length will not have any effect on the speed, it only affects the filename. Or did you mean --split N? :)

Interesting benchmark, and nice that it can be done on deterministic data and hardware settings. Please also try https://github.com/fox-it/flow.record/pull/51, it should speed up --multi-timestamp significantly.

Zawadidone commented 1 year ago

Yes, I meant --split, it was already late :p

I will, thanks for these enhancements.