Create split JSON files

Dump records. The first file output can be read with rdump, but all files after that result in a JSON key error.

Closed: Zawadidone closed this issue 1 year ago.
The variable self.descriptors is empty, because the JSON file x01.json does not contain a record descriptor. Because of that, line 83 leads to a KeyError: https://github.com/fox-it/flow.record/blob/60f9a731681eb0550c404003c4f9151d439dc9b1/flow/record/jsonpacker.py#L83
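For illustration, this is presumably how a file like x01.json ends up without a descriptor (an assumption: the records file was chunked with GNU split, which would match the x00/x01 naming):

# naive line-based chunking of a records file
split -d -l 100000 --additional-suffix=.json records.json x
# x00.json starts with the record descriptor line and reads fine;
# x01.json starts mid-stream without a descriptor, so rdump raises a KeyError
rdump x01.json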
Yes, the descriptors are only dumped at the beginning of the JSON file (as a single entry/line). I think the best way to have proper support is to add a --split option to rdump.
So that feature would do the following: rdump --split 6 mft.json would produce mft<suffix>.json files, each starting with the record descriptor followed by the 6 lines (repeated until the end of the file).

No, instead of reading I meant that you need to dump the records as split files, e.g. rdump --split 6 -w mft.json or rdump --split 100 -w my.records.gz. It would then dump multiple output files (the postfix numbering is up for discussion). Then you can just read the record files normally, as each file would have the correct record descriptors. Maybe you already meant this, btw.
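A sketch of the intended behaviour (the output file names below are hypothetical, since the suffix numbering was still up for discussion):

# dump records in chunks of 6 per file; each output file is self-contained
rdump records.json --split 6 -w mft.json
# hypothetical resulting files, each starting with its own record descriptors:
# mft.0000.json  mft.0001.json  mft.0002.json  ...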
@Zawadidone I've implemented a SplitWriter which allows rdump to do what was discussed in this issue, please give https://github.com/fox-it/flow.record/pull/47 a test :)
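A minimal way to try it out, using the --split and --suffix-length flags that appear in the benchmark commands below (the file names are placeholders):

# split the input into files of 10000 records each, with a 10-character suffix
rdump records.json --split 10000 --suffix-length 10 -w out/records.json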
Thank you for this feature, this speeds up the process a lot! Below is a quick test on my laptop with Python 3.
mkdir -p export/plugins
# Without --split it takes 75++ minutes
target-query --list | grep ' -' | grep 'output: records' | grep -vE 'walkfs|yara|remoteaccess\.remoteaccess|browser\.history|get_all_records|example' | awk '{print $1}' > FUNCTIONS
cat FUNCTIONS | xargs -I {} -P 50 sh -c 'target-query -q -t MSEDGEWIN10.tar -f {} --no-cache --children -j >export/plugins/{}.json 2>>/dev/null'
find export/plugins -type f -print0 | xargs -r0I {} -P 50 sh -c 'rdump {} --multi-timestamp -w jsonfile://export/$(basename {} .json).jsonl?descriptors=True'
# With --split it takes 45 minutes
target-query --list | grep ' -' | grep 'output: records' | grep -vE 'walkfs|yara|remoteaccess\.remoteaccess|browser\.history|get_all_records|example' | awk '{print $1}' > FUNCTIONS
cat FUNCTIONS | xargs -I {} -P 50 sh -c 'target-query -q -t MSEDGEWIN10.tar -f {} --no-cache --children 2>>/dev/null | rdump --split 10000 -w export/plugins/{}.json --suffix-length 10 2>>/dev/null'
find export/plugins -type f -print0 | xargs -r0I {} -P 50 sh -c 'rdump {} --multi-timestamp -w jsonfile://export/$(basename {} .json).jsonl?descriptors=True'
Nice benchmark! And nice to see those speed improvements due to parallelisation. Are you already using pypy, btw?
I also see you use jsonfile://*.jsonl; I will add native extension type support when using .jsonl as the extension, so that it works the same as .json and you can directly use rdump -w filename.jsonl without specifying the scheme.
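In practice that would mean (a sketch, assuming the extension mapping lands as described):

# current: the jsonfile:// scheme must be given explicitly for .jsonl output
rdump records.json -w 'jsonfile://out.jsonl?descriptors=True'
# with native .jsonl support, the extension alone would select the adapter:
rdump records.json -w out.jsonl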
Not yet, but that improvement is part of our backlog. Are there other improvements you use to improve the speed of target-query and rdump? We use them in a processing pipeline where every Acquire or Velociraptor "package" is processed by a single temporary VM and uploaded to Elasticsearch/Timesketch using Logstash.
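A minimal sketch of one step of such a pipeline, built from the commands already shown in this thread (the package name, function, and output path are hypothetical):

# process a single acquire package on a temporary VM and emit JSONL that
# Logstash can pick up and ship to Elasticsearch/Timesketch
target-query -q -t package.tar -f mft --no-cache --children | \
    rdump --multi-timestamp -w 'jsonfile://out/package.jsonl?descriptors=True'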
Ah nice feature, thanks.
Currently we just do processing on a single big Linux machine with beefy specs. It processes records per host/plugin just fine at the moment, but the parallelisation that you are doing using split files might come in handy for us in the future as well.
We use rdump's splunk:// writer to directly output records into Splunk. Another thing we do outside dissect / host forensics is write records directly in .avro format onto a Google Cloud Storage bucket, ingest them using BigQuery, and query them there. That way we can query really fast, but still have the original data that we can work with using rdump.
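For illustration (the host, port, and file names here are made up; splunk:// is the writer mentioned above, and the Avro output assumes flow.record's avro adapter):

# stream records directly into Splunk
rdump records.json -w splunk://splunk.example.internal:1337
# write records as Avro, e.g. for upload to a GCS bucket and BigQuery ingestion
rdump records.json -w avro://records.avro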
With pypy version 3.9, using the machine type c2d-highcpu-16 (16 CPUs and 32 GB memory), it takes 22 minutes for the acquire container MSEDGEWIN10_20220708124036.tar.
target-query --list | grep ' -' | grep 'output: records' | grep -vE 'yara|remoteaccess\.remoteaccess|browser\.history|get_all_records|example' | awk '{print $1}' > FUNCTIONS
mkdir export/plugins
# Old method takes 90 minutes
# 5 minutes
cat FUNCTIONS | xargs -I {} -P 14 sh -c 'target-query -q -t targets/* -f {} --no-cache --children -j 2>>/dev/null >export/plugins/{}.jsonl'
# 85 minutes
find export/plugins -type f -print0 | xargs -r0I {} -P 14 sh -c 'rdump {} --multi-timestamp -w jsonfile://export/$(basename {} .jsonl).jsonl?descriptors=True'
# New method takes 22 minutes
# 5 minutes
cat FUNCTIONS | xargs -I {} -P 14 sh -c 'target-query -q -t targets/* -f {} --no-cache --children 2>>/dev/null | rdump --split 10000 -w export/plugins/{}.jsonl --suffix-length 10 2>>/dev/null'
# 17 minutes
find export/plugins -type f -print0 | xargs -r0I {} -P 14 sh -c 'rdump {} --multi-timestamp -w jsonfile://export/$(basename {} .jsonl).jsonl?descriptors=True'
I still have to do some testing with the --suffix-length, the max processes of xargs, and the machine type c2-standard-30.
@Zawadidone --suffix-length will not have any effect on the speed, it only affects the filename. Or did you mean --split N? :)
Interesting benchmark, and nice that it can be done on deterministic data and hardware settings. Please also try https://github.com/fox-it/flow.record/pull/51, it should speed up --multi-timestamp significantly.
Yes, I meant --split, it was already late :p I will, thanks for these enhancements.