WithSecureLabs / chainsaw

Rapidly Search and Hunt through Windows Forensic Artefacts
GNU General Public License v3.0

Chainsaw uses a lot of RAM when processing large individual files with a large number of detections #102

Closed. KRUXLEX closed this issue 1 year ago

KRUXLEX commented 2 years ago

In version 2.x I'm experiencing an out-of-memory issue. When I analyze more than 3-4 files it starts eating RAM. It's so hungry.

To reproduce the issue:

└─$ ~/git/chainsaw/chainsaw hunt ./dc-2/Logs -r ~/git/chainsaw/rules -s ~/git/chainsaw/sigma_rules -m ~/git/chainsaw/mappings/sigma-event-logs-all.yml --full --column-width 320 -o suspect_logs.dc-2.log

 ██████╗██╗  ██╗ █████╗ ██╗███╗   ██╗███████╗ █████╗ ██╗    ██╗
██╔════╝██║  ██║██╔══██╗██║████╗  ██║██╔════╝██╔══██╗██║    ██║
██║     ███████║███████║██║██╔██╗ ██║███████╗███████║██║ █╗ ██║
██║     ██╔══██║██╔══██║██║██║╚██╗██║╚════██║██╔══██║██║███╗██║
╚██████╗██║  ██║██║  ██║██║██║ ╚████║███████║██║  ██║╚███╔███╔╝
 ╚═════╝╚═╝  ╚═╝╚═╝  ╚═╝╚═╝╚═╝  ╚═══╝╚══════╝╚═╝  ╚═╝ ╚══╝╚══╝
    By F-Secure Countercept (@FranticTyping, @AlexKornitzer)

[+] Loading detection rules from: /home/ubuntu/git/chainsaw/rules, /home/ubuntu/git/chainsaw/sigma_rules
[+] Loaded 7064 detection rules (782 not loaded)
[+] Loading event logs from: ./dc-2/Logs (extensions: .evtx)
[+] Loaded 24 EVTX files (3.1 GB)
[+] Hunting: [==========>-----------------------------] 6/24 ⠴
zsh: killed     ~/git/chainsaw/chainsaw hunt ./dc-2/Logs -r ~/git/chainsaw/rules -s  -m

┌─[2022-09-23 23:57:41]─(ubuntu㉿ubuntu)─[/srv]
└─$ echo $? 
137

Monitoring terminal during the hunt:

┌─[2022-09-23 23:50:36]─(ubuntu㉿ubuntu)─[/srv]
└─$ free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi        30Gi       229Mi       307Mi       3,7Gi       1,2Gi
Swap:             0B          0B          0B

I think the problem is with how detections are reported: it keeps them in memory and only pushes them to the output file after all the analysis has finished. With one or 2-3 files that's fine, but if we analyze multiple files an out-of-memory condition is possible. I think it should push results to the file after each confirmed hunt.

alexkornitzer commented 2 years ago

It should be storing as little information as possible to keep RAM usage to a minimum but there must be some oversight somewhere. I'll profile it when I get some time and fix the issue if I can replicate it.

alexkornitzer commented 2 years ago

Quick one: do you get insane RAM usage if you output in JSON (-j)?

KRUXLEX commented 2 years ago

Just checked. Yup, with JSON it is similar.

alexkornitzer commented 2 years ago

Okay, thanks for checking; I know exactly what the issue is. I'll try and get a solution in place this coming week.

KRUXLEX commented 2 years ago

Thanks, the workaround for now is a bash for loop :D
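
For the record, that workaround amounts to invoking Chainsaw once per file, something along these lines (a sketch only; arguments are copied from the command above, and the per-file output name is an assumption):

└─$ for f in ./dc-2/Logs/*.evtx; do ~/git/chainsaw/chainsaw hunt "$f" -r ~/git/chainsaw/rules -s ~/git/chainsaw/sigma_rules -m ~/git/chainsaw/mappings/sigma-event-logs-all.yml --full --column-width 320 -o "suspect_logs.$(basename "$f").log"; done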

alexkornitzer commented 2 years ago

@KRUXLEX, if you are able to test out the fix/memory branch that would be great (you will need to build with cargo build --release). Unfortunately this fix always has to trade speed for space, but the slowdown should not be too bad. There are some CPU optimisations that can be done to get closer to the original performance, which I will try to do when I have time.
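
For anyone following along, building that branch looks roughly like this (assuming the repository is already cloned at ~/git/chainsaw as in the report above; target/release/chainsaw is the standard cargo output path):

└─$ cd ~/git/chainsaw
└─$ git fetch origin && git checkout fix/memory
└─$ cargo build --release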

alexkornitzer commented 2 years ago

This is now in master.

KRUXLEX commented 1 year ago

It's still eating my RAM :)

alexkornitzer commented 1 year ago

Can you please give me an example of what you are running Chainsaw on? The only case I can think of where it will eat that much RAM is on a very noisy ruleset (default sigma and the all mapping file) on a very large set of data. This is because it currently stores all hits in memory for presentation at the end.

What I could do to handle these cases is add an option that asks Chainsaw to page the hits to disk. This will result in a slower run but will use far less memory.

A temporary workaround is to output the data as jsonl, as this will stop the caching for terminal presentation.
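
For example, something along these lines (a sketch reusing the arguments from the original report; the -o output path is an assumption):

└─$ ~/git/chainsaw/chainsaw hunt ./dc-2/Logs -r ~/git/chainsaw/rules -s ~/git/chainsaw/sigma_rules -m ~/git/chainsaw/mappings/sigma-event-logs-all.yml --jsonl -o suspect_logs.dc-2.jsonl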

To summarise, lob me your use case and I will see what I can do.

jfstenuit commented 1 year ago

I think I've hit the same issue. Hunting on an old domain controller, 251 evtx files and a 3.9GB security event log. Default Sigma and all mapping file. Output to csv or jsonl doesn't make a difference.

alexkornitzer commented 1 year ago

That will be the issue: a 3.9GB file will explode in size when loaded into RAM, especially with the default Sigma ruleset. And currently --jsonl only works at the per-file boundary rather than per entry in a file.

I'll go back to the drawing board and see what I can come up with. Current thinking is to output --jsonl at the entry boundary, but again that would not solve the problem for normal output.

jfstenuit commented 1 year ago

Note that you can work around the issue by using --from and --to to analyze only a subset of events.
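
A restricted hunt might look like the sketch below (the exact timestamp format expected by --from/--to is an assumption, so check chainsaw hunt --help):

└─$ ~/git/chainsaw/chainsaw hunt ./dc-2/Logs -s ~/git/chainsaw/sigma_rules -m ~/git/chainsaw/mappings/sigma-event-logs-all.yml --from "2022-09-01T00:00:00" --to "2022-09-08T00:00:00"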

irfaan0999 commented 1 year ago

Hello,

It is using 50% CPU and more than 1 GB of RAM. I am running it at 5-minute intervals, matching events against 1302 Sigma rules. Is it possible to optimize the program so it does not consume that many resources?

alexkornitzer commented 1 year ago

Right, so there are things we can do here, but they depend on users' use cases. My current thoughts on potential solutions:

I am happy to implement either, or maybe both, but only if they match people's use cases, as it will take me a bit of time to get these features in.

irfaan0999 commented 1 year ago

Both seem like great solutions. The use case is that the program reads event logs from the past 5 minutes and matches them against Sigma rules every 5 minutes, outputting the results in JSON format. We would prefer stable resource usage over speed.

alexkornitzer commented 1 year ago

Okay, let me see what I can do.

alexkornitzer commented 1 year ago

Right peeps, please give 2.7.0 a try using a combination of --cache-to-disk and --jsonl. During my tests, RAM usage for a 350MB file went from 490MB down to 314MB. Base RAM usage is 240MB for the default Sigma ruleset due to how pre-processing of that data is done for now, as again speed was prioritised over space.
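
For example, a hunt using both new options might look like this (a sketch with hypothetical paths):

└─$ chainsaw hunt ./Logs -s ./sigma_rules -m ./mappings/sigma-event-logs-all.yml --cache-to-disk --jsonl -o hits.jsonl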

irfaan0999 commented 1 year ago

Hello bro,

I cannot download the package; it is being flagged as a trojan. I can download older versions without issue.

alexkornitzer commented 1 year ago

This will be the case if you download the 'all' zipped bundle, as it contains Sigma rules and poor AV engines will flag on them. The best way around this is to download the rules from Sigma directly and just get the Chainsaw binary from the releases.

irfaan0999 commented 1 year ago

I did a test and noticed that RAM usage is still high and CPU usage fluctuates, reaching as high as 95%.

alexkornitzer commented 1 year ago

Thanks for testing, just some follow-up questions: what command-line arguments did you use, and what size of evtx did you run it on?

irfaan0999 commented 1 year ago

I launched the following command:

.\chainsaw.exe hunt C:\Windows\System32\winevt -s C:\rules --mapping C:\chainsaw\mappings\sigma-event-logs-all.yml --json --cache-to-disk

The total evtx size is 210MB.

Then I ran the command specifying a 5-minute timeframe, and CPU usage was around 50% with 1+ GB of RAM.

irfaan0999 commented 1 year ago

I think it would also be nice to be able to select a specific evtx file instead of a folder path, e.g. selecting only the 'Microsoft-Windows-Sysmon%4Operational.evtx' file, because many Sigma rules depend on Sysmon.

alexkornitzer commented 1 year ago

You need to use --jsonl, not --json, but Chainsaw should be preventing that invalid combination anyway, so that is a bug that needs fixing.

Chainsaw already supports selecting individual files; are you sure the path you used there is valid?
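
For reference, a single-file hunt would look something like this sketch (the Sysmon log path is an assumption; note the single quotes, which matter in PowerShell as discussed below):

.\chainsaw.exe hunt 'C:\Windows\System32\winevt\Logs\Microsoft-Windows-Sysmon%4Operational.evtx' -s C:\rules --mapping C:\chainsaw\mappings\sigma-event-logs-all.yml --jsonl --cache-to-disk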

irfaan0999 commented 1 year ago

I get the same results using jsonl; CPU and RAM usage are still high. I confirm the path I used is valid. If I remove the file name it loads all the evtx files within the folder, but specifying a specific evtx file within the folder does not work. I have not tried Chainsaw on Linux; my use case is mostly Windows.

alexkornitzer commented 1 year ago

Did you use single quotes? PowerShell is notoriously bad with string escapes; there are examples of this in other closed issues.

CPU usage will be high, it's a multi-threaded application (you can set it to single-threaded in the args), but with --cache-to-disk the RAM usage should drop to around 500MB (when I tested on a 300MB sample). There is a chance that Windows is handling its memory differently to macOS and Linux. Were you still getting 1GB+ even with -c and --jsonl?
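
Putting the suggestions together, the command posted earlier would become something like this sketch (assuming --num-threads takes a thread count, as used in the next comment):

.\chainsaw.exe hunt C:\Windows\System32\winevt -s C:\rules --mapping C:\chainsaw\mappings\sigma-event-logs-all.yml --jsonl --cache-to-disk --num-threads 1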

I do swallow (.ok()) the error there, so there is also a chance that Windows is failing to create a temporary file; I will improve the debugging output for that too so it can be ruled out.

irfaan0999 commented 1 year ago

Using --num-threads reduced CPU usage by 50%. However, even with -c --jsonl the RAM usage is still the same (1 GB+).

alexkornitzer commented 1 year ago

Okay, so I have managed to optimise it a little further; that is present in v2.7.1. There is probably more I can do, but I am waiting on a really large evtx to make optimisation easier.

irfaan0999 commented 1 year ago

RAM usage has decreased by 50%. It seems good.