DFIRKuiper / Kuiper

Digital Forensics Investigation Platform

Add parsers for Exchange logs and IIS logs #40

Closed: heck-gd closed this 2 years ago

heck-gd commented 2 years ago

This pull request adds parsers for textual log files found on Exchange servers, such as IIS logs, Exchange Control Panel logs and Exchange HttpProxy logs.
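For context, the IIS logs targeted here use the W3C extended format, where a `#Fields:` directive names the space-separated columns and `-` marks an empty value. A minimal sketch of reading that format (illustrative only, not the parser submitted in this PR):

```python
# Minimal sketch of reading an IIS log in W3C extended format.
# Column names come from the "#Fields:" directive; "-" marks an
# empty value. Illustrative only, not the PR's actual parser.

def parse_iis_w3c(path):
    fields, records = [], []
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.rstrip("\r\n")
            if line.startswith("#Fields:"):
                # The directive can repeat (e.g. after an IIS restart),
                # so always honor the most recent one.
                fields = line[len("#Fields:"):].split()
            elif line.startswith("#") or not line:
                continue  # other directives (#Software, #Date, ...) and blanks
            elif fields:
                values = line.split(" ")
                records.append({k: (None if v == "-" else v)
                                for k, v in zip(fields, values)})
    return records
```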

muteb commented 2 years ago

Awesome parsers, and big thanks, heck-gd. I always ask Salah for such parsers, and his defence is that if the IIS logs are very big, they will affect the performance of the platform and Elasticsearch might crash. BTW, I looked into your profile and there is not much information about you. Do you have an email or Twitter account? Please pass it along so we can maintain your credit.

heck-gd commented 2 years ago

Well, Salah is not wrong, we did have issues with very big IIS logs that basically made the Celery process run out of memory. One big problem with Kuiper is that it does not allow parsers to stream results via generators - everything has to be submitted in a single huge list at the end. Luckily most of the logs we have to process are smaller and thus not an issue.

I'm not looking for personal credit. You may credit G DATA Advanced Analytics GmbH.

salehmuhaysin commented 2 years ago

Hi, thank you for your contribution. Regarding the issue of large files with Celery: Kuiper supports two types of parser results, a plain list of records returned in memory, or a file handle to results written on disk.

The idea of the file handle is that the parser writes its results to disk in small chunks, so they are not held in memory during parsing, and then returns the file handle. Kuiper reads the file back in small chunks to push it into the database, so it is never fully loaded into memory either.
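A sketch of that pattern, assuming the parser serializes records as newline-delimited JSON (the record shape and on-disk format here are illustrative; the authoritative contract is whatever Kuiper's chunked reader expects):

```python
import json
import tempfile

def parse_large_log(path):
    # Append each record to a temp file instead of a growing list,
    # so memory use stays flat regardless of log size.
    out = tempfile.NamedTemporaryFile(mode="w+", suffix=".json", delete=False)
    with open(path, "r", errors="replace") as f:
        for line in f:
            record = {"message": line.rstrip("\n")}  # illustrative record shape
            out.write(json.dumps(record) + "\n")
    out.seek(0)
    return out  # Kuiper reads this back in small chunks for indexing
```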

Note: before it starts parsing a file, Kuiper checks the file size. If there is not enough free memory for file_size * 1.5, it skips the file and comes back to it once memory is available (this mainly affects parsing when the input is a large dump). [This function validates the available space](https://github.com/DFIRKuiper/Kuiper/blob/4059627957c299d11bdd2dea0aa54a4622732826/kuiper/app/controllers/parser_management.py#L730).
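That safeguard amounts to something like the following check (a sketch using psutil; the real logic lives in parser_management.py, linked above):

```python
import os
import psutil  # assumed dependency for this sketch

def enough_memory_to_parse(path, factor=1.5):
    # Skip the file for now if available RAM is below file_size * 1.5;
    # the platform retries it once memory frees up.
    return psutil.virtual_memory().available > os.path.getsize(path) * factor
```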

My concern with large-log data types (such as IIS logs) is the database (Elasticsearch). ES uses the concept of scrolling rather than pagination, so you cannot open page 10,000 directly; you have to walk from page 1 to page 10,000, which is very costly for performance. In Kuiper I use the from parameter in the ES query, which makes it look like pagination. With a small number of records you will not notice a performance issue or high memory utilization, but with a huge number of records it gets much worse.

From my experience on a server (32 cores, 64 GB RAM, 32 GB of it for the ES JVM), the maximum number of records per case (which is an ES index) is about 35M; beyond that you will notice slowness in the search. Separating the Elasticsearch Docker container onto a different server and using multiple nodes in the cluster may solve the issue, but I have not tried it.
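For illustration, the from/size style of query looks like this with the elasticsearch Python client (8.x style; index name and node address are hypothetical). ES still has to collect and discard every hit before the offset on each request, which is why deep pages get expensive:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical node
page, size = 1000, 25

# Offset-style "pagination": ES gathers page*size + size hits and throws
# the first page*size away, so cost grows with the offset. By default
# from + size is also capped at 10,000 (index.max_result_window).
resp = es.search(
    index="case_example",
    from_=page * size,
    size=size,
    query={"match_all": {}},
)
hits = resp["hits"]["hits"]
```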

heck-gd commented 2 years ago

Thank you for the clarification on supported parser outputs. I knew there was special logic for WinEvents but didn't know it was a general facility that could be used by any parser. I've changed the IIS log parser now. The Exchange log parser shouldn't need it because those logs are typically smaller (because they are split up by hour and not by day).

I'll say that it feels a bit clunky to have to write the results to a file first. Supporting this pattern makes sense for external tools like evtxdump where you simply cannot stream the results directly, but for other parsers that are implemented completely in Python, it would be far preferable to be able to use the yield syntax.
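For comparison, a hypothetical generator-based parser interface (not something Kuiper supports today) would let the engine index records in fixed-size batches without ever holding the full result set in memory:

```python
def parse_iis_log_streaming(path):
    # Hypothetical streaming variant: yield records one at a time
    # instead of returning them all in a single list at the end.
    fields = []
    with open(path, "r", errors="replace") as f:
        for line in f:
            line = line.rstrip("\r\n")
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            elif not line.startswith("#") and line and fields:
                yield dict(zip(fields, line.split(" ")))

# A consumer could then bulk-index fixed-size batches:
batch = []
for record in parse_iis_log_streaming("u_ex210101.log"):
    batch.append(record)
    if len(batch) >= 1000:
        ...  # push batch to the database, then batch.clear()
```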

I see your point that a single ES node becomes a bottleneck at some point. However, then it's up to the user either to provide sufficient hardware or simply not to ingest multi-gigabyte logs. It's not a valid argument for not offering such parsers in the software, imo :)