m1357 commented 4 years ago

Hi All, I have a relatively large zeek cluster and am trying to pipe logs to rita. I generate ~1TB a day of logs and have a machine dedicated to running rita with 32 cores 256GB of ram on centos 7. I've tried launching the rolling import using :+1: /usr/local/bin/rita import --threads 26 --rolling /zeek/logs/$(date --date='-1 hour' +\%Y-\%m-\%d)/ zeek_24 but CPU usage seems to remain relatively low , around 1.7-3% of total available. Are there other tuning points I've missed to open up the flood gates?

m1357 commented 4 years ago

One thing I notice about the import is that it only parses one hour at a time. The cron job does not finish in time and begins to import the same hour repeatedly with every spawned job. Can this be done in parallel (i.e import multiple hours simultaneously)? Log snippet: ime="2020-09-04T11:00:02-07:00" level=info msg="The dataset [ zeek_24 ] is being converted to a rolling dataset." time="2020-09-04T11:01:33-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/http.00:00:00-01:00:00.log.gz" time="2020-09-04T11:03:02-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/ssl.00:00:00-01:00:00.log.gz" time="2020-09-04T11:59:57-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/dns.00:00:00-01:00:00.log.gz" time="2020-09-04T12:00:01-07:00" level=info msg="Running Command: import" Arguments="[/data2/bro-cluster/logs/2020-09-04/ zeek_24]" Flag(rolling)=true time="2020-09-04T12:00:01-07:00" level=info msg="Importing [/data2/bro-cluster/logs/2020-09-04/]\n" time="2020-09-04T12:01:23-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/http.00:00:00-01:00:00.log.gz" time="2020-09-04T12:02:47-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/ssl.00:00:00-01:00:00.log.gz" time="2020-09-04T12:55:55-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/dns.00:00:00-01:00:00.log.gz" time="2020-09-04T13:00:01-07:00" level=info msg="Running Command: import" Arguments="[/data2/bro-cluster/logs/2020-09-04/ zeek_24]" Flag(rolling)=true time="2020-09-04T13:00:02-07:00" level=info msg="Importing [/data2/bro-cluster/logs/2020-09-04/]\n" time="2020-09-04T13:01:31-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/http.00:00:00-01:00:00.log.gz" time="2020-09-04T13:02:50-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/ssl.00:00:00-01:00:00.log.gz" time="2020-09-04T13:58:56-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/dns.00:00:00-01:00:00.log.gz" time="2020-09-04T14:00:01-07:00" level=info msg="Running Command: import" Arguments="[/data2/bro-cluster/logs/2020-09-04/ zeek_24]" Flag(rolling)=true time="2020-09-04T14:00:01-07:00" level=info msg="Importing [/data2/bro-cluster/logs/2020-09-04/]\n" time="2020-09-04T14:01:32-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/http.00:00:00-01:00:00.log.gz" time="2020-09-04T14:02:52-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/ssl.00:00:00-01:00:00.log.gz" time="2020-09-04T14:55:33-07:00" level=info msg="Running Command: show-beacons" Arguments="[zeek_24]" time="2020-09-04T14:55:49-07:00" level=info msg="Running Command: show-long-connections" Arguments="[zeek_24]" time="2020-09-04T14:56:10-07:00" level=info msg="Running Command: show-exploded-dns" Arguments="[zeek_24]" time="2020-09-04T14:57:27-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/dns.00:00:00-01:00:00.log.gz"

ethack commented 4 years ago

For your second comment, your rita command's behavior is to import any new data in the specified directory. When rita is finished it stores hashes of the files so that it won't process the same ones again. If there is only one new hour's worth of logs in that directory since the last time rita finished then that's what it will process. But if there are more logs there then it will take all new log files and process them into the same "chunk".

If rita hasn't finished processing and starts again then it will most likely be processing more than one hour of data on subsequent runs. From your logs it looks like the dns log is taking nearly an hour to finish importing and the conn log doesn't seem to be importing at all. I'm guessing that either 1) you don't have a conn log in that directory, 2) the previous rita process is being killed by cron (I don't think cron does this), or 3) the conn log is taking longer than the 5 hours of logs you copied (very possible based on the size of your data and having concurrent rita jobs running).

To fix this I'd recommend using flock to prevent rita from running again if it hasn't finished its previous run. Something like this should work (also consider using the --nonblock option on flock):

flock /tmp/rita.lock /usr/local/bin/rita import --threads 26 --rolling /zeek/logs/$(date --date='-1 hour' +%Y-%m-%d)/ zeek_24

As for opening the flood gates it does seem strange your CPU usage is so low. Your log files are gzipped and your log files are very large so just the decompressing step should cause the CPU to peg. Assuming you have Mongo installed on the same system (which you do if you used our install script to install rita and Mongo) then that should consume CPU during the import as well. I'm not quite sure what's going on here but my best guess is there might be disk read deadlocks if you have multiple concurrent rita jobs running (since the previous ones haven't finished yet). If that's the case then using flock and rebooting or killing any current rita processes will help with that too.

Just the fact that dns is taking so long to import, conn never finishes, and you said your logs are 1 TB makes me think that even fixing any deadlocks, that you're reaching the limit of what rita can handle. The next steps I'd recommend would be:

Tuning your filtering config to exclude everything you can with NeverInclude and InternalSubnets.
Consider decreasing your Strobe limit. This will reduce the amount of processing needed when doing beaconing analysis by excluding results with the most connections.
Turn off database logging if you aren't using that.
Consider disabling analysis steps you don't care about such as Blacklisted or UserAgent.
Another way to disable certain analysis modules is by being selective with the log files you pass to rita. Currently, rita will process the conn, dns, http, and ssl Zeek logs. If you don't want to use the analysis modules the log file supports you can just not pass that log file to rita. To do this you'd have to change from passing a directory to rita to passing file paths instead. For example:

flock /tmp/rita.lock /usr/local/bin/rita import --threads 26 --rolling /zeek/logs/$(date --date='-1 hour' +%Y-%m-%d)/conn.*.gz /zeek/logs/$(date --date='-1 hour' +%Y-%m-%d)/dns.*.gz zeek_24

I believe these are the mappings between log files and the show- commands for your reference:

conn - beacons, bl-source-ips, bl-dest-ips, long-connections, strobes
dns - exploded-dns, bl-hostnames
http - useragents, bl-hostnames (possibly, though I can't remember for sure)
ssl - Currently unused

m1357 commented 4 years ago

Sorry I haven't had a lot of time to work on this but a bit more data. I timed one day of data import: real 31365m41.633s user 65622m31.321s sys 6244m41.171s

The Rita dataset for the day in mongo is 71GB.

My question on the import step is does this need to be done in a linear fashion or can we import multiple files simultaneously?

On the cron issue the first file never finishes in the allotted time and so the import step begins processing the same file again.

ethack commented 4 years ago

That dataset sounds way bigger than anything I've encountered. You did say you have a large Zeek cluster, but just to rule out a config issue would you be willing to share the contents of the Filtering section in your /etc/rita/config.yaml file? That would have the values in youNeverInclude, AlwaysInclude, and InternalSubnets sections. If the IPs are sensitive feel free to redact them with x's but leave as much as you can. (e.g. 10.50.3.0/24 => 10.x.x.x/24)

m1357 commented 3 years ago

V4 seems to be processing much faster and utilizing hardware (seeing all 64 cores active now when processing) Here is the filter section: Filtering:

These are filters that affect the import of connection logs. They

# currently do not apply to dns or http logs.
# A good reference for networks you may wish to consider is RFC 5735.
# https://tools.ietf.org/html/rfc5735#section-4

# Example: AlwaysInclude: ["192.168.1.2/32"]
# This functionality overrides the NeverInclude and InternalSubnets
# section, making sure that any connection records containing addresses from
# this range are kept and not filtered
AlwaysInclude: []

# Example: NeverInclude: ["255.255.255.255/32"]
# This functions as a whitelisting setting, and connections involving
# ranges entered into this section are filtered out at import time
NeverInclude:
 - 0.0.0.0/32          # "This" Host           RFC 1122, Section 3.2.1.3
 - 127.0.0.0/8         # Loopback              RFC 1122, Section 3.2.1.3
 - 169.254.0.0/16      # Link Local            RFC 3927
 - 224.0.0.0/4         # Multicast             RFC 3171
 - 255.255.255.255/32  # Limited Broadcast     RFC 919, Section 7
 - ::1/128             # Loopback              RFC 4291, Section 2.5.3
 - fe80::/10           # Link local            RFC 4291, Section 2.5.6
 - ff00::/8            # Multicast             RFC 4291, Section 2.7

# Example: InternalSubnets: ["10.0.0.0/8","172.16.0.0/12","192.168.0.0/16"]
# This allows a user to identify their internal network, which will result
# in any internal to internal and external to external connections being
# filtered out at import time. Reasonable defaults are provided below
# but need to be manually verified against each installation before enabling.
InternalSubnets:
 - 172.16.0.0/12      
 - x.x.0.0/16     
 - x.x.0.0/16     
 - xxxx:xxxx::/32     
 - xxx.64.0.0/16      
 - xxx.xxx.220.0/24    
 - xxx.xx.223.0/24    
 - 192.168.0.0/16     
 - xxx.xx.107.0/25    
 - xxx.xx.0.0/14     
 - xxxx:xxx:f108::/48 
 - 10.0.0.0/8

activecm / rita-legacy

Large bro data sets #584

These are filters that affect the import of connection logs. They