Open m1357 opened 4 years ago
One thing I notice about the import is that it only parses one hour at a time. The cron job does not finish in time and begins to import the same hour repeatedly with every spawned job. Can this be done in parallel (i.e import multiple hours simultaneously)? Log snippet: ime="2020-09-04T11:00:02-07:00" level=info msg="The dataset [ zeek_24 ] is being converted to a rolling dataset." time="2020-09-04T11:01:33-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/http.00:00:00-01:00:00.log.gz" time="2020-09-04T11:03:02-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/ssl.00:00:00-01:00:00.log.gz" time="2020-09-04T11:59:57-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/dns.00:00:00-01:00:00.log.gz" time="2020-09-04T12:00:01-07:00" level=info msg="Running Command: import" Arguments="[/data2/bro-cluster/logs/2020-09-04/ zeek_24]" Flag(rolling)=true time="2020-09-04T12:00:01-07:00" level=info msg="Importing [/data2/bro-cluster/logs/2020-09-04/]\n" time="2020-09-04T12:01:23-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/http.00:00:00-01:00:00.log.gz" time="2020-09-04T12:02:47-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/ssl.00:00:00-01:00:00.log.gz" time="2020-09-04T12:55:55-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/dns.00:00:00-01:00:00.log.gz" time="2020-09-04T13:00:01-07:00" level=info msg="Running Command: import" Arguments="[/data2/bro-cluster/logs/2020-09-04/ zeek_24]" Flag(rolling)=true time="2020-09-04T13:00:02-07:00" level=info msg="Importing [/data2/bro-cluster/logs/2020-09-04/]\n" time="2020-09-04T13:01:31-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/http.00:00:00-01:00:00.log.gz" time="2020-09-04T13:02:50-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/ssl.00:00:00-01:00:00.log.gz" time="2020-09-04T13:58:56-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/dns.00:00:00-01:00:00.log.gz" time="2020-09-04T14:00:01-07:00" level=info msg="Running Command: import" Arguments="[/data2/bro-cluster/logs/2020-09-04/ zeek_24]" Flag(rolling)=true time="2020-09-04T14:00:01-07:00" level=info msg="Importing [/data2/bro-cluster/logs/2020-09-04/]\n" time="2020-09-04T14:01:32-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/http.00:00:00-01:00:00.log.gz" time="2020-09-04T14:02:52-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/ssl.00:00:00-01:00:00.log.gz" time="2020-09-04T14:55:33-07:00" level=info msg="Running Command: show-beacons" Arguments="[zeek_24]" time="2020-09-04T14:55:49-07:00" level=info msg="Running Command: show-long-connections" Arguments="[zeek_24]" time="2020-09-04T14:56:10-07:00" level=info msg="Running Command: show-exploded-dns" Arguments="[zeek_24]" time="2020-09-04T14:57:27-07:00" level=info msg="Finished parsing file" path="/data2/bro-cluster/logs/2020-09-04/dns.00:00:00-01:00:00.log.gz"
For your second comment, your rita command's behavior is to import any new data in the specified directory. When rita is finished it stores hashes of the files so that it won't process the same ones again. If there is only one new hour's worth of logs in that directory since the last time rita finished then that's what it will process. But if there are more logs there then it will take all new log files and process them into the same "chunk".
If rita hasn't finished processing and starts again then it will most likely be processing more than one hour of data on subsequent runs. From your logs it looks like the dns log is taking nearly an hour to finish importing and the conn log doesn't seem to be importing at all. I'm guessing that either 1) you don't have a conn log in that directory, 2) the previous rita process is being killed by cron (I don't think cron does this), or 3) the conn log is taking longer than the 5 hours of logs you copied (very possible based on the size of your data and having concurrent rita jobs running).
To fix this I'd recommend using flock
to prevent rita from running again if it hasn't finished its previous run. Something like this should work (also consider using the --nonblock option on flock):
flock /tmp/rita.lock /usr/local/bin/rita import --threads 26 --rolling /zeek/logs/$(date --date='-1 hour' +%Y-%m-%d)/ zeek_24
As for opening the flood gates it does seem strange your CPU usage is so low. Your log files are gzipped and your log files are very large so just the decompressing step should cause the CPU to peg. Assuming you have Mongo installed on the same system (which you do if you used our install script to install rita and Mongo) then that should consume CPU during the import as well. I'm not quite sure what's going on here but my best guess is there might be disk read deadlocks if you have multiple concurrent rita jobs running (since the previous ones haven't finished yet). If that's the case then using flock
and rebooting or killing any current rita processes will help with that too.
Just the fact that dns is taking so long to import, conn never finishes, and you said your logs are 1 TB makes me think that even fixing any deadlocks, that you're reaching the limit of what rita can handle. The next steps I'd recommend would be:
NeverInclude
and InternalSubnets
.flock /tmp/rita.lock /usr/local/bin/rita import --threads 26 --rolling /zeek/logs/$(date --date='-1 hour' +%Y-%m-%d)/conn.*.gz /zeek/logs/$(date --date='-1 hour' +%Y-%m-%d)/dns.*.gz zeek_24
I believe these are the mappings between log files and the show-
commands for your reference:
Sorry I haven't had a lot of time to work on this but a bit more data. I timed one day of data import: real 31365m41.633s user 65622m31.321s sys 6244m41.171s
The Rita dataset for the day in mongo is 71GB.
My question on the import step is does this need to be done in a linear fashion or can we import multiple files simultaneously?
On the cron issue the first file never finishes in the allotted time and so the import step begins processing the same file again.
That dataset sounds way bigger than anything I've encountered. You did say you have a large Zeek cluster, but just to rule out a config issue would you be willing to share the contents of the Filtering
section in your /etc/rita/config.yaml
file? That would have the values in youNeverInclude
, AlwaysInclude
, and InternalSubnets
sections. If the IPs are sensitive feel free to redact them with x's but leave as much as you can. (e.g. 10.50.3.0/24 => 10.x.x.x/24)
V4 seems to be processing much faster and utilizing hardware (seeing all 64 cores active now when processing) Here is the filter section: Filtering:
# currently do not apply to dns or http logs.
# A good reference for networks you may wish to consider is RFC 5735.
# https://tools.ietf.org/html/rfc5735#section-4
# Example: AlwaysInclude: ["192.168.1.2/32"]
# This functionality overrides the NeverInclude and InternalSubnets
# section, making sure that any connection records containing addresses from
# this range are kept and not filtered
AlwaysInclude: []
# Example: NeverInclude: ["255.255.255.255/32"]
# This functions as a whitelisting setting, and connections involving
# ranges entered into this section are filtered out at import time
NeverInclude:
- 0.0.0.0/32 # "This" Host RFC 1122, Section 3.2.1.3
- 127.0.0.0/8 # Loopback RFC 1122, Section 3.2.1.3
- 169.254.0.0/16 # Link Local RFC 3927
- 224.0.0.0/4 # Multicast RFC 3171
- 255.255.255.255/32 # Limited Broadcast RFC 919, Section 7
- ::1/128 # Loopback RFC 4291, Section 2.5.3
- fe80::/10 # Link local RFC 4291, Section 2.5.6
- ff00::/8 # Multicast RFC 4291, Section 2.7
# Example: InternalSubnets: ["10.0.0.0/8","172.16.0.0/12","192.168.0.0/16"]
# This allows a user to identify their internal network, which will result
# in any internal to internal and external to external connections being
# filtered out at import time. Reasonable defaults are provided below
# but need to be manually verified against each installation before enabling.
InternalSubnets:
- 172.16.0.0/12
- x.x.0.0/16
- x.x.0.0/16
- xxxx:xxxx::/32
- xxx.64.0.0/16
- xxx.xxx.220.0/24
- xxx.xx.223.0/24
- 192.168.0.0/16
- xxx.xx.107.0/25
- xxx.xx.0.0/14
- xxxx:xxx:f108::/48
- 10.0.0.0/8
Hi All, I have a relatively large zeek cluster and am trying to pipe logs to rita. I generate ~1TB a day of logs and have a machine dedicated to running rita with 32 cores 256GB of ram on centos 7. I've tried launching the rolling import using :+1: /usr/local/bin/rita import --threads 26 --rolling /zeek/logs/$(date --date='-1 hour' +\%Y-\%m-\%d)/ zeek_24 but CPU usage seems to remain relatively low , around 1.7-3% of total available. Are there other tuning points I've missed to open up the flood gates?