jsdjjsd opened this issue 5 years ago
@jsdjjsd Can you share the RITA log file for each of the times you've run it? The log file should be in /var/lib/rita/logs and should be named with the date you ran the command. The log file has timing for individual sections and may help us narrow down where the issue is. Could you also share the config file you are using?
@ethack Sure, thanks! I have attached the terminal output, configuration, and log files from 2.0.0 and 3.0.1.
I just ran it overnight: 3.0.1 took 620 minutes to finish, and then 2.0.0 took 11 minutes in the morning. In general, it seems to be the parsing that takes much longer. These are the files I am testing on:
842M conn.00:00:00-01:00:00.log.gz - 21624409 lines
912M conn.01:00:00-02:00:00.log.gz - 23750409 lines
864M conn.02:00:00-03:00:00.log.gz - 22044009 lines
845M conn.03:00:00-04:00:00.log.gz - 21523325 lines
891M conn.04:00:00-05:00:00.log.gz - 22976009 lines
874M conn.05:00:00-06:00:00.log.gz - 22396009 lines
835M conn.06:00:00-07:00:00.log.gz - 20571609 lines
891M conn.07:00:00-08:00:00.log.gz - 21698409 lines
868M conn.08:00:00-09:00:00.log.gz - 20810009 lines
879M conn.09:00:00-10:00:00.log.gz - 21028009 lines
984M conn.10:00:00-11:00:00.log.gz - 24281209 lines
876M conn.11:00:00-12:00:00.log.gz - 21124009 lines
296M conn.12:00:00-13:00:00.log.gz - 7219415 lines
962M conn.12:00:00-13:07:18.log.gz - 23849609 lines
37M conn.13:00:00-13:02:22.log.gz - 908009 lines
830M conn.13:07:38-14:00:00.log.gz - 20386767 lines
323M conn.14:00:00-14:20:40.log.gz - 7803445 lines
610M conn.14:21:05-15:00:00.log.gz - 15139545 lines
940M conn.15:00:00-16:00:00.log.gz - 23654009 lines
935M conn.16:00:00-17:00:00.log.gz - 23717609 lines
906M conn.17:00:00-18:00:00.log.gz - 23166409 lines
907M conn.18:00:00-19:00:00.log.gz - 22991609 lines
831M conn.19:00:00-20:00:00.log.gz - 20864009 lines
820M conn.20:00:00-21:00:00.log.gz - 20502809 lines
810M conn.21:00:00-22:00:00.log.gz - 20387209 lines
855M conn.22:00:00-23:00:00.log.gz - 21792809 lines
895M conn.23:00:00-00:00:00.log.gz - 22909209 lines
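For reference, a listing like the one above can be produced with a short loop (filenames are illustrative; assumes `zcat` and GNU `du` are available):

```shell
# print size, name, and line count for each gzipped conn log
for f in conn.*.log.gz; do
    printf '%s %s - %s lines\n' "$(du -h "$f" | cut -f1)" "$f" "$(zcat "$f" | wc -l)"
done
```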
By the way, just for your information: RITA does not run on the same system as the Bro sensor that collects the data. It's offloaded to this system solely for RITA processing, so it's not busy with anything else.
2.0.0_config.txt 2.0.0_log.txt 2.0.0_terminal_output.txt 3.0.1_config.txt 3.0.1_log.txt 3.0.1_terminal_output.txt
One of the biggest differences between v2 and v3 is the workflow: in v2 it's `import` then `analyze`, but in v3 these two steps are combined into `import` (there is no separate `analyze` in v3). Did your v2 `analyze` portion take about 10 hours to complete?
If not, then let's dig further. To compare apples to apples, could you run `rita analyze` on your dataset with v2 and send similar output and log files?
I just ran analyze using 2.0.0, it took about 4 minutes on the same dataset as above. I have attached the logs for this as well.
I ran a brief test on the conn logs for 12-13 today from a different sensor using both versions.
So basically the same. But I noticed something: the real (wall-clock) time is 7 + 2.5 minutes for 2.0.0 versus 9.5 minutes for 3.0.1, while the CPU time spent in user mode is 4.5 + 0.5 minutes for 2.0.0 versus 70 minutes for 3.0.1. So the time the CPU spends actually processing is much, much higher. I can see the same behaviour in my previously posted terminal output.
Version 2.0.0:
real 7m9.285s
user 4m33.894s
sys 0m15.795s
(import)
real 2m29.798s
user 0m34.511s
sys 0m12.626s
(analyze)
Version 3.0.1:
real 9m33.772s
user 69m55.398s
sys 3m41.507s
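One way to read those numbers: when user time exceeds real time, the work was spread across cores. A quick back-of-the-envelope check on the v3 figures above:

```shell
# 69m55s of user time packed into 9m33s of wall time ~= 7 cores busy on average
awk 'BEGIN { printf "%.1f\n", (69*60 + 55) / (9*60 + 33) }'   # prints 7.3
```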
I'll try and pull a 24 hour set from the same sensor and run overnight.
Edit:
Okay, so I ran RITA on a 17 GB dataset from another machine in my grid. (Bro version is 2.6.1 on all my machines, by the way.)
2.0.0:
real 117m27.574s
user 87m59.257s
sys 2m58.354s
(import)
real 47m9.547s
user 6m15.396s
sys 1m42.877s
(analyze)
3.0.1:
real 451m44.224s
user 1456m45.194s
sys 36m38.280s
@jsdjjsd I have some things for you to try.
Could you try importing the dataset in a series of chunks using the `--rolling` option? My only theory right now is that loading the dataset is using up too much memory and causing swapping in v3. This seems unlikely given that your system has a huge amount of RAM, but it would be good to rule this out. I think something like this should do it:
for chunk in {1..12}; do time rita import --rolling --numchunks 12 --chunk $chunk /opt/rita/x/2019-04-23/ 2019-04-23-rolling; done
Another thing to try might be to set the threads manually to 1 and try against a single hour of conn logs. This might help determine if it's a threading issue with v3.
time rita-v2 import --threads 1 /opt/rita/x/2019-04-23/ 2019-04-23-v2-1thread
# rita analyze doesn't have the threads option so we can't check that
time rita-v3 import --threads 1 /opt/rita/x/2019-04-23/ 2019-04-23-v3-1thread
If you launch a mongo shell, does it show you some warnings? Specifically, something about numa or numactl? Here's an example of what it can look like from my machine, though my warning is about XFS.
$ mongo
MongoDB shell version v3.6.12
connecting to: mongodb://127.0.0.1:27017/?gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("cc1f8955-fadd-4c02-8e41-c96a83795f13") }
MongoDB server version: 3.6.11
Server has startup warnings:
2019-04-04T17:14:32.177-0500 I STORAGE [initandlisten]
2019-04-04T17:14:32.177-0500 I STORAGE [initandlisten] ** WARNING: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine
2019-04-04T17:14:32.177-0500 I STORAGE [initandlisten] ** See http://dochub.mongodb.org/core/prodnotes-filesystem
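The same startup warnings also land in the mongod log file, so they can be checked without opening a shell. A sketch, assuming the default log path used by most Linux package installs (override `MONGOD_LOG` if yours differs):

```shell
# scan the mongod log for startup warnings such as the NUMA/XFS notices
MONGOD_LOG=${MONGOD_LOG:-/var/log/mongodb/mongod.log}
grep -i 'WARNING' "$MONGOD_LOG"
```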
On the off chance this is a possibility, I have to ask: would you be willing to share one of your datasets via some private channel?
If the problem does turn out to be in the parsing of the files: I'm currently working on revamping that section of the code. It would be really useful to track this issue down so we can make sure we avoid it, but even if we don't identify it, there's a chance the new design won't have the same problem.
@ethack Thanks! I have now done some testing based on your feedback.
Running with the `--rolling` option as you suggested actually seems to have made things better. v3 is still not as fast as v2, but it's an improvement for sure; I ran it just now using v3 and it completed in 58 minutes. I played around a little with various numbers of chunks. I was running with swap, so I also tried turning swap off; no effect.
--numchunks 6
real 95m10.300s
user 294m37.415s
sys 11m37.315s
--numchunks 12
real 58m32.465s
user 240m28.752s
sys 11m33.235s
--numchunks 24
real 68m40.561s
user 225m32.326s
sys 11m54.963s
I tried running v2 and v3 using `--threads 1`. First I ran it on the entire 24-hour set, and then a couple of times on a single hour. I chose one hour from the middle of the night and one from the afternoon.
Version 2, entire 24 hour set
real 49m49.066s
user 62m18.732s
sys 1m16.280s
Version 3, entire 24 hour set
real 575m34.833s
user 748m6.930s
sys 11m48.912s
Version 2, one hour of conn logs, 01-02
real 2m10.717s
user 2m55.278s
sys 0m4.032s
Version 3, one hour of conn logs, 01-02
real 3m2.252s
user 9m27.295s
sys 0m30.921s
Version 2, one hour of conn logs, 15-16
real 2m14.547s
user 3m4.695s
sys 0m4.092s
Version 3, one hour of conn logs, 15-16
real 2m46.708s
user 8m46.388s
sys 0m27.445s
MongoDB is configured in Systemd to use numactl:
ExecStart=/usr/bin/numactl --interleave=all -- /usr/bin/mongod --config /etc/mongod.conf
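Worth noting that `--interleave=all` only matters if the box actually has more than one NUMA node; a quick way to check on Linux is to count the nodes in sysfs:

```shell
# count NUMA nodes; on a single-node machine, interleaving is effectively a no-op
ls /sys/devices/system/node/ | grep -c '^node'
```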
I have now also tried running without `numactl`; no difference. Apart from an access control warning, there's nothing else when I start a Mongo shell. I am, by the way, running MongoDB with default settings, but I have tried both with and without journaling; no difference.
About sharing the Bro logs: that will probably be tricky. But maybe I can get a set of Bro logs from a publicly available PCAP archive and see if I can reproduce the problem, so that we can look at the same capture data.
I am experiencing a huge increase in processing time since upgrading from 2.0.0 to 3.0.1. I'm not quite certain what the cause may be here, and I would be grateful for any input you may have. I have debugging enabled but cannot see any errors that might shed light on the problem.
The dataset I'm testing on is 20 GB of conn logs covering a 24-hour period.
2.0.0 runs parsing and analysis in roughly 15 minutes, whereas 3.0.1 runs the same set in almost 10 hours. From what I can tell, it's the parsing that is the culprit, not the analysis.
The system it runs on has decent resources: 40 cores and 192 GB RAM. The RITA configuration settings are the same for both versions; I usually set ImportBuffer to 100000.
Any idea what may be causing this increase in processing time?