activecm / rita-legacy

Real Intelligence Threat Analytics (RITA) is a framework for detecting command and control communication through network traffic analysis.

Performance issues after upgrading to 3.x.x #456

Open jsdjjsd opened 5 years ago

jsdjjsd commented 5 years ago

I am experiencing a huge increase in processing time since upgrading from 2.0.0 to 3.0.1. I'm not certain what the cause may be, so I would be grateful for any input you may have. I have debugging enabled but cannot see any errors that might shed light on the problem.

The dataset I'm testing on is 20 GB of conn-logs for a 24 hour period.

2.0.0 runs parsing and analysis in roughly 15 minutes, whereas 3.0.1 takes almost 10 hours on the same set. From what I can tell, it's the parsing that is the culprit, not the analysis.

The system it runs on has decent resources: 40 cores and 192 GB of RAM. The RITA configuration settings are the same for both versions; I usually set ImportBuffer to 100000.
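
For what it's worth, the import buffer is the only value I change from the defaults. In my config that's roughly the following excerpt (other keys omitted; the surrounding section layout may well differ between 2.x and 3.x):

Bro:
    ImportBuffer: 100000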

Any idea what may be causing this increase in processing time?

ethack commented 5 years ago

@jsdjjsd Can you share the RITA log file for each of the times you've run it? It should be in /var/lib/rita/logs, named with the date you ran the command, and it includes timing for individual sections, which may help us narrow down where the issue is. Could you also share the config file you are using?

jsdjjsd commented 5 years ago

@ethack Sure, thanks! I have attached the terminal output, configuration, and log files from 2.0.0 and 3.0.1.

I just ran 3.0.1 overnight and it took 620 minutes to finish; 2.0.0 then took 11 minutes in the morning. In general it seems to be the parsing that takes much longer. These are the files I am testing on:

842M conn.00:00:00-01:00:00.log.gz - 21624409 lines
912M conn.01:00:00-02:00:00.log.gz - 23750409 lines
864M conn.02:00:00-03:00:00.log.gz - 22044009 lines
845M conn.03:00:00-04:00:00.log.gz - 21523325 lines
891M conn.04:00:00-05:00:00.log.gz - 22976009 lines
874M conn.05:00:00-06:00:00.log.gz - 22396009 lines
835M conn.06:00:00-07:00:00.log.gz - 20571609 lines
891M conn.07:00:00-08:00:00.log.gz - 21698409 lines
868M conn.08:00:00-09:00:00.log.gz - 20810009 lines
879M conn.09:00:00-10:00:00.log.gz - 21028009 lines
984M conn.10:00:00-11:00:00.log.gz - 24281209 lines
876M conn.11:00:00-12:00:00.log.gz - 21124009 lines
296M conn.12:00:00-13:00:00.log.gz - 7219415 lines
962M conn.12:00:00-13:07:18.log.gz - 23849609 lines
 37M conn.13:00:00-13:02:22.log.gz - 908009 lines
830M conn.13:07:38-14:00:00.log.gz - 20386767 lines
323M conn.14:00:00-14:20:40.log.gz - 7803445 lines
610M conn.14:21:05-15:00:00.log.gz - 15139545 lines
940M conn.15:00:00-16:00:00.log.gz - 23654009 lines
935M conn.16:00:00-17:00:00.log.gz - 23717609 lines
906M conn.17:00:00-18:00:00.log.gz - 23166409 lines
907M conn.18:00:00-19:00:00.log.gz - 22991609 lines
831M conn.19:00:00-20:00:00.log.gz - 20864009 lines
820M conn.20:00:00-21:00:00.log.gz - 20502809 lines
810M conn.21:00:00-22:00:00.log.gz - 20387209 lines
855M conn.22:00:00-23:00:00.log.gz - 21792809 lines
895M conn.23:00:00-00:00:00.log.gz - 22909209 lines
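
In case it's useful, the sizes and line counts above can be regenerated with something along these lines, assuming the gzipped conn logs sit in the current directory:

for f in conn.*.log.gz; do
    echo "$(du -h "$f" | cut -f1) $f - $(zcat "$f" | wc -l) lines"
done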

By the way, just for your information: RITA does not run on the same system as the Bro sensor that collects the data. The logs are offloaded to this system solely for RITA processing, so it isn't busy with anything else.

2.0.0_config.txt 2.0.0_log.txt 2.0.0_terminal_output.txt 3.0.1_config.txt 3.0.1_log.txt 3.0.1_terminal_output.txt

ethack commented 5 years ago

One of the biggest differences between v2 and v3 is that in v2 the workflow is import then analyze, while in v3 these two steps are combined into import (there is no separate analyze step in v3). Did your v2 analyze step take about 10 hours to complete?

If not, then let's dig further. To compare apples to apples, could you run rita analyze on your dataset with v2 and send similar output and log files?
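
For reference, the apples-to-apples comparison I have in mind looks roughly like this (dataset path and names are illustrative):

# v2: two separate steps
time rita import /opt/rita/x/2019-04-23/ 2019-04-23-v2
time rita analyze      # v2 only; analyzes the newly imported dataset

# v3: import and analysis in a single step
time rita import /opt/rita/x/2019-04-23/ 2019-04-23-v3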

jsdjjsd commented 5 years ago

I just ran analyze using 2.0.0, it took about 4 minutes on the same dataset as above. I have attached the logs for this as well.

2.0.0_analyze_log.txt 2.0.0_analyze_terminal_output.txt

jsdjjsd commented 5 years ago

I ran a brief test on the conn logs for 12-13 today from a different sensor using both versions.

The overall wall-clock time was basically the same. But I noticed something: the real clock time is 7 + 2.5 minutes for 2.0.0 and 9.5 minutes for 3.0.1, yet the CPU time spent in user mode is 4.5 + 0.5 minutes for 2.0.0 versus 70 minutes for 3.0.1. So the time the CPUs spend actually processing is much, much higher in 3.0.1 (roughly 70 CPU-minutes of work packed into 9.5 minutes of wall time, i.e. about 7 cores busy on average, against about 5 CPU-minutes in total for 2.0.0). I can see the same behaviour in my previously posted terminal output.

Version 2.0.0 (import):

real    7m9.285s
user    4m33.894s
sys     0m15.795s

Version 2.0.0 (analyze):

real    2m29.798s
user    0m34.511s
sys     0m12.626s

Version 3.0.1:

real    9m33.772s
user    69m55.398s
sys     3m41.507s

I'll try and pull a 24 hour set from the same sensor and run overnight.

(edit)

Okay, so I ran RITA on a 17 GB data set from another machine in my grid. Bro version is 2.6.1 on all my machines btw.

2.0.0 (import):

real    117m27.574s
user    87m59.257s
sys     2m58.354s

2.0.0 (analyze):

real    47m9.547s
user    6m15.396s
sys     1m42.877s

3.0.1:

real    451m44.224s
user    1456m45.194s
sys     36m38.280s

ethack commented 5 years ago

@jsdjjsd I have some things for you to try.

Could you try importing the dataset in a series of chunks using the --rolling option? My only theory right now is that loading the dataset uses up too much memory and causes swapping in v3. That seems unlikely given that your system has a huge amount of RAM, but it would be good to rule it out. I think something like this should do it:

for chunk in {1..12}; do time rita import --rolling --numchunks 12 --chunk $chunk /opt/rita/x/2019-04-23/ 2019-04-23-rolling; done

Another thing to try might be to set the thread count manually to 1 and run against a single hour of conn logs. This might help determine whether it's a threading issue with v3.

time rita-v2 import --threads 1 /opt/rita/x/2019-04-23/ 2019-04-23-v2-1thread
# rita analyze doesn't have the threads option so we can't check that
time rita-v3 import --threads 1 /opt/rita/x/2019-04-23/ 2019-04-23-v3-1thread

If you launch a mongo shell, does it show any warnings? Specifically, anything about NUMA or numactl? Here's an example of what it can look like on my machine, though my warning is about XFS.

$ mongo
MongoDB shell version v3.6.12
connecting to: mongodb://127.0.0.1:27017/?gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("cc1f8955-fadd-4c02-8e41-c96a83795f13") }
MongoDB server version: 3.6.11
Server has startup warnings: 
2019-04-04T17:14:32.177-0500 I STORAGE  [initandlisten] 
2019-04-04T17:14:32.177-0500 I STORAGE  [initandlisten] ** WARNING: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine
2019-04-04T17:14:32.177-0500 I STORAGE  [initandlisten] **          See http://dochub.mongodb.org/core/prodnotes-filesystem
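
The same startup warnings usually also end up in the mongod log, so a quick way to check without an interactive shell would be something like this (the log path may differ on your system):

grep -i warning /var/log/mongodb/mongod.log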

On the off chance this is a possibility, I have to ask: would you be willing to share one of your datasets via some private channel?

If the problem does turn out to be in the parsing of the files: I'm currently working on revamping that section of the code. It would be really useful to track down this issue so we can make sure we avoid it, but even if we don't identify it, there's a chance the new design won't have the same problem.

jsdjjsd commented 5 years ago

@ethack Thanks! I have now done some testing based on your feedback.

1

Running with the --rolling option as you suggested actually seems to have made things better. v3 is still not as fast as v2, but it's definitely an improvement. I ran it just now using v3 and it completed in 58 minutes. I also played around a little with various numbers of chunks; that loop (with the chunk count varied) is sketched after the timings below. I was running with swap enabled, so I also tried turning swap off, with no effect.

--numchunks 6

real    95m10.300s
user    294m37.415s
sys     11m37.315s

--numchunks 12

real    58m32.465s
user    240m28.752s
sys     11m33.235s

--numchunks 24

real    68m40.561s
user    225m32.326s
sys     11m54.963s
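
(Sketch of that loop; the per-run dataset names are illustrative:)

N=12   # also ran with 6 and 24
for chunk in $(seq 1 "$N"); do
    time rita import --rolling --numchunks "$N" --chunk "$chunk" /opt/rita/x/2019-04-23/ 2019-04-23-rolling-"$N"
done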

2

I tried running v2 and v3 using --threads 1. First I ran it on the entire 24 hour set, and then a couple of times on a single hour. I chose one hour from the middle of the night, and one from the afternoon.

Version 2, entire 24 hour set

real    49m49.066s
user    62m18.732s
sys     1m16.280s

Version 3, entire 24 hour set

real    575m34.833s
user    748m6.930s
sys     11m48.912s

Version 2, one hour of conn logs, 01-02

real    2m10.717s
user    2m55.278s
sys     0m4.032s

Version 3, one hour of conn logs, 01-02

real    3m2.252s
user    9m27.295s
sys     0m30.921s

Version 2, one hour of conn logs, 15-16

real    2m14.547s
user    3m4.695s
sys     0m4.092s

Version 3, one hour of conn logs, 15-16

real    2m46.708s
user    8m46.388s
sys     0m27.445s

3

MongoDB is configured in systemd to use numactl:

ExecStart=/usr/bin/numactl --interleave=all -- /usr/bin/mongod --config /etc/mongod.conf

I have now also tried running without numactl; it made no difference. Apart from an access control warning, there's nothing else when I start a Mongo shell. I am, by the way, running MongoDB with default settings, but I have tried both with and without journaling, again with no difference.
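
(For reference, one way to drop the numactl wrapper is a systemd drop-in along these lines; the exact unit name and mongod paths may differ:)

# systemctl edit mongod, then add:
[Service]
ExecStart=
ExecStart=/usr/bin/mongod --config /etc/mongod.conf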

4

Sharing the Bro logs will probably be tricky. But maybe I can get a set of Bro logs from a publicly available PCAP archive and see if I can reproduce the problem, so that we can look at the same capture data.
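
If I get that far, the idea would be something along these lines (file and dataset names are illustrative; Bro 2.6.x):

bro -r public-sample.pcap      # writes conn.log and friends into the current directory
time rita-v2 import ./ pcap-test-v2
time rita-v3 import ./ pcap-test-v3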