activecm / rita-legacy

Real Intelligence Threat Analytics (RITA) is a framework for detecting command and control communication through network traffic analysis.
GNU General Public License v3.0

Improve useragent aggregation runtime for datasets with many useragents #785

Closed Zalgo2462 closed 1 year ago

Zalgo2462 commented 1 year ago

If useragents (JA3 hashes and HTTP useragents combined) outnumber internal hosts, the existing useragent aggregation implementation runs slowly.

This PR rewrites the useragent aggregation to:

The previous implementation would:

The main benefits over the old implementation are:

I am currently running performance tests for this PR and will write back with numbers from before and after the patch.

Performance testing

I ran the old version and the new version through a one-off import of a dataset with ~3000 hosts and roughly 1 million user agents.

The old version did not finish within 8 hours.

The new version finished in a matter of seconds.

I ran the blocking profiler for a maximum of 30 seconds on a 16-core system while running the user agent aggregation on this dataset. With the old version, the profile stopped at the 30-second limit, before the aggregation had finished.

Click the following svg to see the profiling results from the old version. block-old

With the new version, the profile stopped after ~6 seconds because the user agent aggregation finished before the 30-second maximum was reached.

Click the following svg to see the profiling results from the new version. block-new

While these results look really good for one-off imports, I'd like to find a way to test rolling imports with large amounts of user agent data. The code path for updating existing rare signature entries is likely slower than the one which creates new entries in the host collection.

Zalgo2462 commented 1 year ago

The number of rsig entries produced by this version of the code differs from the number produced by the previous version for a dataset I'm working with. I'm not sure what is causing the difference yet.

Zalgo2462 commented 1 year ago

Part of the difference appears to be that the old version would create rsig entries for external hosts.

Running db.host.find({"dat.rsig": {"$exists": true}, "local": false}, {"ip": 1}) returns results for datasets processed with the old version, but it doesn't return results for datasets processed with the new version.

Also, looking at the output of the old version, I am seeing bugged documents in the host collection like:

{ "_id" : ObjectId("63dcac0c06de9b142f120c44"), "ip" : "[censored external IP address]", "network_uuid" : UUID("ffffffff-ffff-ffff-ffff-ffffffffffff"), "dat" : [ { "cid" : 0, "rsig" : "1320c641506f8e23a43933d6d7e0cc4d", "rsigc" : 1 } ] }

These bugged entries seem to be related to external -> external connections, which were filtered out of the connections map returned by the parser. It looks like they may not have been filtered out of the useragent map when reading from the ssl log.