The previous version and this version of the code produce different numbers of `rsig` entries for a dataset I'm working with. I'm not sure what is causing the difference yet.
Part of the difference appears to be that the old version would create `rsig` entries for external hosts.
Running `db.host.find({"dat.rsig": {"$exists": true}, "local": false}, {"ip": 1})` returns results for datasets processed with the old version, but not for datasets processed with the new version.
Also, looking at the output of the old version, I am seeing bugged documents in the host collection like:

```js
{
    "_id" : ObjectId("63dcac0c06de9b142f120c44"),
    "ip" : "[censored external IP address]",
    "network_uuid" : UUID("ffffffff-ffff-ffff-ffff-ffffffffffff"),
    "dat" : [ { "cid" : 0, "rsig" : "1320c641506f8e23a43933d6d7e0cc4d", "rsigc" : 1 } ]
}
```
These bugged entries seem to be related to external -> external connections, which were filtered out of the connections map returned by the parser. It looks like they may not have been filtered out of the useragent map when reading from the ssl log.
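To make the suspected fix concrete, here is a minimal sketch of applying the same external -> external filter to the useragent map. This is not the project's actual parser code; every name in it (`sslEntry`, `addToUseragentMap`, the field names) is hypothetical.

```go
// Hypothetical sketch: apply the same external -> external filter to the
// useragent map that the connections map already gets. Names are illustrative.
package parser

// sslEntry is a stand-in for a parsed ssl log record.
type sslEntry struct {
	SrcLocal bool   // whether the originating IP is on a local network
	DstLocal bool   // whether the responding IP is on a local network
	JA3      string // JA3 hash recorded for the connection
	SrcIP    string // originating IP address
}

// addToUseragentMap records the originating IP under the entry's JA3 hash,
// skipping external -> external traffic so it cannot create bugged host docs.
func addToUseragentMap(entry sslEntry, useragentMap map[string][]string) {
	if !entry.SrcLocal && !entry.DstLocal {
		return // mirror the filter applied to the connections map
	}
	useragentMap[entry.JA3] = append(useragentMap[entry.JA3], entry.SrcIP)
}
```

The point is only that the guard already used for the connections map would also run before any ssl-derived useragent entry is recorded.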
If the number of useragents (including JA3 hashes and HTTP useragents) exceeds the number of internal hosts, the existing useragent aggregation implementation runs slowly.
This PR rewrites the useragent aggregation to:

- query the `useragent` collection for the useragents associated with the host in this import session which are associated with less than 5 originating IP addresses over the past 24 hours
- query the `host` collection for the existing rare signature entries
- run a bulk update against the `host` collection responsible for updating each existing rare signature
- run a bulk update against the `host` collection which pushes in each new rare signature (a sketch of this batched flow follows below)

The previous implementation would:

- query the `useragent` collection
- query the `host` collection
- update the `host` collection

The main benefits over the old implementation are:

- no longer reading each `host` record when determining whether each rare signature already exists or not
- batching the writes to the `host` collection

I am currently running performance tests for this PR and will write back with numbers from before and after the patch.
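For illustration, here is a rough sketch of the batched flow described above, assuming an mgo-style MongoDB driver. The `ip`, `dat.rsig`, `rsigc`, and `cid` field names come from the host document shown earlier in this thread; the helper names, the `rareSignature` struct, and the useragent selector fields are hypothetical.

```go
// Rough sketch of the batched rare-signature flow, assuming an mgo-style
// driver. Helper and struct names are hypothetical; ip, dat.rsig, rsigc, and
// cid come from the host document shown earlier in this thread.
package useragent

import (
	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// rareSignature pairs a host IP with a rare signature such as a JA3 hash.
type rareSignature struct {
	IP        string
	Signature string
}

// rareUseragentSelector selects useragents seen from fewer than 5 originating
// IPs over the past 24 hours. These field names are illustrative only.
func rareUseragentSelector(cutoff int64) bson.M {
	return bson.M{
		"orig_ips_count": bson.M{"$lt": 5},       // fewer than 5 originating IPs
		"last_seen":      bson.M{"$gte": cutoff}, // seen within the past 24 hours
	}
}

// upsertRareSignatures issues the two bulk updates described above: one which
// refreshes hosts that already have a dat.rsig entry for the signature, and
// one which pushes a new dat entry onto hosts that do not.
func upsertRareSignatures(db *mgo.Database, existing, created []rareSignature, cid int) error {
	bulk := db.C("host").Bulk()
	bulk.Unordered()
	for _, sig := range existing {
		// update the matching rare signature subdocument in place
		bulk.Update(
			bson.M{"ip": sig.IP, "dat.rsig": sig.Signature},
			bson.M{"$set": bson.M{"dat.$.cid": cid}},
		)
	}
	for _, sig := range created {
		// push a new rare signature subdocument onto the host's dat array
		bulk.Update(
			bson.M{"ip": sig.IP},
			bson.M{"$push": bson.M{"dat": bson.M{"rsig": sig.Signature, "rsigc": 1, "cid": cid}}},
		)
	}
	_, err := bulk.Run()
	return err
}
```

Pushing all of the writes through one unordered bulk call is what removes the per-useragent round trips to the `host` collection.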
### Performance testing
I ran the old version and the new version through a one-off import of a dataset with ~3000 hosts and roughly 1 million user agents.
The old version did not finish within 8 hours.
The new version finished in a matter of seconds.
I ran the blocking profiler for a maximum of 30 seconds on a system with 16 cores while running the user agent aggregation on this dataset. When profiling the old version, the profile stopped at the 30 second mark.
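The exact profiling setup isn't shown here; for anyone reproducing the measurement, a minimal Go sketch for capturing a block profile might look like the following. The wiring and file name are assumptions, not the code used for these numbers.

```go
// Minimal sketch of capturing a Go block ("blocking") profile around the
// aggregation. The wiring here is an assumption, not the actual test harness.
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// A rate of 1 records every blocking event; higher values sample.
	runtime.SetBlockProfileRate(1)

	runAggregation() // stand-in for the user agent aggregation under test

	// Dump whatever accumulated; if the workload finishes before an external
	// 30 second cap, the profile simply covers the shorter run.
	f, err := os.Create("block.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := pprof.Lookup("block").WriteTo(f, 0); err != nil {
		panic(err)
	}
}

func runAggregation() { /* workload under test */ }
```

The resulting file can then be rendered for viewing, e.g. with `go tool pprof -svg block.pprof`.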
Click the following SVG to see the profiling results from the old version.

When profiling the new version, the profile stopped after ~6 seconds because the user agent aggregation finished before the 30 second max was hit.

Click the following SVG to see the profiling results from the new version.
While these results look really good for one-off imports, I'd like to find a way to test rolling imports with large amounts of user agent data. The code path for updating existing rare signature entries is likely slower than the one which creates new entries in the host collection.