The previous version and this version of the code produce different numbers of `rsig` entries for a dataset I'm working with. I'm not sure what is causing the difference yet.
Part of the difference appears to be that the old version would create `rsig` entries for external hosts.
Running `db.host.find({"dat.rsig": {"$exists": true}, "local": false}, {"ip": 1})` returns results for datasets processed with the old version, but not for datasets processed with the new version.
Also, looking at the output of the old version, I am seeing bugged documents in the host collection like:

```js
{
    "_id" : ObjectId("63dcac0c06de9b142f120c44"),
    "ip" : "[censored external IP address]",
    "network_uuid" : UUID("ffffffff-ffff-ffff-ffff-ffffffffffff"),
    "dat" : [ { "cid" : 0, "rsig" : "1320c641506f8e23a43933d6d7e0cc4d", "rsigc" : 1 } ]
}
```
These bugged entries seem to be related to external -> external connections, which were filtered out of the connections map returned by the parser. It looks like they may not have been filtered out of the useragent map when reading from the ssl log.
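To make the suspected fix concrete, here is a minimal sketch of applying the same external -> external filter to the useragent map. This is not the project's actual parser code; every name in it (`sslEntry`, `addToUseragentMap`, the field names) is hypothetical.

```go
// Hypothetical sketch: apply the same external -> external filter to the
// useragent map that the connections map already gets. Names are illustrative.
package parser

// sslEntry is a stand-in for a parsed ssl log record.
type sslEntry struct {
	SrcLocal bool   // whether the originating IP is on a local network
	DstLocal bool   // whether the responding IP is on a local network
	JA3      string // JA3 hash recorded for the connection
	SrcIP    string // originating IP address
}

// addToUseragentMap records the originating IP under the entry's JA3 hash,
// skipping external -> external traffic so it cannot create bugged host docs.
func addToUseragentMap(entry sslEntry, useragentMap map[string][]string) {
	if !entry.SrcLocal && !entry.DstLocal {
		return // mirror the filter applied to the connections map
	}
	useragentMap[entry.JA3] = append(useragentMap[entry.JA3], entry.SrcIP)
}
```

The point is only that the guard already used for the connections map would also run before any ssl-derived useragent entry is recorded.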
If the number of useragents (including JA3 hashes and HTTP useragents) exceeds the number of internal hosts, the existing useragent aggregation implementation runs slowly.
This PR rewrites the useragent aggregation to:

- query the `useragent` collection for the useragents associated with the host in this import session which are associated with less than 5 originating IP addresses over the past 24 hours
- query the `host` collection for the existing rare signature entries
- run a bulk update against the `host` collection responsible for updating each existing rare signature
- run a bulk update against the `host` collection which pushes in each new rare signature (a sketch of this batched flow follows below)

The previous implementation would:

- query the `useragent` collection
- query the `host` collection
- update the `host` collection

The main benefits over the old implementation are:

- no longer reading each `host` record when determining whether each rare signature already exists or not
- batching the writes to the `host` collection

I am currently running performance tests for this PR and will write back with numbers from before and after the patch.
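For illustration, here is a rough sketch of the batched flow described above, assuming an mgo-style MongoDB driver. The `ip`, `dat.rsig`, `rsigc`, and `cid` field names come from the host document shown earlier in this thread; the helper names, the `rareSignature` struct, and the useragent selector fields are hypothetical.

```go
// Rough sketch of the batched rare-signature flow, assuming an mgo-style
// driver. Helper and struct names are hypothetical; ip, dat.rsig, rsigc, and
// cid come from the host document shown earlier in this thread.
package useragent

import (
	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// rareSignature pairs a host IP with a rare signature such as a JA3 hash.
type rareSignature struct {
	IP        string
	Signature string
}

// rareUseragentSelector selects useragents seen from fewer than 5 originating
// IPs over the past 24 hours. These field names are illustrative only.
func rareUseragentSelector(cutoff int64) bson.M {
	return bson.M{
		"orig_ips_count": bson.M{"$lt": 5},       // fewer than 5 originating IPs
		"last_seen":      bson.M{"$gte": cutoff}, // seen within the past 24 hours
	}
}

// upsertRareSignatures issues the two bulk updates described above: one which
// refreshes hosts that already have a dat.rsig entry for the signature, and
// one which pushes a new dat entry onto hosts that do not.
func upsertRareSignatures(db *mgo.Database, existing, created []rareSignature, cid int) error {
	bulk := db.C("host").Bulk()
	bulk.Unordered()
	for _, sig := range existing {
		// update the matching rare signature subdocument in place
		bulk.Update(
			bson.M{"ip": sig.IP, "dat.rsig": sig.Signature},
			bson.M{"$set": bson.M{"dat.$.cid": cid}},
		)
	}
	for _, sig := range created {
		// push a new rare signature subdocument onto the host's dat array
		bulk.Update(
			bson.M{"ip": sig.IP},
			bson.M{"$push": bson.M{"dat": bson.M{"rsig": sig.Signature, "rsigc": 1, "cid": cid}}},
		)
	}
	_, err := bulk.Run()
	return err
}
```

Pushing all of the writes through one unordered bulk call is what removes the per-useragent round trips to the `host` collection.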
### Performance testing
I ran the old version and the new version through a one-off import of a dataset with ~3000 hosts and roughly 1 million user agents.
The old version did not finish within 8 hours.
The new version finished in a matter of seconds.
I ran the blocking profiler for a maximum of 30 seconds on a system with 16 cores while running the user agent aggregation on this dataset. When profiling the old version, the profile stopped at the 30 second mark.
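The exact profiling setup isn't shown here; for anyone reproducing the measurement, a minimal Go sketch for capturing a block profile might look like the following. The wiring and file name are assumptions, not the code used for these numbers.

```go
// Minimal sketch of capturing a Go block ("blocking") profile around the
// aggregation. The wiring here is an assumption, not the actual test harness.
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// A rate of 1 records every blocking event; higher values sample.
	runtime.SetBlockProfileRate(1)

	runAggregation() // stand-in for the user agent aggregation under test

	// Dump whatever accumulated; if the workload finishes before an external
	// 30 second cap, the profile simply covers the shorter run.
	f, err := os.Create("block.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := pprof.Lookup("block").WriteTo(f, 0); err != nil {
		panic(err)
	}
}

func runAggregation() { /* workload under test */ }
```

The resulting file can then be rendered for viewing, e.g. with `go tool pprof -svg block.pprof`.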
Click the following SVG to see the profiling results from the old version.

When profiling the new version, the profile stopped after ~6 seconds because the user agent aggregation finished before the 30 second max was hit.

Click the following SVG to see the profiling results from the new version.
While these results look really good for one-off imports, I'd like to find a way to test rolling imports with large amounts of user agent data. The code path for updating existing rare signature entries is likely slower than the one which creates new entries in the host collection.