activecm / rita-legacy

Real Intelligence Threat Analytics (RITA) is a framework for detecting command and control communication through network traffic analysis.
GNU General Public License v3.0

Update to bimodal portion of the histogram score #794

Closed by lisaSW 1 year ago

lisaSW commented 1 year ago

Problem

The histogram portion of the beacon score judges the shape of the hourly connection count distribution numerically rather than leaving it to visual inspection. Currently it detects a single flat section (a uniform distribution) by computing the coefficient of variation of the connection frequency table, and it detects two flat sections by identifying bimodal cases in the histogram of the frequency table counts. Because this method has no tolerance for jitter, some valid beacons have been scored lower than they should have been.
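As a rough illustration of the coefficient-of-variation check, here is a minimal Python sketch. The helper name `cv_score`, the `1 - CV` mapping, and the clamping to [0, 1] are assumptions made for this example, not RITA's actual code.

```python
from statistics import mean, pstdev

def cv_score(hourly_counts):
    """Score flatness of hourly connection counts as 1 - CV, clamped to [0, 1].

    Hypothetical helper: the mapping from CV to a score is an assumption.
    Expects a non-empty list of counts, one per hour.
    """
    mu = mean(hourly_counts)
    if mu == 0:
        return 0.0  # no connections at all, nothing to score
    cv = pstdev(hourly_counts) / mu  # population std dev divided by mean
    return min(1.0, max(0.0, 1.0 - cv))

flat = [10] * 24              # same count every hour
gappy = [10] * 12 + [0] * 12  # same count, but half of the hours are empty

flat_score = cv_score(flat)    # CV is 0, so the score is at its maximum
gappy_score = cv_score(gappy)  # empty hours inflate the CV and sink the score
```

The second dataset shows why gaps are a problem for this subscore: the beacon is perfectly regular whenever it is active, yet the empty hours alone push the CV to 1.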

This PR contains the following update to address this issue:

Instead of looking at single modes, we allow for some jitter by grouping nearby counts into buckets. Where before we were looking for 4 or fewer modes, we now measure how much of our data is captured in the largest buckets instead of just counting modes.
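The bucketing idea can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `bucket_coverage`, the bucket width (5% of the largest hourly count), and the choice to ignore empty hours are all hypothetical and are not taken from the PR's code.

```python
import math
from collections import Counter

def bucket_coverage(hourly_counts, bucket_frac=0.05, top_n=2):
    """Fraction of active hours that fall into the top_n fullest buckets.

    Assumptions for this sketch: empty hours are ignored, and the bucket
    width is bucket_frac of the largest hourly count (at least 1).
    """
    active = [c for c in hourly_counts if c > 0]
    if not active:
        return 0.0
    width = max(1, math.ceil(max(active) * bucket_frac))
    buckets = Counter(c // width for c in active)  # bucket index -> hours in it
    covered = sum(n for _, n in buckets.most_common(top_n))
    return covered / len(active)

# One underlying rate (about 100 connections per hour) with jitter and two gaps.
jittery = [100, 101, 99, 102, 98, 100, 101, 99, 0, 0, 100, 102]

distinct_values = len({c for c in jittery if c > 0})  # what mode counting sees
score = bucket_coverage(jittery)                      # what bucketing sees
```

Counting exact values sees five separate modes in this data, while the two largest buckets together capture every active hour.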

Examples

Case 1: Uniform Distribution with Jitter and Gaps

In this case we have one mode, and bucketing the data should find it. Since we add the two largest buckets together, the second one is redundant here: it doesn't hurt, but it doesn't add anything. The cv score was low in this case because of the open gaps in the dataset, and the original multimodal score was low because of the jitter.


Case 2: Alternating Consistent Frequency

In this case the data is truly bimodal, and we identify it by combining the totals of the two largest buckets in the frequency count histogram. The cv score will always be low for cases like this. Since there is no jitter, the new score is identical to the old one, which shows that the update is a sufficient replacement for the original bimodal detection.
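To see why the coefficient of variation stays high for this shape, consider an idealized version of the case (an assumption for illustration) in which the hourly counts alternate evenly between two values $a$ and $b$:

$$
\mu = \frac{a+b}{2}, \qquad
\sigma = \sqrt{\tfrac{1}{2}(a-\mu)^2 + \tfrac{1}{2}(b-\mu)^2} = \frac{|a-b|}{2}, \qquad
\mathrm{CV} = \frac{\sigma}{\mu} = \frac{|a-b|}{a+b}
$$

For example, $a = 20$ and $b = 60$ give $\mathrm{CV} = 40/80 = 0.5$, while a single flat level gives $\mathrm{CV} = 0$. The CV depends only on how far apart the two levels are, not on how consistent each level is, so it cannot reward a clean alternating pattern.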


Case 3: Uniform Distribution with Jitter, Gaps, and Unfortunate Bucket Split

All of our data fits into one mode, but it is slightly noisy, and the bucket size we chose splits it across 2 buckets. Taking the top 2 buckets allows those two to be treated as one and scored appropriately. As with Case 1, the cv score was low because of the open gaps in the dataset, and the original multimodal score was low because of the jitter.
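A small sketch of the split, using made-up numbers: the helper `top_bucket_share` and the fixed bucket width of 3 are hypothetical, chosen so that a bucket boundary falls in the middle of the noise band.

```python
from collections import Counter

def top_bucket_share(hourly_counts, width, top_n):
    """Share of active hours captured by the top_n fullest buckets.

    Hypothetical helper with an explicit bucket width; empty hours
    (gaps) are ignored, which is an assumption of this sketch.
    """
    active = [c for c in hourly_counts if c > 0]
    buckets = Counter(c // width for c in active)
    covered = sum(n for _, n in buckets.most_common(top_n))
    return covered / len(active)

# One noisy mode (48 to 53 connections per hour) with two gaps.
# With width 3, counts 48-50 fall in bucket 16 and counts 51-53 in bucket 17.
counts = [48, 49, 0, 50, 51, 0, 52, 53, 49, 52]

one_bucket = top_bucket_share(counts, width=3, top_n=1)
two_buckets = top_bucket_share(counts, width=3, top_n=2)
```

With these numbers a single bucket captures only half of the active hours, while the top two buckets together capture all of them, which is the effect the second bucket is meant to provide.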


Case 4: Uniform Distribution with Outlier

In this case we have a uniform distribution and one large outlier. The coefficient of variation is affected by this outlier, so the cv score is lower than it should be. The bimodal detection handles this case well because it throws out one potential outlier. Outlier detection could be added to the cv score, but it may add unnecessary computation time since the new subscore already handles this case.
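The contrast between the two subscores can be sketched as below. Everything named here is an assumption for illustration: `cv_score` maps the CV to `1 - CV` clamped to [0, 1], and `bucket_score` models "throwing out one potential outlier" by shrinking the denominator by one hour. The PR's actual implementation may differ.

```python
import math
from collections import Counter
from statistics import mean, pstdev

def cv_score(counts):
    """1 - CV clamped to [0, 1] (assumed mapping, for illustration only)."""
    mu = mean(counts)
    return 0.0 if mu == 0 else min(1.0, max(0.0, 1.0 - pstdev(counts) / mu))

def bucket_score(counts, bucket_frac=0.05, top_n=2, outliers_allowed=1):
    """Share of active hours in the top_n buckets, forgiving a few outliers.

    Hypothetical: forgiveness is modeled by subtracting outliers_allowed
    from the number of active hours before dividing.
    """
    active = [c for c in counts if c > 0]
    if not active:
        return 0.0
    width = max(1, math.ceil(max(active) * bucket_frac))
    buckets = Counter(c // width for c in active)
    covered = sum(n for _, n in buckets.most_common(top_n))
    denom = max(len(active) - outliers_allowed, 1)
    return min(1.0, covered / denom)

uniform_with_outlier = [30] * 23 + [900]     # flat, plus one huge hour
bimodal_with_outlier = [20, 60] * 6 + [900]  # two levels, plus one huge hour

cv = cv_score(uniform_with_outlier)           # outlier dominates the CV
bucketed = bucket_score(uniform_with_outlier) # outlier sits in its own bucket
strict = bucket_score(bimodal_with_outlier, outliers_allowed=0)
forgiving = bucket_score(bimodal_with_outlier)
```

In the uniform dataset the outlier simply lands in its own small bucket, so the bucketed score is unaffected. The second dataset shows where the allowance matters in this sketch: when both top buckets are already taken by real modes, forgiving one hour keeps the score from dropping.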


Testing