cloudflare / mitmengine

A MITM (monster-in-the-middle) detection tool. Used to build MALCOLM:
https://malcolm.cloudflare.com
BSD 3-Clause "New" or "Revised" License
799 stars 68 forks source link

Update mitmengine to use new fingerprints #4

Closed gabbifish closed 5 years ago

gabbifish commented 5 years ago

EDIT: fixed the issue below :) feel free to proceed to the second comment.

Hi there! I've added an additional test to _TestProcessorCheck() that takes a new signature (recently observed in ClickHouse) and currently prints out the corresponding Report. I wrote this extra test to ensure that new fingerprints are given expected Reports. The test itself still fails because I haven't populated the Report struct used as a sanity check; this is because I'm not sure what Report output I expect to see. (TL;DR this PR is a WIP; by the end of it I hope we are confident about the new browser signatures we are using for mitmengine, formalize how these signatures are collected, and have tests to ensure mitmengine Reports are as expected)

So far, I've run this new test with multiple types of browser.txt files used for fingerprint comparison. I was hoping this would give me clarity about what I should expect (and sanity check that the merged fingerprints produce the same expected output), but I was surprised to see a userAgent/fingerprint pair I thought was legitimate get deemed an "impossible" pair.

For browser_merge.txt (merged browser fingerprint file):

mitmengine.Report{
MatchedUASignature:"1:70:1:2::1:", 
BrowserSignature:"303:~a,2f,35,9c,ff:~0,5,d:::*:", 
BrowserSignatureMatch:0x1, 
Reason:"invalid_cipher", 
ReasonDetails:"~a,2f,35,9c,ff vs a,2f,35,9c,9d,1301,1302,1303,c013,c014,c02b,c02c,c02f,c030,cca8,cca9", 
BrowserGrade:0x0, 
ActualGrade:0x2, 
WeakCiphers:false, 
LosesPfs:false, 
MatchedMitmSignature:"", 
MatchedMitmName:"", 
MatchedMitmType:0x0, 
Error:error(nil)
}

^ So, does "impossible" mean that the traffic's fingerprint cannot possibly correspond to the presented userAgent? That's what I guess, given that invalid_cipher is given as a reason. This strikes me as weird, given that this was the sixth most popular userAgent/fingerprint string pair I observed as of late.

The non-overlapping cipher, 0xff, corresponds to TLS_EMPTY_RENEGOTIATION_INFO_SCSV (source)

For browser.txt (unmerged browser fingerprint file):

mitmengine.Report{
MatchedUASignature:"1:30-70:1:2::1:", 
BrowserSignature:"301:~?7,?a,?13,?16,2f,?32,?33,35,?38,?39,ff:~?0,?23:::*:", BrowserSignatureMatch:0x1, 
Reason:"invalid_version", 
ReasonDetails:"301 vs 303", 
BrowserGrade:0x0, 
ActualGrade:0x2, 
WeakCiphers:false, 
LosesPfs:false, 
MatchedMitmSignature:"", 
MatchedMitmName:"", 
MatchedMitmType:0x0, 
Error:error(nil)
}

^ This just looks plain wrong; how the heck does this fingerprint get compared to something with just TLS 1.0? I think this is because my version of browser.txt mixes inputs from clickhouse (which include ~ figures) and this results in weird comparison logic...

For browser_clickhouse.txt (current merged signatures used by clickhouse):

got: &errors.errorString{s:"unknown_user_agent"}

^ This explains why so much in our dashboard gets interpreted as "unknown" :) Very much symptomatic of inserter using outdated fingerprints.

The various browser*.txt files used above are included in testdata/mitmengine if you want to peek at them; we can remove them from git later.

gabbifish commented 5 years ago

I found the issue above--the logic for comparing version values had an off-by-one error, where exact version matches weren't matched at all. E.g.:

if a.Major == anyVersion || a.Major > fingerprint.Major {
    return true
}
if a.Major < fingerprint.Major {
    return false
}

to

if a.Major == anyVersion || a.Major >= fingerprint.Major {
    return true
}
if a.Major < fingerprint.Major {
    return false
}

I've submitted a fix for this in the PR. Now new fingerprints match :)

This PR now consists of three components:

  1. Minor refactoring and fixes
  2. Adding new tests for fingerprints to ensure sanity when fingerprint files are updated, plus one benchmark over the processor.Check() function.
  3. Adding a new "best matching" algorithm for fingerprints that cannot be matched to a browser, but should be matched to the closest fingerprint to ensure more accurate Report metrics about "closest match" browsers, like BrowserSignature, BrowserGrade, and Reason. Previously, the last browser record iterated over was, by default, considered the "closest match" browser. This new algorithm simply consists of counting maximum overlap between fingerprint ciphers, extensions, curves, and EC point format; the browser record with the highest count of overlap is treated as the "closest match" browser.

Let me know if you have any other questions! Otherwise I will work on another PR for optimizing record lookups :)