Open roguism opened 8 years ago
This is excellent work! Give me some time to digest everything you have said here; you have clearly put a lot of thought into it. We did some experiments in the vein of suggestion 2, but as you predicted, it caused a fair number of false positives without much of a privacy gain. I will read through your other suggestions.
Thanks, a couple further notes:
So I think that the fourth approach could be made to work for keys and values if we also keep a whitelist of low-entropy cookie values (which we currently have, and it includes language codes). I think it's safe to assume that benign developers will not change cookie names very frequently. I do, however, still think that we should estimate the entropy of a potential tracking cookie before we mark it as tracking. For example, a cookie could take the values {3, 4, 5} across visits and still not be useful as a tracking cookie, so we would not want to mark it as such.
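One rough way to make that estimate is to bound the entropy of each observed value by its length and the alphabet it draws from. The sketch below is only an illustration under stated assumptions: the function names and the 12-bit threshold are hypothetical, and this is not Privacy Badger's actual heuristic.

```python
import math
import string

# Hypothetical cutoff; a real threshold would need tuning against crawl data.
MIN_TRACKING_BITS = 12.0

def alphabet_size(value):
    """Size of the smallest coarse alphabet covering every character in value."""
    size = 0
    if any(ch in string.digits for ch in value):
        size += 10
    if any(ch in string.ascii_lowercase for ch in value):
        size += 26
    if any(ch in string.ascii_uppercase for ch in value):
        size += 26
    if any(ch not in string.digits + string.ascii_letters for ch in value):
        size += 32  # assumption: treat everything else as ASCII punctuation
    return size

def estimate_bits(value):
    """Upper bound on entropy: length * log2(alphabet size)."""
    if not value:
        return 0.0
    return len(value) * math.log2(alphabet_size(value))

def could_be_tracking(values, threshold=MIN_TRACKING_BITS):
    """True only if every observed value is large enough to identify a client."""
    values = list(values)
    return bool(values) and all(estimate_bits(v) >= threshold for v in values)

print(could_be_tracking(["3", "4", "5"]))  # False
print(could_be_tracking(["9f86d081884c7d65", "2c26b46b68ffc68f"]))  # True
```

Requiring every observed value to clear the threshold is a deliberately conservative choice here, since the concern raised above is false positives on small counters such as {3, 4, 5}.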
From Tracking the Pixels: Detecting Unknown Web Trackers via Analysing Invisible Pixels, a crawl of 8,744 top Alexa domains:
The cookies with identifier cookie as name represent only 0.87% of the total number of cookies. Therefore, we will exclude them from our study.
Privacy Badger only checks the entropy of cookie values, not of cookie keys. I wrote a PoC tracker that can exploit this bug by storing tracking IDs in the key.
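To illustrate the gap, here is a minimal sketch of a check that inspects cookie names as well as values. The helper names are hypothetical, `looks_like_id` is a crude placeholder (long alphanumeric token containing a digit) rather than a real entropy test, and the cookie header is made up for the example.

```python
def parse_cookie_header(header):
    """Split a Cookie header string into (name, value) pairs."""
    pairs = []
    for part in header.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        pairs.append((name.strip(), value.strip()))
    return pairs

def looks_like_id(token, min_len=16):
    """Placeholder heuristic (an assumption): long alphanumeric token with a digit."""
    return (
        len(token) >= min_len
        and token.isalnum()
        and any(ch.isdigit() for ch in token)
    )

def flag_cookies(header):
    """Return (cookie name, parts that look like an ID) for each suspicious cookie."""
    flagged = []
    for name, value in parse_cookie_header(header):
        parts = [
            label
            for label, token in (("name", name), ("value", value))
            if looks_like_id(token)
        ]
        if parts:
            flagged.append((name, parts))
    return flagged

# Made-up header: the identifier lives in the cookie name, the value is trivial.
header = "lang=en; id9f86d081884c7d65=1"

# A value-only check sees nothing suspicious.
print([n for n, v in parse_cookie_header(header) if looks_like_id(v)])  # []

# Checking names as well catches the cookie.
print(flag_cookies(header))  # [('id9f86d081884c7d65', ['name'])]
```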
To determine whether this is being used in the wild, I wrote a Chrome extension that scraped the Alexa top 1000 sites, logging each third-party cookie that escaped notice at this point. I repeated this scraping three times (result sets 1, 2, 3), each time starting with a clean browser with no cookies. If an ad server assigns a new tracking ID to every clean client, then that should show up as two different cookies. Sure enough, from sets two and three, we have:
If the "id" prefix didn't give away its purpose, 16 bytes is certainly enough entropy for a tracking cookie. (ctnsnet.com is a Russian IP with DNS whois privacy, and visiting that URL simply returns an error.)
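The comparison between clean-browser runs described above can be sketched roughly as follows. This is not the extension's code: the data layout (one dict per run, mapping a third-party domain to the set of cookie names seen) and the domains and cookie names are hypothetical, chosen only to show the idea.

```python
def names_unique_to_each_run(runs):
    """For each domain, list the cookie names seen in one run but in no other.

    runs: list of dicts mapping domain -> set of cookie names from one clean run.
    Only domains where at least two runs each produced a name that no other
    run saw are returned, since that pattern suggests per-client cookie names.
    """
    result = {}
    domains = set().union(*(run.keys() for run in runs))
    for domain in domains:
        per_run = [run.get(domain, set()) for run in runs]
        unique = []
        for i, names in enumerate(per_run):
            others = set().union(*(p for j, p in enumerate(per_run) if j != i))
            unique.append(names - others)
        if sum(1 for u in unique if u) >= 2:
            result[domain] = unique
    return result

# Made-up result sets from two clean-browser runs.
run_a = {"tracker.example": {"idAAAA1111", "lang"}, "cdn.example": {"theme"}}
run_b = {"tracker.example": {"idBBBB2222", "lang"}, "cdn.example": {"theme"}}

suspects = names_unique_to_each_run([run_a, run_b])
print(suspects)  # {'tracker.example': [{'idAAAA1111'}, {'idBBBB2222'}]}
```

Stable names such as `lang` or `theme` appear in every run and drop out of the difference, which fits the assumption elsewhere in this thread that benign developers rarely change cookie names.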
There may be other trackers that this test didn't catch; perhaps they only appear once the user is logged into these services, or only for a different user-agent/IP fingerprint. Real-world data may show more.
Solutions