Key-based cookie tracking

roguism commented 8 years ago

PrivacyBadger only checks the entropy of cookie values, not of the keys. I wrote a PoC tracker that exploits this bug by storing tracking IDs in the key.

To determine whether this is being used in the wild, I wrote a Chrome extension that scraped the Alexa top 1000 sites, logging each third-party cookie that currently escapes notice. I repeated the scrape three times (result sets 1, 2, 3), each time starting with a clean browser with no cookies. If an ad server assigns a new tracking ID to every clean client, that should show up as different cookies in different result sets. Sure enough, from sets two and three, we have:

"ctnsnet.com" : [
    "opt=0",
    " cid_dbc00370ddb64184b4f5f0719cbf9281=1"
]

"ctnsnet.com" : [
    "opt=0",
    " cid_f22a8562235a4abbbb32fb41638ec49d=1"
]

If the "id" prefix didn't give away its purpose, 16 bytes is certainly enough entropy for a tracking cookie. (ctnsnet.com resolves to a Russian IP, the domain has WHOIS privacy enabled, and visiting that URL simply returns an error.)
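The comparison between result sets can be sketched roughly as follows. This is a minimal illustration, not the extension I actually used: the function names and the dict-of-lists layout are assumptions modelled on the listing above.

```python
import re
from typing import Dict, List, Set

def cookie_keys(cookies: List[str]) -> Set[str]:
    """Return the cookie names from a list of 'key=value' strings."""
    return {c.strip().split("=", 1)[0] for c in cookies}

def keys_that_differ(set_a: Dict[str, List[str]],
                     set_b: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """For each domain present in both crawls, collect the keys seen in only one crawl."""
    diff = {}
    for domain in set_a.keys() & set_b.keys():
        changed = cookie_keys(set_a[domain]) ^ cookie_keys(set_b[domain])
        if changed:
            diff[domain] = changed
    return diff

def longest_hex_run_bits(key: str) -> int:
    """Bits carried by the longest run of lowercase hex characters (4 bits each)."""
    runs = re.findall(r"[0-9a-f]+", key)
    return 4 * max((len(r) for r in runs), default=0)

# The two observations quoted above (hypothetical variable names).
result_set_2 = {"ctnsnet.com": ["opt=0", " cid_dbc00370ddb64184b4f5f0719cbf9281=1"]}
result_set_3 = {"ctnsnet.com": ["opt=0", " cid_f22a8562235a4abbbb32fb41638ec49d=1"]}

suspects = keys_that_differ(result_set_2, result_set_3)
for domain, keys in suspects.items():
    for key in sorted(keys):
        print(domain, key, longest_hex_run_bits(key), "bits")
```

The stable `opt` key drops out of the diff, and only the two `cid_...` keys remain, each carrying 32 hex characters (16 bytes).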

There may be other trackers that this test didn't catch. Perhaps they only appear once the user is logged into some of these services, or only for a different user-agent/IP fingerprint. Real-world data may show more.

Solutions

1. Of course the simplest solution is to blacklist ctnsnet.com, but this doesn't cover future cases. Is there an algorithmic approach?
2. One could try applying the current value-based approach to the keys, but that seems infeasible. There is a large amount of variability in key names across the web, too much to enumerate. Although most benign websites will have fixed cookie keys, each website developer will pick their own. Browsing through the result sets demonstrates this.
3. Ultimately, the client must learn how much entropy is in the cookies from a given third party, in both keys and values. Entropy has no meaning in isolation; it must be understood in the context of what other cookies that server is distributing to other clients. This implies sharing cookies with other clients, either through a trusted third party or in a decentralized manner. However, the former involves an unsettling amount of trust in the third party (which would be capable of revealing the browsing patterns of every user), and the latter has similar issues. While I believe the decentralized solution could possibly be made secure with some hashing and crypto-magic, it would be a massive and failure-prone undertaking to address something that is currently only evidenced on one site.
4. The best solution may be to become the second client you're comparing your cookies against. This clear-cache/re-request/compare method is what revealed the tracker anyway. Caution should be exercised in the timing of the re-request; a smart tracking server could reissue the same tracking cookie when it sees requests from the same user-agent/IP within a short time frame. Because PB makes no assurances about tracking during the initial training window, we are free to send this comparison request at our leisure, preferably after some random interval.

cooperq commented 8 years ago

This is excellent work! Give me some time to digest everything you have said here; you have clearly put a lot of thought into it. We did some experiments in the vein of suggestion 2, but as you predicted it caused a fair number of false positives without much of a privacy gain. I will read through your other suggestions.

roguism commented 8 years ago

Thanks, a couple further notes:

To clarify solution 3, by crypto-magic I was considering homomorphic encryption and/or secure multi-party computation, which was melting my brain. It's fascinating and I believe it's possible, but it's still a rather academic field, so perhaps we shouldn't be the ones breaking ground here. (I also read Jonathan Mayer's paper and section II.A is essentially what I was getting at.)
I'd like to amend suggestion 4: rather than PB initiating requests after a random time frame, piggyback on the user's own requests. That is, for the first 3 requests to a given third party, have PB remember the Set-Cookie headers but block them from Chrome. After 3 requests, compare: if all are the same, allow cookies through; if any differ, block permanently.
However, a travelling businessman might see some false positives with approach 4. For example: first request from America "lang=en", second request from Germany "lang=de", third request from Japan "lang=jp". We could avoid this false positive by applying the current approach to cookie values, and approach 4 only to keys. (This assumes all benign developers use stable keys.)
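A rough sketch of the amended approach 4, applied to keys only. None of this is existing PB code: the class name, `observe` method, and the 3-sample constant are hypothetical, and actually withholding the cookies from the browser is left out.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Optional

SAMPLES_NEEDED = 3  # "the first 3 requests" from the suggestion above

class KeyStabilityCheck:
    """Remember the cookie keys from the first few responses per third party."""

    def __init__(self) -> None:
        self._samples: Dict[str, List[FrozenSet[str]]] = defaultdict(list)
        self._verdict: Dict[str, bool] = {}

    def observe(self, third_party: str, set_cookie_headers: List[str]) -> Optional[bool]:
        """Record one response.

        Returns None while still sampling (cookies stay withheld),
        True to allow cookies through, False to block permanently.
        """
        if third_party in self._verdict:
            return self._verdict[third_party]
        # Each Set-Cookie header is "name=value; attr; attr..."; keep only the name.
        keys = frozenset(
            h.split(";", 1)[0].split("=", 1)[0].strip() for h in set_cookie_headers
        )
        samples = self._samples[third_party]
        samples.append(keys)
        if len(samples) < SAMPLES_NEEDED:
            return None
        allow = all(s == samples[0] for s in samples)
        self._verdict[third_party] = allow
        del self._samples[third_party]
        return allow

check = KeyStabilityCheck()
# Benign case from above: the value changes with location, the key does not.
for header in ["lang=en; Path=/", "lang=de; Path=/", "lang=jp; Path=/"]:
    benign = check.observe("cdn.example", [header])
# Key-based tracker: withheld cookies make every request look like a new client.
for header in ["cid_aaaa=1", "cid_bbbb=1", "cid_cccc=1"]:
    tracker = check.observe("tracker.example", [header])
print(benign, tracker)
```

Because only keys are compared here, the "lang" example is allowed through, while a server that mints a fresh key for each apparently new client is blocked.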

cooperq commented 8 years ago

So I think that the fourth approach could be made to work for keys and values if we also keep a whitelist of low-entropy cookie values (which we currently have, and it includes language codes). I think it's safe to assume that benign developers will not change cookie names very frequently. I do, however, still think that we should make some estimate of the entropy of a potential tracking cookie before we mark it as tracking. For example, a cookie could take the values {3, 4, 5} and still not be useful as a tracking cookie, so we would not want to mark it as such.
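To illustrate the kind of estimate I mean, here is a crude upper bound based on string length and character set. It is only a sketch and not the heuristic we currently ship; the function names and the 33-bit threshold (roughly log2 of the world population) are assumptions for the example.

```python
import math
import string
from typing import Iterable

MIN_TRACKING_BITS = 33.0  # assumed threshold: about log2(8 billion people)

def capacity_bits(value: str) -> float:
    """Upper bound on the information a string can carry, from length and alphabet."""
    if not value:
        return 0.0
    if all(c in string.digits for c in value):
        alphabet = 10
    elif all(c in string.hexdigits for c in value):
        alphabet = 16
    elif all(c in string.ascii_letters + string.digits for c in value):
        alphabet = 62
    else:
        alphabet = 95  # printable ASCII
    return len(value) * math.log2(alphabet)

def could_identify_users(observed: Iterable[str]) -> bool:
    """True if any observed value is long enough to hold a unique ID."""
    return max((capacity_bits(v) for v in observed), default=0.0) >= MIN_TRACKING_BITS

print(could_identify_users(["3", "4", "5"]))
print(could_identify_users(["dbc00370ddb64184b4f5f0719cbf9281"]))
```

A handful of observations cannot tell three single-digit values apart from three unique IDs by counting distinct values alone, which is why the sketch bounds the capacity of each string instead.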

ghostwords commented 5 years ago

From "Tracking the Pixels: Detecting Unknown Web Trackers via Analysing Invisible Pixels", a crawl of 8,744 top Alexa domains:

The cookies with identifier cookie as name represent only 0.87% of the total number of cookies. Therefore, we will exclude them from our study.

EFForg / privacybadger

Key-based cookie tracking #642
