EFForg / privacybadger

Privacy Badger is a browser extension that automatically learns to block invisible trackers.
https://privacybadger.org

Key-based cookie tracking #642

Open roguism opened 8 years ago

roguism commented 8 years ago

Privacy Badger only checks the entropy of cookie values, not of cookie keys. I wrote a PoC tracker that exploits this bug by storing tracking IDs in the key.
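To illustrate the gap, here is a minimal Python sketch. It is not Privacy Badger's actual heuristic: the character-frequency entropy estimate and the helper names `shannon_bits` and `parse_cookie_header` are assumptions made for illustration. It scores both sides of each `name=value` pair, using the cookie string observed from ctnsnet.com later in this report.

```python
import math
from collections import Counter


def shannon_bits(s: str) -> float:
    """Crude entropy estimate: per-character Shannon entropy times length."""
    if not s:
        return 0.0
    n = len(s)
    counts = Counter(s)
    per_char = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return per_char * n


def parse_cookie_header(header: str) -> dict[str, str]:
    """Split a 'a=1; b=2' style cookie string into a name -> value dict."""
    pairs = {}
    for part in header.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        pairs[name] = value
    return pairs


cookies = parse_cookie_header("opt=0; cid_dbc00370ddb64184b4f5f0719cbf9281=1")
for name, value in cookies.items():
    # A value-only check sees "0" and "1" (zero bits); the ID hides in the name.
    print(f"{name}: name_bits={shannon_bits(name):.1f} value_bits={shannon_bits(value):.1f}")
```

Both values score zero, so a check that only looks at values has nothing to flag, while the second cookie's name scores far higher than the fixed name `opt`.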

To determine whether this is being used in the wild, I wrote a Chrome extension that scraped the Alexa top 1000 sites, logging each third-party cookie that currently escapes Privacy Badger's notice. I repeated the scrape three times (result sets 1, 2, 3), each time starting with a clean browser with no cookies. If an ad server assigns a new tracking ID to every clean client, the same server should hand out a different cookie in each result set. Sure enough, from sets two and three, we have:

"ctnsnet.com" : [
    "opt=0",
    " cid_dbc00370ddb64184b4f5f0719cbf9281=1"
]

"ctnsnet.com" : [
    "opt=0",
    " cid_f22a8562235a4abbbb32fb41638ec49d=1"
]

If the "id" prefix didn't give away its purpose, 16 bytes (128 bits) is certainly enough entropy for a tracking cookie. (ctnsnet.com resolves to a Russian IP, its domain registration uses WHOIS privacy, and visiting that URL simply returns an error.)
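The comparison between result sets can be sketched in Python as follows. The data is copied from sets two and three above; the function names and the "run of at least 16 hex digits" rule are assumptions for illustration, not what my extension actually did.

```python
import re

set_two = {"ctnsnet.com": ["opt=0", " cid_dbc00370ddb64184b4f5f0719cbf9281=1"]}
set_three = {"ctnsnet.com": ["opt=0", " cid_f22a8562235a4abbbb32fb41638ec49d=1"]}


def cookie_names(cookies: list[str]) -> set[str]:
    """Extract the name part of each 'name=value' string."""
    return {c.strip().partition("=")[0] for c in cookies}


def changed_names(a: dict, b: dict) -> dict[str, list[str]]:
    """For domains seen in both crawls, list cookie names present in only one."""
    out = {}
    for domain in a.keys() & b.keys():
        diff = cookie_names(a[domain]) ^ cookie_names(b[domain])
        if diff:
            out[domain] = sorted(diff)
    return out


HEX_RUN = re.compile(r"[0-9a-f]{16,}")


def hex_bits(name: str) -> int:
    """Bits carried by the first long hex run in a name (4 bits per hex digit)."""
    match = HEX_RUN.search(name)
    return 4 * len(match.group()) if match else 0


for domain, names in changed_names(set_two, set_three).items():
    for name in names:
        print(domain, name, hex_bits(name), "bits")
```

The fixed name `opt` appears in both crawls and drops out of the diff; each `cid_` name carries 32 hex digits, which is the 16 bytes mentioned above.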

There may be other trackers that this test didn't catch; perhaps they only appear once the user is logged into many of these services, or only for certain user-agent/IP fingerprints. Real-world data may show more.

Solutions

  1. Of course, the simplest solution is to blacklist ctnsnet.com, but this doesn't cover future cases. Is there an algorithmic approach?
  2. One could attempt to apply the current value-based approach to the keys, but that seems infeasible. There is a large amount of variability in key names across the web, too large to enumerate. Although most benign websites will have fixed cookie keys, each website developer will pick their own. Browsing through the result sets demonstrates this.
  3. Ultimately, the client must learn how much entropy is in the cookies from a given third party, in both keys and values. Entropy has no meaning in isolation; it must be understood in the context of the cookies that server is distributing to other clients. This implies sharing cookies with other clients, either through a trusted third party or in a decentralized manner. However, the former involves an unsettling amount of trust in the third party (which would be capable of revealing the browsing patterns of every user), and the latter has similar issues. While I believe the decentralized solution could possibly be made secure with some hashing and crypto-magic, it would be a massive and failure-prone undertaking to address what is currently only evidenced on one site.
  4. The best solution may be to become the second client you're comparing your cookies against. This clear-cache/re-request/compare method is what revealed the tracker in the first place. Caution should be exercised in the timing of the re-request; a smart tracking server could reissue the same tracking cookie when it sees requests from the same user-agent/IP within a short time frame. Because PB makes no assurances about tracking during the initial training window, we are free to send this comparison request at our leisure, preferably after some random interval.
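
A rough Python sketch of option 4, under stated assumptions: `fetch_cookies` is a hypothetical callable that makes one cookie-less request to the third party and returns the cookies it sets as a name -> value dict, `wait` is a hypothetical sleep function, and the delay bounds are arbitrary placeholders. None of this is existing Privacy Badger code.

```python
import random
from typing import Callable


def random_delay(min_s: float = 3600.0, max_s: float = 86400.0) -> float:
    """Pick a random gap so the two requests are harder to correlate."""
    return random.uniform(min_s, max_s)


def compare_fresh_responses(
    fetch_cookies: Callable[[], dict[str, str]],
    wait: Callable[[float], None],
) -> dict[str, list[str]]:
    """Request twice with no stored cookies and report what differs."""
    first = fetch_cookies()
    wait(random_delay())
    second = fetch_cookies()
    shared = set(first) & set(second)
    return {
        # Names present in only one response: possible key-based IDs.
        "names_changed": sorted(set(first) ^ set(second)),
        # Same name, different value: possible value-based IDs.
        "values_changed": sorted(n for n in shared if first[n] != second[n]),
    }


# Usage with canned responses standing in for real network requests.
responses = iter([
    {"opt": "0", "cid_dbc00370ddb64184b4f5f0719cbf9281": "1"},
    {"opt": "0", "cid_f22a8562235a4abbbb32fb41638ec49d": "1"},
])
report = compare_fresh_responses(lambda: next(responses), wait=lambda s: None)
print(report)
```

Passing the fetch and wait functions in as arguments keeps the comparison logic separate from networking and scheduling, which are the parts a tracker could try to game.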

cooperq commented 8 years ago

This is excellent work! Give me some time to digest everything you have said here; you have clearly put a lot of thought into it. We did some experiments in the vein of suggestion 2, but as you predicted, they caused a fair number of false positives without much of a privacy gain. I will read through your other suggestions.

roguism commented 8 years ago

Thanks, a couple further notes:

cooperq commented 8 years ago

So I think that the fourth approach could be made to work for keys and values if we also keep a whitelist of low-entropy cookie values (which we currently have, and which includes language codes). I think it's safe to assume that benign developers will not change cookie names very frequently. I do, however, still think that we should estimate the entropy of a potential tracking cookie before we mark it as tracking. For example, a cookie could take only the values {3, 4, 5}; that would not be useful as a tracking cookie, so we would not want to mark it as such.
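
As a back-of-the-envelope check on that example, here is a small Python sketch. The function names and the 12-bit cutoff are made up for illustration; they are not Privacy Badger's. A cookie that takes only k distinct values can carry at most log2(k) bits, so {3, 4, 5} tops out at about 1.58 bits.

```python
import math


def max_bits(observed_values) -> float:
    """Upper bound on bits carried by a cookie limited to the observed values."""
    distinct = len(set(observed_values))
    return math.log2(distinct) if distinct else 0.0


def could_identify(observed_values, min_bits: float = 12.0) -> bool:
    """Hypothetical cutoff: only flag cookies that could carry min_bits or more."""
    return max_bits(observed_values) >= min_bits


print(round(max_bits([3, 4, 5]), 2))   # three possible values
print(could_identify([3, 4, 5]))
```

The bound only reflects the values seen so far, so a small sample can understate what the server is able to hand out.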

ghostwords commented 5 years ago

From "Tracking the Pixels: Detecting Unknown Web Trackers via Analysing Invisible Pixels", a crawl of 8,744 top Alexa domains:

The cookies with identifier cookie as name represent only 0.87% of the total number of cookies. Therefore, we will exclude them from our study.