antifraudcg / proposals

Proposals for the Anti-Fraud Community Group.

Cross-Partition Cookie Age Signal for Abuse Detection #9

Open philippp opened 2 years ago

philippp commented 2 years ago

We would like to propose a new API for the browser to reveal a bucketized time interval since any cookies for the inquiring origin were reset. The goal of the API is to provide a low entropy signal that can be useful for identifying deceptive clients that reset their partitioned state in order to appear as a multitude of distinct clients.

The ability to differentiate between unique users and overactive clients is paramount for fighting online fraud and abuse, such as DoS attacks and invalid traffic. Third-party cookies currently provide anti-abuse systems with a simple way of uniquely identifying users across the web. For example, this ability allows us to determine whether there is unusual activity (multiple requests or clicks) associated with a single cookie, which could be interpreted as abuse.

Because third-party cookies make it easy to distinguish unique user activity, it is natural for bad actors to avoid detection by clearing their cookies. Alternatively, they can use multiple bots to perform the same type of abuse. In doing so, however, bad actors allow service providers to detect abuse using a different signal: the cookie age. Indeed, in a large-scale attack, the cookie age observed on the traffic generated by an abuser will tend to follow a very different pattern from that of regular traffic, thereby allowing abuse detection organizations to cluster abnormal behavior using this signal.

With the forthcoming deprecation of third-party cookies, abuse detection organizations will lose the ability to uniquely identify users and to analyze clusters of cookie age. One possible solution to recover the cookie age signal is the use of the Trust Token API, which allows issuers to encode a cross-site signal of ~2.58 bits. The main limitation of this technology is that browsers must impose a limit on the number of tokens that can be redeemed on a website. This is because each token contains ~2.58 bits of information, so multiple tokens could be combined into a cross-site fingerprint.
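To make the redemption-limit argument concrete, here is a rough arithmetic sketch. It assumes that ~2.58 bits corresponds to six distinguishable values per token (log2(6) ≈ 2.585) and that tokens carry independent values, so the result is an upper bound; the function name and constant are illustrative only.

```python
import math

# Assumption: ~2.58 bits per token corresponds to six distinguishable values.
VALUES_PER_TOKEN = 6


def max_bits_from_tokens(token_count: int, values_per_token: int = VALUES_PER_TOKEN) -> float:
    """Upper bound on the cross-site information carried by `token_count` tokens."""
    if token_count < 0:
        raise ValueError("token_count must be non-negative")
    return token_count * math.log2(values_per_token)


if __name__ == "__main__":
    for count in (1, 2, 4, 8):
        print(f"{count} token(s): up to {max_bits_from_tokens(count):.2f} bits")
```

Under these assumptions the bound grows linearly with the number of tokens redeemed, which is why a per-site cap matters.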

Given the considerations and constraints discussed above, we believe that age signals will allow partitioned cookies to be a crucial component for future abuse detection systems. This document proposes that Chrome provide a CookieAge API that would encode the time since a user reset any cookies associated with the inquiring origin. The CookieAge API would return a low entropy representation of the age of relevant cookies. This could, for instance, be a bucketized age in N buckets, where N would be small enough to ensure that the signal cannot be used as a fingerprint.
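As a rough illustration of what a bucketized age could look like (not part of the proposal: the bucket boundaries, the value of N, and the function name below are all hypothetical), a minimal sketch:

```python
from datetime import timedelta

# Hypothetical bucket upper bounds; the proposal does not fix N or the boundaries.
# Three bounds give N = 4 buckets, i.e. at most 2 bits of information.
BUCKET_UPPER_BOUNDS = [
    timedelta(hours=1),
    timedelta(days=1),
    timedelta(days=7),
]


def bucketize_cookie_age(age: timedelta) -> int:
    """Return the index of the first bucket whose upper bound exceeds `age`.

    Ages at or beyond the last bound fall into the final, open-ended bucket.
    """
    if age < timedelta(0):
        raise ValueError("age must be non-negative")
    for index, upper_bound in enumerate(BUCKET_UPPER_BOUNDS):
        if age < upper_bound:
            return index
    return len(BUCKET_UPPER_BOUNDS)
```

Only the bucket index would be exposed to the inquiring origin; the exact age stays inside the browser.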

What properties should the partitioned cookie age have?

Based on the above considerations, we think the cookie age should encode the time since a browser set its first cookie associated with the inquiring origin, and should reset to zero whenever any cookie associated with the inquiring origin is reset.

bmayd commented 2 years ago

We would like to propose a new API for the browser to reveal a bucketized time interval since any cookies for the inquiring origin were reset.

Please let me know if I’m understanding this correctly:

A browser requests a URL from a.example with no cookies, so a.example sets a new cookie. The a.example cookie is then cleared from the browser, which makes a subsequent request to a.example, and a.example again sets a new cookie.

This proposal seeks to signal: “If this browser has no a.example cookie, did it have one previously and if so, how long ago was it cleared?”

Assuming that's correct, is the idea something like:

philippp commented 2 years ago

Brian,

Your understanding as stated in the text up to the bullet points is correct. We're not settled on what a concrete implementation would look like (e.g. whether this is relayed via a header, or a queryable browser API), and would likely report a bucketed age (e.g. "cookies were last cleared <1 day ago") instead of a timestamp. Beyond that, bullets 1 and 2 are in line with our expectations. Whether the cookie-cleared value is reset when a new cookie is set is still undecided: there may be value in persisting the "time since cookies were last cleared" value, and only resetting it on the next cookie-clearing event.
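The two candidate behaviours for the undecided point could be modelled roughly as follows. This is a hypothetical sketch of per-origin state, not a proposed implementation; the class, its fields, and the caller-supplied clock (seconds as a float) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClearedState:
    """Hypothetical per-origin record of the last cookie-clearing event."""

    # True: setting a new cookie forgets the clearing event.
    # False: the value persists until the next clearing event.
    reset_on_new_cookie: bool
    last_cleared_at: Optional[float] = None

    def on_cookies_cleared(self, now: float) -> None:
        self.last_cleared_at = now

    def on_cookie_set(self, now: float) -> None:
        if self.reset_on_new_cookie:
            self.last_cleared_at = None

    def seconds_since_cleared(self, now: float) -> Optional[float]:
        """Return the elapsed time since clearing, or None if nothing is recorded."""
        if self.last_cleared_at is None:
            return None
        return now - self.last_cleared_at
```

With reset_on_new_cookie=False, a site that sets a fresh cookie right after a clearing event can still observe the clearing on later visits; with True, that signal disappears as soon as the new cookie is set.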

bakkot commented 2 years ago

Since in this scenario the attacker is in control of the browser, how can you trust this signal? With cookies there's generally more information beyond just the age (e.g. a MAC'd user ID) which the site can check against other claimed attributes (user agent, etc), so you can't just fake having an old cookie. But if this API only exposes the age, what's to prevent the attacker from modifying the browser to claim that they have lots of old cookies from third parties without even having interacted with the third party? How could a website tell they were lying about that?

philippp commented 2 years ago

Kevin, you are correct that the same attack is still possible by modifying the browser.

  1. Is there value in preventing the attack on unmodified browsers?
  2. Are there (widely available) alternatives that should be considered?

I (perhaps naively) wonder whether this could dovetail with https://github.com/antifraudcg/proposals/issues/8: We start by developing widely available but more brittle signals, so defender teams can start to understand the distributions of benign traffic and develop filters. Naive attacks will get flagged, and once a high-confidence signal becomes available for a platform, defenders can immediately make use of it.

supanate7 commented 2 years ago

That attack might also be made more difficult to successfully execute by using Trust Tokens from the third parties (thus forcing the attacker to either interact with the 3rd party or steal the necessary Trust Tokens from a user). Also, depending on the Trust Token issuance logic used by the 3rd party, there might also be some frequency capping and token expiration benefits.

npdoty commented 2 years ago

I'm not sure I understand the proposal. Is this a straight browser API to indicate the last time the user cleared all their cookies? That would apparently be trivially controlled by the fraudulent attacker; that is, it would never be a high-confidence signal.

Or is it adding additional entropy to the Trust Tokens (or similar) API to encode not just that a user has a valid token and the age of the token but also some private metadata about how old that user's cookies are for the issuer? That doesn't seem like it needs a browser API at all; the third-party issuer knows how old the user's cookies are, or when they last interacted with them. I can see that many would like to have more detailed encoding of user information in encrypted metadata sent through trust tokens, but it seems like a significant additional privacy loss for the user.

philippp commented 2 years ago

Is this a straight browser API to indicate the last time the user cleared all their cookies? That would apparently be trivially controlled by the fraudulent attacker; that is, it would never be a high-confidence signal.

Yes, this would be a 'straight browser API' that would indicate when the user last cleared (or modified) cookies that belong to the inquiring domain. This is a brittle signal; however, it closes a loophole in which a single attacker can use common browser UI (clearing cookies) to appear as a number of new visitors to an embedded instance of a site.

The only net-new information that would be introduced would be a bucketed time since the user last cleared the cookies associated with a domain, and this would be readable only by the domain itself. Unlike the age of a cookie, this is effectively the "age of the absence of a cookie." Assuming that the biggest time bucket is ">=1 day," the privacy impact would be that after clearing cookies for a site, any future visit during the same day would inform the site that you had cleared some cookies for the site that day.

In exchange, the site would know that you are not a net-new user, and would be able to more accurately measure its audience in both 1P and 3P contexts.

Can you elaborate on your privacy concerns and what scenarios may be especially problematic?

bmayd commented 2 years ago

In exchange, the site would know that you are not a net-new user, and would be able to more accurately measure its audience in both 1P and 3P contexts.

Correct me if I'm misstating: the site would know that some browser on which it previously set cookies for its domain at some time in the past had cleared those cookies within a given time range (e.g. the last day). There is a possibility that it could be used to identify a specific browser (for example, if only one browser had ever visited a domain, it would be the only browser that could have cleared its cookies), but that seems unlikely in real-world scenarios.

@philippp I think it might be useful to allow a domain to know if cookies had been reset within the browser's calendar day vs a static time range. Is that something that could be supported by this?

npdoty commented 2 years ago

We don't have specifics of an API yet so I can't give a full privacy review, but my initial concerns would be:

bmayd commented 2 years ago

I'd want to give more thought to most of these before responding, but I agree this one is a reasonable concern:

indicating when cookies were last cleared could reveal past browsing history, both to sites and to local attackers (browsers would have to keep logs of sites visited beyond when users specifically chose to clear history)

Regarding revealing something about history to sites: yes, it is the point of the feature to reveal to sites that they have interacted with the browser before. I don't have a good sense of how revealing it might be, but clearly it is more than nothing.

I find the latter part, having browsers maintain a list of sites for which cookies have been deleted, more concerning; I think it would be reasonable to expect that when cookies are cleared, references to the sites associated with them are not left behind.

A possible mitigation would be to have the browser maintain a list of hashes of the eTLD+1 of sites for which cookies had been deleted. This would still allow a means of checking if cookies were cleared by asking the API to hash and check the current domain, but someone gaining access to the list wouldn't learn anything useful. The hashing could be made unique to the browser instance as well, so comparing lists of hashes also wouldn't reveal anything.
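A minimal sketch of that mitigation, assuming a keyed hash (HMAC-SHA256) with a random per-browser-instance key stands in for "hashing made unique to the browser instance". The class and method names are hypothetical, and the caller is assumed to pass an already-computed eTLD+1.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Optional


class ClearedSiteLog:
    """Hypothetical store of cookie-clearing times keyed by hashed eTLD+1."""

    def __init__(self) -> None:
        # Random per-instance key, so digests differ between browser instances.
        self._key = secrets.token_bytes(32)
        self._cleared: Dict[bytes, float] = {}

    def _digest(self, site: str) -> bytes:
        # Assumes `site` is already an eTLD+1; only case is normalised here.
        return hmac.new(self._key, site.lower().encode("utf-8"), hashlib.sha256).digest()

    def record_cleared(self, site: str, now: float) -> None:
        self._cleared[self._digest(site)] = now

    def last_cleared(self, site: str) -> Optional[float]:
        """Return the recorded clearing time for `site`, or None if absent."""
        return self._cleared.get(self._digest(site))
```

One caveat with this sketch: someone who obtains the key along with the list could still test candidate domains against it, so the key would need to be protected separately from the list.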

dvorak42 commented 2 years ago

We'll be briefly (3-5 minutes) going through open proposals at the Anti-Fraud CG meeting this week. If you have a short 2-4 sentence summary/slide you'd like the chairs to use when representing the proposal, please attach it to this issue; otherwise, the chairs will give a brief overview based on the initial post.

philippp commented 2 years ago

I'll summarize it as:

"Services embedded in long-tail sites cannot distinguish between a genuine new visitor and someone seeking to inflate metrics by repeatedly clearing cookies. As the browser tracks the cookie clearing state, this indicator would have low confidence unless the mutation of the indicator can be made non-trivial.

As such, it may make sense to revisit the proposal at a time when such hardening is possible."

dvorak42 commented 2 years ago

At the CG meeting, there was some discussion of whether we can find specific classes of problems that could be solved with the lower-confidence version of this. Otherwise, without some sort of proof of the age, this seems less useful and could potentially be revisited later.