The proxy can learn when a website has been visited

vtoubiana commented 3 years ago

Hi,

My understanding is that when the browser already has cookies for a given domain, it'll block any attempt to prefetch resources from that domain. Hence, if an entity controls:

The prediction algorithm
The visited website
The proxy

it can learn that a website has already been visited by the browser if it sees no prefetch request for the most probable prefetch candidate. For instance, on Google Search, that would most likely be the top search result. Hence, if the proxy knows which search result page was visited and sees no request for the top search result, it can learn that the browser has cookies stored for that site. My understanding is that it's how search result prefetching works since Chrome 87: https://www.google.com/chrome/privacy/whitepaper.html#netpredict
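To make the inference concrete, here is a minimal sketch of the side channel under the behavior described above (no prefetch when cookies exist). The `Browser` class, the function names, and the `.example` domains are hypothetical and do not come from Chrome or the proxy implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Browser:
    """Toy browser: skips prefetching any domain it holds cookies for."""
    cookie_domains: set[str] = field(default_factory=set)

    def prefetch(self, candidates: list[str]) -> list[str]:
        # Returns the domains for which a prefetch request reaches the proxy.
        return [d for d in candidates if d not in self.cookie_domains]


def proxy_infers_visited(candidates: list[str], seen: list[str]) -> set[str]:
    """Attacker view: candidates with no matching request are presumed visited."""
    return set(candidates) - set(seen)


# Hypothetical search result page with two prefetch candidates.
candidates = ["top-result.example", "second-result.example"]
browser = Browser(cookie_domains={"top-result.example"})
seen = browser.prefetch(candidates)
print(proxy_infers_visited(candidates, seen))  # {'top-result.example'}
```

The attacker needs nothing beyond the candidate list it chose itself and the requests it observes, which is why controlling the prediction, the page, and the proxy together is enough.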

Another problematic case is when an entity that can force the browser to visit a search result page can also monitor traffic (e.g. a hostel access point that can open some Google search result pages as a pop-under of its login page and then listen to the traffic). If the entity sees no traffic to the CONNECT proxy, it can assume that the browser blocked the request because it already had a resource for that domain.

I don't know if I'm clear enough and/or if I missed something in the way this works.

Best regards,

Vincent

KenjiBaheux commented 3 years ago

Hi @vtoubiana

Thanks for the feedback, we'll look into these scenarios carefully and explore potential options. We'll update this issue as soon as possible.

One thing I would like to clarify:

"My understanding is that it's how search result prefetching works since Chrome 87: https://www.google.com/chrome/privacy/whitepaper.html#netpredict"

The "Privacy-preserving search result link prefetching" feature is not launched yet. Through this github repo and engagement with the community, we hope to refine and generalize the proposal. So, thank you again for playing a part by sharing such thoughtful feedback!

vtoubiana commented 3 years ago

Hi @KenjiBaheux,

Thank you for your answer.

Regarding the fact that the feature is not launched, maybe the Privacy Whitepaper should be updated to reflect that. Indeed, it mentions in the first paragraph of the document that "This document does not cover features that are still under development, such as features in the beta, dev and canary channel and active field trials, or Android apps on Chrome OS if Play Apps are enabled."

That being said, I'm glad that it was mentioned so that we can start this conversation :).

Best,

Vincent

KenjiBaheux commented 3 years ago

Sorry for the delay and thanks again Vincent for sharing these scenarios.

Here is our thinking about how to address the underlying concern:

There are a few reasons why the signal would be inherently noisy, from opt-outs to user agent heuristics designed to balance tradeoffs. On the latter, the tradeoffs span data usage, battery usage, prioritizing main page activity, value of prefetching a given page, historical data about a referrer’s prefetching hit rate, etc.
That said, we can further decrease the S/N ratio by adding some amount of credential-less prefetches of main resources even when cookies are present. In fact, we had already implemented this for sub-resources. We recently landed the necessary changes to the WIP implementation to extend this to main resources.
Eventually, we also want to help developers take advantage of privacy-preserving prefetching even for links with cookies. Our current plan consists of an opt-in API for developers to communicate that their pages are capable of handling an upgrade from credential-less prefetches to with-credentials context on navigation. This would further decrease the S/N ratio.
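As a rough illustration of the noise idea in the second point, the sketch below sends a credential-less prefetch with some probability even when cookies are present. The `decoy_rate` parameter and the function name are assumptions made for this example; the thread does not say how often Chrome sends such prefetches.

```python
import random


def should_send_prefetch(has_cookies: bool, decoy_rate: float, rng: random.Random) -> bool:
    """Decide whether a credential-less prefetch goes out for one candidate.

    decoy_rate is a hypothetical knob: the chance of prefetching anyway
    when cookies are present for the candidate's domain.
    """
    if not has_cookies:
        return True
    # random() is in [0.0, 1.0), so a rate of 0 never sends and 1 always sends.
    return rng.random() < decoy_rate


rng = random.Random(0)  # seeded only to make the sketch repeatable
trials = 10_000
sent = sum(should_send_prefetch(True, 0.5, rng) for _ in range(trials))
print(f"prefetches sent despite cookies: {sent / trials:.2%}")
```

With a rate below 1, a missing prefetch still implies cookies, so this makes the signal weaker rather than removing it.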

Let us know if you have any further concerns.

ghost commented 3 years ago

I always thought I understood how private prefetch proxy works in Chrome. With the comment above, I'm confused again :)

"There are a few reasons why the signal would be inherently noisy, from opt-outs to user agent heuristics designed to balance tradeoffs. On the latter, the tradeoffs span data usage, battery usage, prioritizing main page activity, value of prefetching a given page, historical data about a referrer’s prefetching hit rate, etc."

Can you explain how that would help directly or indirectly with privacy? E.g., let's say google.com inserts prefetches for foo.com and bar.com (in that sequence); if bar.com is fetched but foo.com is not, then it's a clear signal to Google that the user has visited foo.com in the past. Are you suggesting that the browser's heuristics around data usage or battery may somehow result in bar.com being prefetched but not foo.com?

"That said, we can further decrease the S/N ratio by adding some amount of credential-less prefetches of main resources even when cookies are present. In fact, we had already implemented this for sub-resources. We recently landed the necessary changes to the WIP implementation to extend this to main resources."

I think adding noise is definitely helpful, but I think it only handles a small subset of the problems. Let's say Google inserts prefetches for foo.com, bar.com, and baz.com; the user has cookies for foo.com, and Chrome is configured to do 2 prefetches. In this example, Chrome would send a credential-less prefetch for foo.com, and it would still prefetch bar.com and baz.com. The prefetch for baz.com, along with the information that Chrome is configured to do 2 prefetches, is sufficient to reveal to Google that the user has browsed foo.com in the past, even though Chrome did a noisy prefetch for foo.com.
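The counting argument can be sketched with a toy model. It assumes Chrome walks the candidates in order until it has made its configured number of usable prefetches, and that a decoy request for a domain with cookies does not count toward that limit. Both assumptions, and the function names, are hypothetical.

```python
def requests_seen(candidates, cookie_domains, limit):
    """Toy model: walk candidates in order until `limit` usable prefetches are made.

    Domains with cookies still get a credential-less decoy request, but the
    decoy does not count toward the limit (an assumption of this sketch).
    """
    seen, usable = [], 0
    for domain in candidates:
        if usable == limit:
            break
        seen.append(domain)
        if domain not in cookie_domains:
            usable += 1
    return seen


def min_visited_count(seen, limit):
    """Observer view: every request beyond `limit` implies at least one decoy."""
    return max(0, len(seen) - limit)


candidates = ["foo.com", "bar.com", "baz.com"]
seen = requests_seen(candidates, cookie_domains={"foo.com"}, limit=2)
print(seen)                        # ['foo.com', 'bar.com', 'baz.com']
print(min_visited_count(seen, 2))  # 1
```

In this model the observer learns that at least one candidate ranked above baz.com holds cookies. Since the referrer controls the candidate order, it could narrow that down further, for example by listing only foo.com and bar.com with a limit of 1.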

This also seems vulnerable to repeated attacks: Google can try the prefetch multiple times, and a single instance of a missing prefetch would be sufficient to leak the user’s browsing history.
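A back-of-the-envelope calculation shows how quickly repetition erodes the noise. It assumes a decoy prefetch is sent independently with probability `decoy_rate` on each attempt, and that a prefetch goes missing only when cookies are present. The function name and the rate used are illustrative, not Chrome's actual values.

```python
def leak_probability(decoy_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` tries shows a missing prefetch.

    Assumes independent attempts against a domain the user holds cookies for.
    """
    return 1.0 - decoy_rate ** attempts


for n in (1, 5, 20):
    print(n, leak_probability(0.5, n))
```

Under these assumptions, only a decoy rate of 1 (always prefetching) keeps the leak probability at zero however often the attack is repeated.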

"Eventually, we also want to help developers take advantage of privacy-preserving prefetching even for links with cookies. Our current plan consists of an opt-in API for developers to communicate that their pages are capable of handling an upgrade from credential-less prefetches to with-credentials context on navigation. This would further decrease the S/N ratio."

That makes sense. It would be good to still make the protocol private in the meantime.

I think the feature would benefit from the PING privacy review. There are other Chrome features that are already under review there (https://lists.w3.org/Archives/Public/public-privacy/2021JanMar/thread.html).

ghost commented 3 years ago

Are there any updates here? Did you get a chance to take an in-depth look at this?

buettner commented 3 years ago

Sorry for the long delay!

There are two distinct scenarios and it’s helpful to consider them separately. In one case, the attacker controls both the speculation rules on the website and also the network the user is on (i.e., the evil hostel scenario in the first post). In the other case, the concern is that the referring website operator and the proxy operator collude to learn the user’s history, with the Google scenario from the first post being a specific case.

In the former, the user agent (i.e., Chrome) can’t know if the user is being exposed to the evil hostel attack, which means the user agent needs to have robust countermeasures. In our initial deployment model, we will not allow other websites to initiate private prefetches while we gauge interest from the community and iterate on a viable model. Before moving forward with opening up the proxy to third-party sites, we’ll need to address the evil hostel concern.

In the latter scenario where Google controls the speculation rules and the proxy, the user agent has the benefit that it can know exactly how the data is being used. The Chrome whitepaper and privacy policy will describe how information learned by the proxy is used before launch (it takes some time for those to be published), and we will follow that description. In the meantime before those documents are published, this issue describes our initial experiment. It clarifies that no user identifier is sent on requests to the proxy, and that any information learned by the proxy is used solely to facilitate anonymous prefetching and is not linked to other information from the user’s Google account.

With that said, the upcoming experiment will give us a better understanding of the bandwidth vs performance improvement tradeoff. This knowledge will help inform our discussions about mechanisms to mitigate this theoretical side-channel. The concern we’re trying to balance is that sending prefetch requests when we know we can’t use the response consumes both user and publisher bandwidth, but does not improve the user experience.

buettner commented 2 years ago

Based on our experiments, prefetching introduced much less than 1% of additional network traffic. We are now making prefetch requests, without cookies, in all cases except those where Google already knows whether the user has visited the site before (e.g., users with Sync enabled). If the user did have a cookie but we did not send it, we will not use the prefetched resource, because we can't know whether the response would have been different had we sent the cookie.
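The resulting behavior can be summarized in a small sketch: the request always goes out without cookies, and the cookie state only affects whether the response is used locally. `PrefetchPlan`, `plan_prefetch`, and `proxy_view` are hypothetical names for illustration, and the Sync case mentioned above is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrefetchPlan:
    send_request: bool   # does a request reach the proxy?
    send_cookies: bool   # are cookies attached to it?
    use_response: bool   # may the response serve a later navigation?


def plan_prefetch(has_cookie: bool) -> PrefetchPlan:
    """Always prefetch without cookies; only use the response if no cookie exists."""
    return PrefetchPlan(send_request=True, send_cookies=False, use_response=not has_cookie)


def proxy_view(plan: PrefetchPlan) -> tuple[bool, bool]:
    """What an observer on the proxy side can see: request sent, cookies attached."""
    return (plan.send_request, plan.send_cookies)


with_cookie = plan_prefetch(has_cookie=True)
without_cookie = plan_prefetch(has_cookie=False)
print(proxy_view(with_cookie) == proxy_view(without_cookie))  # True
print(with_cookie.use_response, without_cookie.use_response)  # False True
```

Because the two cases look the same from the proxy side, the presence or absence of a request no longer depends on the cookie state. The cost is the bandwidth spent on responses that are discarded.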

RobertMiller23 commented 2 years ago

We're exploring using Chrome's private prefetch proxy (see discussion here). However, it's unclear to us whether we need to update our site's privacy policy to account for the Google proxy learning about users' past visits. Is there any guidance that you can provide to web developers?

buettner commented 2 years ago

The proxy does not learn about users' past visits. The Google proxy does not learn anything that Google does not already know from serving the search results to the user.

buettner / private-prefetch-proxy

The proxy can learn when a website has been visited #7