buettner / private-prefetch-proxy

Proposal to use a CONNECT proxy to obfuscate the user IP address for privacy-enhanced prefetching.

Deploying Chrome’s Private Prefetch Proxy #15

Closed buettner closed 5 months ago

buettner commented 3 years ago

Deploying Chrome’s Private Prefetch Proxy

We’re beginning to experiment with a Private Prefetch Proxy for Chrome on Android. More information can be found here.

Initially, the proxy will only be available for prefetches initiated by Google Search using Speculation Rules. The reason for this initial restricted scope is that the proxy is run by Google, and to remain compatible with the user’s privacy expectations when visiting a website, the proxy can only receive information about URLs on Google properties (which Google inherently knows about). As a reminder, no user identifier is sent on requests to the proxy, and any information learned by the proxy is used solely to facilitate anonymous prefetching and is not linked to other information from your Google account.

User Opt-out

Users can opt out of Google Search result prefetching via Chrome’s Preload setting. The feature will also be disabled for pages loaded in Incognito mode.

Publisher Opt-out

Some publishers may not want their links prefetched. We give them two ways to opt out:

  1. All prefetches carry the “Purpose: prefetch” request header. Publishers can look for this header and reject the request as needed.
  2. The proxy also fetches the origin’s traffic advice, and will stop prefetching URLs if directed to do so.
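
A minimal sketch of the first option, assuming a Python WSGI application. The wrapper name `reject_prefetch` and the `should_reject` hook are hypothetical; the 403 status follows the repository README quoted later in this thread. Because other Chrome prefetches may carry the same header, a real deployment would likely narrow `should_reject` rather than refuse every prefetch.

```python
def reject_prefetch(app, should_reject=lambda environ: True):
    """Wrap a WSGI app so requests carrying "Purpose: prefetch" can be refused.

    `should_reject` is a hypothetical hook that lets the publisher limit the
    rejection to specific paths or client addresses.
    """
    def wrapped(environ, start_response):
        # WSGI exposes the "Purpose" request header as HTTP_PURPOSE.
        purpose = environ.get("HTTP_PURPOSE", "").strip().lower()
        if purpose == "prefetch" and should_reject(environ):
            body = b"Prefetch not allowed\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Anything else is passed through to the wrapped application.
        return app(environ, start_response)
    return wrapped
```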

Network Administrators

Network administrators can control how the feature works on their network, as described here. The purpose-specific domain name used to trigger navigation-time DNS resolution is 'dns-tunnel-check.googlezip.net'.

Expanding Beyond Google Search

We think that the opportunity to speed up cross-origin navigations would be appealing to many websites and that the resulting low-friction discovery experiences would benefit users and the web. Because of these beliefs, we aspire to make this feature available to all websites.

However, in this case, the proxy would learn the host names of links on non-Google websites, which requires user notice and control. We are considering adding a one-time user opt-in by which users can inform Chrome that they would like to prefetch from non-Google sites via the proxy.

Before we move forward with this proposal, we’d like to discuss the following aspects with the community:

  1. Confirm the level of interest in the feature from other sites on the web.
  2. Confirm that interested parties are comfortable with the proposed opt-in model.

pgl commented 3 years ago

Network administrators can control how the feature works on their network, as described here.

What is the purpose-specific domain?

Also will the proxies be configured as domains or IP addresses? Local control may be preferable for users via domain blocking.

buettner commented 3 years ago

Sorry about that. I added the domain to the main post (dns-tunnel-check.googlezip.net).

The proxy is configured as a domain -- tunnel.googlezip.net.

buettner commented 2 years ago

Tl;dr

We will soon start an opt-in Early Access Program with the goal of helping publishers evaluate the technology and provide feedback to inform our plans.

Early Access Program for Chrome’s Private Prefetch Proxy

To help interested publishers assess this feature and provide feedback, we’ll soon begin an opt-in Early Access Program (EAP) for Chrome’s Private Prefetch Proxy on Android. During the EAP, this feature will be trialed only on Google Search to prefetch links to websites participating in the EAP.

User Opt-out

Users can always opt out of Google Search result prefetching via Chrome’s Preload setting. The feature will also be disabled for pages loaded in Incognito mode. From our most recent experiment, we found that the byte overhead from unused prefetches was far less than 1% of overall user traffic.

Publisher EAP Opt-in

Interested publishers will need to indicate their desire to participate in the EAP by creating a traffic advice file that includes a dedicated EAP field (google_prefetch_proxy_eap).

Example:

[
    {
        "user_agent": "prefetch-proxy",
        "google_prefetch_proxy_eap": { "fraction": 1.0 }
    }
]

fraction should be a value between 0.0 and 1.0 (i.e. 0% to 100%). This field controls the fraction of requested prefetches that the Private Prefetch Proxy will send to the destination site (the remainder will be dropped by the proxy). EAP participants may want to start with a smaller fraction (e.g. 0.1), to monitor their key metrics, and gradually ramp up to 1.0.
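
As a rough illustration of the ramp-up described above, the sketch below generates the EAP file for a given fraction and simulates how many prefetches would reach the site. The function names are hypothetical, and the random sampling is only an assumption about how the proxy might apply the fraction; the thread does not describe the actual mechanism.

```python
import json
import random


def eap_traffic_advice(fraction):
    """Build the EAP traffic advice file contents for a given fraction."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    return json.dumps([{
        "user_agent": "prefetch-proxy",
        "google_prefetch_proxy_eap": {"fraction": fraction},
    }], indent=4)


def forwarded_prefetches(requested, fraction, rng):
    """Count how many of `requested` prefetches survive random sampling.

    Assumption: each prefetch is kept independently with probability
    `fraction`; the rest are treated as dropped by the proxy.
    """
    return sum(1 for _ in range(requested) if rng.random() < fraction)


# Example ramp: start small, watch key metrics, then increase.
for step in (0.1, 0.5, 1.0):
    kept = forwarded_prefetches(10_000, step, random.Random(0))
    print(f"fraction={step}: about {kept} of 10000 prefetches forwarded")
```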

We recommend that interested publishers join this announcement mailing list to receive key updates about the EAP (e.g. starting, gradual rollout, observations on our end). For support inquiries, we’ve created this support mailing list.

Impact

From our most recent experiment, we observed that the vast majority of websites saw less than a 2% increase in main HTML fetches, and a 20+% faster LCP when a prefetched resource was used. (Note that this feature is Android only, so the overall increase in traffic is much lower.)

Geo-blocking

For the EAP, we will only have egress IPs in a few countries, and users are mapped to the IP range that is closest to their ingress IP.

We recommend one of the following approaches:

  1. If the content subject to geo-blocking is hosted on a specific origin, you can limit your participation in the EAP to the origins that are known to be free of any geo-blocking. This can be done by placing the necessary traffic advice on the origins without geo-blocking, and doing nothing for the origins with geo-blocking.
  2. Prefetch requests from the proxy can be identified by checking for the Purpose: prefetch header, and doing a reverse DNS lookup on the IP addresses. The proxy IP addresses will resolve to XYZ.fetch.tunnel.googlezip.net where XYZ depends on the specific IP address. Publishers can reject the prefetch request if it is trying to access content subject to geo-blocking.
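
The second approach could be sketched as follows in Python. The helper names are hypothetical; only the `Purpose: prefetch` header and the `fetch.tunnel.googlezip.net` suffix come from the text above. The sketch does not forward-confirm the reverse DNS result, which a stricter check might add.

```python
import socket

# Suffix the proxy's egress IPs are said to resolve to (XYZ.fetch.tunnel.googlezip.net).
PROXY_SUFFIX = ".fetch.tunnel.googlezip.net"


def is_proxy_hostname(hostname):
    """True if a reverse-DNS name looks like a Private Prefetch Proxy egress host."""
    return hostname.lower().rstrip(".").endswith(PROXY_SUFFIX)


def is_private_prefetch(purpose_header, remote_ip, reverse_lookup=socket.gethostbyaddr):
    """Check the Purpose header first, then do a reverse DNS lookup on the IP.

    `reverse_lookup` is injectable so the check can be tested without network
    access; by default it uses socket.gethostbyaddr.
    """
    if (purpose_header or "").strip().lower() != "prefetch":
        return False
    try:
        hostname = reverse_lookup(remote_ip)[0]
    except OSError:
        # No PTR record or resolver failure: treat as not from the proxy.
        return False
    return is_proxy_hostname(hostname)
```

A publisher could then reject the request only when `is_private_prefetch(...)` is true and the requested content is subject to geo-blocking.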

Network Administrators

Network administrators can always control how the feature works on their network, as described here.

RobertMiller23 commented 2 years ago

This is great news. I've started looking into how we can enroll in this early access program for our web frontend. However, before I start, I have a few questions below:

During the EAP, this feature will be trialed only on Google Search to prefetch links to websites participating in the EAP.

Would the proxy share the user's cookies with our web frontend? If not, then we run the risk of showing incorrect content to the user and annoying them. How do we avoid that on our web frontend?

From our most recent experiment, we observed that the vast majority of websites saw less than a 2% increase in main HTML fetches, and a 20+% faster LCP when a prefetched resource was used.

Would it be possible for you to provide data on how much was the average performance improvement as measured using CWV metrics? My understanding is that the 2% number corresponds to all HTML fetches to a specific site. However, the 20% number refers to only the successful prefetches? The two numbers are evaluated over very different datasets which makes it harder to make any tradeoff decisions. We're worried about the prefetch costs, so any data related to average CWV improvement would help us make a better case to the top brass.

PS: I'm using a personal github account because I do not yet want my comments to be associated with my employer.

buettner commented 2 years ago

Happy to hear you're interested!

Would the proxy share user's cookies with our web frontend? If not, then we run into the risk of showing incorrect content to the user and annoying them. How do we avoid that on our web frontend?

User cookies will never be sent on prefetch requests for privacy reasons. For correctness reasons, Chrome can't naively use a prefetched resource if it should have had a cookie on the request (as you mentioned). This means that prefetching is only effective when the user does not have a cookie for the origin, which is common for cross-origin navigations. Moreover, a navigation without a cookie is often the user's first visit to the site, which tends to be slower than average as the user has no cached resources. I.e., speeding up first-visits is often more important than speeding up subsequent visits.

However, we do have a proposal that allows sites to tell Chrome that the HTML is not dynamically generated based on the cookie and is safe to use even if prefetched without a cookie. Once the user navigates to the page, cookies will be sent on subsequent requests.

If this proposal might work for you, we'd be very interested in hearing your feedback!

Would it be possible for you to provide data on how much was the average performance improvement as measured using CWV metrics?

On average, LCP improved by ~3%. Though we hope to improve this with better triggering.

RobertMiller23 commented 2 years ago

On average, LCP improved by ~3%. Though we hope to improve this with better triggering.

From our most recent experiment, we observed that the vast majority of websites saw less than a 2% increase in main HTML fetches, and a 20+% faster LCP when a prefetched resource was used.

Thanks for the quick reply and thanks for explaining. It's important for us to get the details right so we can make the right tradeoffs among the engineering, traffic costs and CWV gains.

If the increase in main HTML fetches is 2%, then even with the assumption of 100% precision, the prefetch should speed up at most 2% of the webpages. Even if we optimistically assume speedup of 100% (instead of the actual 20% speedup) for those 2% page loads, that translates to an average of 2% LCP improvement. In practice with lower precision and 20% speedup (instead of 100%), the LCP improvement should be much lower. What am I missing?

Chrome can't naively use a prefetched resource if it should have had a cookie on the request (as you mentioned). This means that prefetching is only effective when the user does not have a cookie for the origin, which is common for cross-origin navigations.

we do have a proposal that allows sites to tell Chrome that the HTML is not dynamically generated

Does this mean that Chrome will be prefetching in many cases but not actually using the prefetched resource? Does that further lower the precision, so that prefetching is useful only once per Chrome install unless we rewrite our frontend?

buettner commented 2 years ago

What am I missing?

Sorry, that was my fault. I still didn't give you numbers across the same populations.

3% is the improvement in LCP across all navigations coming from Google Search.

The challenge is that aggregate performance impact will vary dramatically across sites, depending on how much of their traffic comes from Search. Some sites with rich content primarily have same-origin navigations, whereas others are primarily landing pages that get much of their traffic from Search. Also, some sites value the navigations from Search (users discovering their site) more highly than subsequent user navigations.

The question we primarily wanted to answer was how much additional traffic this will impose on users, ISPs, and origins. The answer is that it's not much in aggregate, and it's very rare for any site to see a large increase in requests.

This is one of the purposes of the EAP -- it gives sites a way to slowly ramp up prefetch traffic while evaluating their own overhead/performance metrics.

Does this mean that Chrome will be prefetching in many cases but not actually using up the prefetched resource?

Yes. Though we have plans to reduce this additional overhead. If you want the performance improvement for users who have visited your site before (assuming you set a cookie when they do), that will require changes on your frontend. Potentially, this could be as simple as adding the 'Supports-Loading-Mode: uncredentialed-prefetch' header (note: this is not yet implemented, but we can prioritize it if there is broad interest). But it depends on your site and how you use cookies.

RobertMiller23 commented 2 years ago

Thanks @buettner. Looking at our server logs, we get ~8% of Chrome traffic from Google, and the rest from users clicking on links etc. I wrote a quick simulation to measure the CWV impact if we speed up 8% of page loads by 3%. My simulation shows about a 0.2% reduction in LCP. Does that sound reasonable to you?

Do you have improvement numbers for other CWV metrics? e.g., First Input Delay or CLS?
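
As a back-of-the-envelope check on the estimate in this comment, the sketch below multiplies the two figures. It assumes navigations from Search have the same baseline LCP as all other navigations and looks only at the mean, so it is not the commenter's simulation and says nothing about percentile-based CWV reporting. The function name is hypothetical.

```python
def aggregate_lcp_reduction(share_from_search, reduction_on_search):
    """Site-wide mean LCP reduction when only Search navigations get faster.

    Assumes Search navigations have the same baseline LCP as other navigations.
    """
    return share_from_search * reduction_on_search


# 8% of page loads come from Search and are sped up by 3% on average.
print(f"{aggregate_lcp_reduction(0.08, 0.03):.2%}")  # prints 0.24%
```

Under those assumptions the result is in line with the roughly 0.2% figure reported from the simulation.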

buettner commented 2 years ago

That sounds reasonable. Though, your results may vary, as the impact is not consistent across sites.

We did not see significant changes in CLS or FID.

I'm happy to hear you're interested in the EAP! We'll keep you posted on timing via the mailing list.

gladenjoy commented 2 years ago

I'm giving feedback because I ran into a problem after I deployed the traffic advice file and opted in to the EAP to enable prefetching from Google Search.

Some of the pages are geo-restricted by IP, so the pages will return status code 403 if Purpose: prefetch is in the request header, as documented below.

Publishers should look for the Purpose: prefetch request header and respond with an HTTP 403 (Forbidden) (see Geolocation for an example use case). https://github.com/buettner/private-prefetch-proxy#publisher-opt-out

If Chrome's prefetch setting is enabled, when you type a URL directly into the address bar, not just from Google search, it will be prefetched and the Purpose: prefetch will be included in the request header. https://support.google.com/chrome/answer/1385029?hl=en&co=GENIE.Platform%3DAndroid&oco=2

We have confirmed that this causes a 403 page to be displayed when the URL is entered into the address bar. I guess I need to do a DNS lookup and look at the address as described in this issue.

jeremyroman commented 2 years ago

Filed https://bugs.chromium.org/p/chromium/issues/detail?id=1284708 to investigate this on the Chromium side.

buettner commented 2 years ago

Thanks for the feedback!

It seems like showing the error page is a bug, and we're following up on the bug jeremyroman filed.

Before launch, country-level IP geolocation will work as expected. But during the EAP, and in the future if you need finer granularity, the DNS lookup of the address is currently the only way to determine if the prefetch is from the proxy. However, we are hoping to make this detection easier in the future. We will update here when we have more details.

buettner commented 2 years ago

The Early Access Program (EAP) is now live. If you are a publisher and have opted in to the program by adding a traffic-advice file (described here), you should start seeing traffic from Chrome versions M97 and higher.

If you haven’t opted in yet but are interested in the feature, please consider joining the program (and join the update mailing list)!

If you have any questions, don't hesitate to reach out on the support mailing list.

We look forward to your feedback!

spelchat commented 2 years ago

Note that the description above omits one requirement: the .well-known/traffic-advice response must have an application/trafficadvice+json MIME type (set via the Content-Type header), as mentioned in the traffic-advice proposal.
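
A minimal sketch of serving the file with that Content-Type, using only Python's standard library. The handler and helper names are hypothetical, and the payload is the EAP example from earlier in this thread; in practice the header would usually be set in the web server or CDN configuration instead.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Payload taken from the EAP example earlier in the thread.
TRAFFIC_ADVICE = json.dumps(
    [{"user_agent": "prefetch-proxy", "google_prefetch_proxy_eap": {"fraction": 1.0}}]
).encode("utf-8")


def traffic_advice_response(path):
    """Return (status, headers, body) for a GET request to `path`."""
    if path == "/.well-known/traffic-advice":
        headers = {
            "Content-Type": "application/trafficadvice+json",
            "Content-Length": str(len(TRAFFIC_ADVICE)),
        }
        return 200, headers, TRAFFIC_ADVICE
    return 404, {"Content-Length": "0"}, b""


class TrafficAdviceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, headers, body = traffic_advice_response(self.path)
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)


# To try it locally (blocks until interrupted):
#   HTTPServer(("127.0.0.1", 8000), TrafficAdviceHandler).serve_forever()
```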

jeremyroman commented 2 years ago

For convenience, https://traffic-advice-checkup.netlify.app/ can be used as a quick diagnostic to catch some simple errors and summarize the expected behavior of the private prefetch proxy. (It's a separate app, but written with reference to the actual prefetch proxy source.)

buettner commented 2 years ago

Over the coming weeks, we will begin rolling out the “private prefetch proxy” feature for Chrome 103 on Android. This feature results in a median 30% LCP improvement when a prefetch is used. While the majority of sites will see less than a 2% increase in HTML fetches (and a much lower increase in overall bytes, as images and other resources are not prefetched), if you wish to limit the amount of prefetch traffic sent to your site, you may need to update your traffic-advice file. In particular, if you’ve specified the "google_prefetch_proxy_eap" parameter, it will need to be replaced with the “fraction” parameter.

E.g.,

[
    {
        "user_agent": "prefetch-proxy",
        "fraction": 0.5
    }
]

ravau commented 1 year ago

Very nice tech. My question: after I updated my traffic advice, I checked my server logs to see whether Google Search had requested the traffic advice file in the past, but it hadn't. Why is that? Site traffic is PL/EU.

buettner commented 1 year ago

The traffic advice file is fetched only when Chrome attempts to prefetch a page from the site.

If you see requests to your site via the proxy, then you should also see a fetch for the traffic advice file.

vanders2023 commented 1 year ago

How do you stop fetch.tunnel.googlezip.net? I set fraction to 0 in .well-known/traffic-advice, but it still continues to access my site. I ended up returning 401 errors on all accesses from the IPs listed here: https://www.gstatic.com/chrome/prefetchproxy/prefetch_proxy_geofeed. It has been two months and it continues to access my site, consuming bandwidth and filling up my error logs. For example, yesterday I had 108,492 normal accesses and 247,885 accesses blocked from those IPs.

buettner commented 1 year ago

Sorry you're having this problem.

Looking at your traffic advice file, it looks like you're still using the Early Access Program token.

Can you update the traffic advice file to align with the one in this comment?

Also, you can check the expected behavior and make sure there are no errors in the config using this test app.
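
The update requested here could be scripted roughly as follows. This is a sketch under the assumption that the only change needed is moving the fraction out of the "google_prefetch_proxy_eap" object into a top-level "fraction" field, as in the example from the rollout announcement; the function name is hypothetical.

```python
import json


def migrate_traffic_advice(text):
    """Rewrite EAP-style traffic advice entries to use a top-level "fraction"."""
    migrated = []
    for entry in json.loads(text):
        entry = dict(entry)  # copy so the parsed input is left untouched
        eap = entry.pop("google_prefetch_proxy_eap", None)
        # Carry the old EAP fraction over unless a new-style value already exists.
        if isinstance(eap, dict) and "fraction" in eap and "fraction" not in entry:
            entry["fraction"] = eap["fraction"]
        migrated.append(entry)
    return json.dumps(migrated, indent=4)


old = '[{"user_agent": "prefetch-proxy", "google_prefetch_proxy_eap": {"fraction": 0}}]'
print(migrate_traffic_advice(old))
```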

vanders2023 commented 1 year ago

Thanks, Buettner. I changed it to the new format. I also ran the test app. Before it told me that the EAP (current) was 0%, but FUTURE was 100%. Now FUTURE is also 0%, so hopefully this will fix the problem.

ravau commented 1 year ago

I allowed private prefetch proxy traffic about a week ago, but I see only a few requests a day from the proxy, although my site has quite a lot of traffic and 90% of it originates from Chrome. I have quite a few top-1 rankings in Google, so I thought the prefetch proxy would be used more often. Site traffic is from Poland.

vanders2023 commented 1 year ago

Thanks, Buettner. I changed it to the new format. I also ran the test app. Before it told me that the EAP (current) was 0%, but FUTURE was 100%. Now FUTURE is also 0%, so hopefully this will fix the problem.

I checked this morning, and there were still lots of accesses yesterday. What IP does it use to read the traffic advice? If it is one of those that I am blocking, then it won't be able to read it.

buettner commented 1 year ago

Ah, yes. The traffic advice fetches come from the same set of IPs.
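
Since the traffic advice fetches come from the same IPs, one way to avoid the situation above is to keep the advice file reachable while blocking everything else. The sketch below assumes the published IP list is a comma-separated feed with the IP prefix in the first field, which should be verified against the actual file. The function names are hypothetical, and the sample prefixes are reserved documentation ranges, not real proxy addresses.

```python
import ipaddress


def parse_geofeed(text):
    """Parse geofeed-style CSV lines (prefix in the first field) into networks."""
    networks = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        prefix = line.split(",")[0].strip()
        networks.append(ipaddress.ip_network(prefix, strict=False))
    return networks


def should_block(remote_ip, path, proxy_networks):
    """Block proxy IPs, but always let them read the traffic advice file."""
    if path == "/.well-known/traffic-advice":
        return False
    address = ipaddress.ip_address(remote_ip)
    return any(address in network for network in proxy_networks)


# Hypothetical sample feed using documentation-only address ranges.
SAMPLE_FEED = "192.0.2.0/24,US,,,\n2001:db8::/32,DE,,,\n"
networks = parse_geofeed(SAMPLE_FEED)
print(should_block("192.0.2.7", "/index.html", networks))                  # True
print(should_block("192.0.2.7", "/.well-known/traffic-advice", networks))  # False
```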

vanders2023 commented 1 year ago

Oh, no. So is there any way I can stop it without incurring thousands of prefetches a day? Why did it start prefetching my site? Shouldn't the default be no prefetching, unless I specify that I want it in the traffic advice?

vanders2023 commented 1 year ago

So I am returning 403 errors (not 401, as I previously stated) to accesses from those IPs. Shouldn't the prefetch robot stop trying to prefetch?

vanders2023 commented 1 year ago

So I am returning 403 errors (not 401, as I previously stated) to accesses from those IPs. Shouldn't the prefetch robot stop trying to prefetch?

So, are there any plans to fix this?

buettner commented 1 year ago

Sorry for the delay. I was on vacation and Monday was a holiday.

It looks like our service was getting a DNS error when fetching your traffic advice. Not sure why that would be the case, but it seems likely it's related to the filtering you set up for specific IP blocks.

In any case, I disabled the feature for your domain.

vanders2023 commented 1 year ago

In any case, I disabled the feature for your domain.

Thank you very much, Buettner! I saw a big reduction yesterday in the number of accesses from those IPs.

elboresLLS commented 1 year ago

I need to minimize hits from these IPs too. I tried to add the traffic advice file, but the validation tells me that it needs the MIME content type application/traffic-advice+json (or something similar). How can I upload it correctly with this format? There are too many prefetch hits from these IPs. A few are OK, but sometimes (depending on the hour of the day) I get more than 15 per minute. That's crazy.

EDIT: I got it working. Let's see if the number of visits goes down. I have just set fraction to 0.1.

xvilo commented 8 months ago

@buettner in your comment from Feb 18, 2022 you mentioned the update mailing list. When following that URL I get a "content not available" error page on Google Groups. Is this still a thing to follow, or has it been phased out?

buettner commented 8 months ago

It has been phased out as the feature has graduated from the Early Access Program.

Yogurteg commented 5 months ago

Is this prefetching used only for the HTML that is parsed and shown to the client, or does it also prefetch the website's static CDN assets, external scripts, fonts, etc.?

buettner commented 5 months ago

Correct, only the main-frame HTML is prefetched.