WICG / compression-dictionary-transport

Compression dictionary transport

What is this?

This explainer outlines the benefits of compression dictionaries, details the different use cases for them, and then proposes a way to deliver such dictionaries to browsers to enable these use cases.

The HTTP headers and negotiation are specified in the IETF Draft document for Compression Dictionary Transport.

Summary

This proposal adds support for using designated previous responses as an external dictionary for HTTP responses for compression schemes that support external dictionaries (e.g. Brotli and Zstandard).

HTTP Content-Encoding is extended with new encoding types and support for allowing responses to be used as dictionaries for future requests. All actual header values and names still TBD:

For interop reasons, dictionary-based compression is only supported in secure contexts (similar to Brotli compression).

There are also some browser-specific features independent of the transport compression:

Background

What are compression dictionaries?

Compression dictionaries are bits of compressible content known ahead of time. Compression engines use them to reduce the size of compressed content.

Because they are known ahead of time, the compression engine can refer to the content in the dictionary when representing the compressed content, reducing the size of the compressed payload. The decompression engine can then interpret the content based on that pre-defined knowledge.

Taken to the extreme, if the content to compress is identical to the dictionary, the entire delivered content can be a few bytes referring to the dictionary.
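
As a rough illustration of this idea, the sketch below uses the preset-dictionary support in Python's standard zlib module as a stand-in for Brotli or Zstandard. That substitution is an assumption made only so the example runs without third-party packages, and the sample content and helper names are hypothetical.

```python
import zlib


def compress(data: bytes, dictionary: bytes = b"") -> bytes:
    """Deflate `data`, optionally priming the compressor with a dictionary."""
    if dictionary:
        compressor = zlib.compressobj(level=9, zdict=dictionary)
    else:
        compressor = zlib.compressobj(level=9)
    return compressor.compress(data) + compressor.flush()


def decompress(blob: bytes, dictionary: bytes = b"") -> bytes:
    """Inflate `blob`; the same dictionary must be supplied if one was used."""
    if dictionary:
        decompressor = zlib.decompressobj(zdict=dictionary)
    else:
        decompressor = zlib.decompressobj()
    return decompressor.decompress(blob) + decompressor.flush()


# Hypothetical content: a script and a slightly edited later version of it.
old_version = "".join(
    f"function handler{i}(event) {{ return dispatch({i * 7919 % 1000}, event); }}\n"
    for i in range(200)
).encode()
new_version = old_version.replace(b"handler42(", b"handler42v2(")

without_dictionary = compress(new_version)
with_dictionary = compress(new_version, dictionary=old_version)

# Sizes of the raw payload, the plain deflate stream, and the dictionary-primed stream.
print(len(new_version), len(without_dictionary), len(with_dictionary))
```

Both sides have to hold the exact same dictionary bytes: decompressing the dictionary-primed stream without the dictionary fails, which is why the rest of this proposal is about how client and server agree on which dictionary is in use.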

Now, you may ask, if dictionaries are so awesome, then...

Why aren't browsers already using compression dictionaries?

To some extent, they are. The Brotli compression scheme includes a built-in dictionary that was built to work reasonably well for HTML, CSS and JavaScript. Custom (shared) dictionaries have a more complicated history.

At some point, Chrome did support a shared compression dictionary. When Chrome was first released, it supported a dictionary compression method called SDCH (Shared-dictionary Compression over HTTP). That support was unshipped in 2016 due to complexities around the protocol’s implementation, specification and lack of an interoperability story.

SDCH enabled Chrome and Chromium-based browsers to create origin-specific dictionaries that were downloaded once for the origin and enabled multiple pages to be compressed at significantly higher compression ratios. That's one use case for compression dictionaries, which we will call the "Shared dictionary" use case.

There's another major use case for shared dictionaries that was never supported by browsers - delta compression.

That use-case would enable the browser to reuse past resources (e.g. your site's main JS v1.2) in order to compress future ones (e.g. main JS v1.3). But traditionally, this use-case raised complexities around the abilities of the browser to coordinate its cache state with the server, and agree on what the dictionary would be. It also raised issues with both sides having to store all past versions of each resource in order to successfully be able to compress and decompress it.

The common thread is that the use of compression dictionaries had run into various complexities over the years which resulted in deployment issues.

This time will be different

A few things about this current proposal are different from past attempts, in ways we're hoping are meaningful:

Use cases

Compression types

There are two primary models for using shared dictionaries that are similar but differ in how the dictionary is fetched:

In both cases the client advertises the best-available dictionary that it has for a given request. If the server has a delta-compressed version of the resource, compressed with the advertised dictionary, it can just send that delta-compressed diff. It can also use that advertised dictionary (if available) to dynamically compress that resource.

With the Delta compression use case, a previously-downloaded version of the resource is available to use for future requests as a dictionary. For example, with a JavaScript file, v1 of the file may be in the browser's cache and available for use as a dictionary to use when fetching v2 so only the difference between the two needs to be transmitted.

In the Shared dictionary use case, the dictionary is a purpose-built dictionary that is fetched using a <link> tag and can be used for future requests that match the URL pattern covered by the dictionary. For example, on a first visit to a site, the HTML response references a custom dictionary that should be used for document fetches for that origin. The dictionary is downloaded at some point by the browser and, on future navigations through the site, is advertised as being available for document requests that match the URL pattern that the dictionary applies to.

Risks

Security

The Shared Brotli draft does a good job describing the security risks. In summary:

Privacy

Dictionaries will need to be cached using a triple key (top-level site, nested context site, URL) similar to other cached resources (or any other partitioning scheme that’s good enough for cached resources and cookies from a privacy and security perspective). That’s not an issue for the delta compression use case, but can become a burden fast for the out-of-band dictionaries, as multiple nested contexts may need to download the same dictionary multiple times.

Note: Common payload caching may be useful in such cases.

There’s also the issue of users advertising resource versions in their cache to servers as part of the request. This already has a precedent in cache validators (ETags, If-Modified-Since), so maybe that’s fine, given that the cache is partitioned.

Adverse performance effects

Downloading an out-of-band dictionary means that the site owner is making a certain bet regarding the number of visits that would enable the user to amortize that dictionary’s cost.

At worst, if the user never visits the site again until the dictionary’s lifetime expires, the user has paid the cost of downloading the dictionary with no benefits.

For some large and heavily trafficked sites, that case is rare. For others, it’s extremely common, and we should be wary of both the tools we’d be putting in developers’ hands, as well as the messaging we’re providing them regarding when to use them.

Proposal

Static resources flow

In this flow, we’re reusing static resources themselves as dictionaries that would be used to compress future updates of themselves, or similar resources.

Dynamic resources flow

Dictionary options header

The Use-As-Dictionary: response header is a structured field dictionary that allows for setting multiple options and for future expansion. The supported options and defaults are:

For example: use-as-dictionary: match="/app1/main*", match-dest=("script"), id="xxx" would specify matching on a path prefix of /app1/main for script requests and to send Dictionary-ID: "xxx" for any requests that match the dictionary.
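
To make the shape of that header value concrete, here is a deliberately simplified parsing sketch. It only understands members whose value is a quoted string or a parenthesised list of quoted strings, which covers the example above; it is not a complete structured-field parser, and the function name is hypothetical.

```python
import re
from typing import Dict, List, Union

OptionValue = Union[str, List[str]]

# One `key=value` member, where the value is either a quoted string or an
# inner list of quoted strings, followed by a comma or the end of the input.
_MEMBER = re.compile(
    r'\s*([a-z*][a-z0-9_.*-]*)='
    r'("(?:[^"\\]|\\.)*"|\((?:[^)"]|"(?:[^"\\]|\\.)*")*\))'
    r'\s*(?:,|$)'
)
_QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')


def _unescape(text: str) -> str:
    """Remove backslash escapes from the inside of a quoted string."""
    return re.sub(r"\\(.)", r"\1", text)


def parse_use_as_dictionary(header_value: str) -> Dict[str, OptionValue]:
    """Parse the limited subset of Use-As-Dictionary shown in this document."""
    options: Dict[str, OptionValue] = {}
    position = 0
    while position < len(header_value):
        member = _MEMBER.match(header_value, position)
        if member is None:
            raise ValueError(f"unsupported syntax at offset {position}")
        key, raw = member.group(1), member.group(2)
        if raw.startswith("("):
            options[key] = [_unescape(item) for item in _QUOTED.findall(raw)]
        else:
            options[key] = _unescape(raw[1:-1])
        position = member.end()
    return options


parsed = parse_use_as_dictionary('match="/app1/main*", match-dest=("script"), id="xxx"')
print(parsed)
```

A production implementation would more likely rely on a full structured-field parser so that parameters, tokens and other value types are handled as well.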

Compression algorithms

The dictionary negotiation is independent of the compression algorithm that is used for compressing the HTTP response and is designed to support any compression scheme that supports using external compression dictionaries. Currently that includes Brotli and Zstandard but it is not limited to those (and depends on what the client and server both support). It is likely that, in the future, content-specific compression schemes that handle delta-compression better may be built (e.g. code-aware Wasm compression).

The compression algorithm negotiation uses the regular Accept-Encoding:/Content-Encoding: negotiation that is used for non-dictionary compression. It is important that new names are registered with the HTTP Content Coding Registry for algorithms that use an external dictionary to prevent situations where processing along the request flow may attempt to decode a response using just the algorithm without being dictionary-aware. That way, if anything in the request flow needs to operate on the decoded content, it can either be made aware of the dictionary-based compression or it can modify the Accept-Encoding: request header to only support schemes that it is aware of (already common practice).

The examples in this document will use br-d for dictionary-based Brotli compression but the actual algorithm(s) negotiated could be anything that the client supports.

Compression API

The compression API can also expose support for using caller-supplied dictionaries but that is out-of-scope for this proposal.

Websockets

Websocket support is out-of-scope for this proposal but there is nothing in the current dictionary negotiation that precludes websockets from being able to build dictionary-based compression (either by leveraging parts of what is provided here or building something separate).

Security and Privacy

Dictionary and Resource readability (CORS)

Since the contents of the dictionary and compressed resource are both effectively readable through side-channel attacks, this proposal makes it explicit and requires that both be CORS-readable from the document origin. The origin for the URL the dictionary was served from and the origin of the match pattern for URLs MUST be the same (i.e. the dictionary and compressed resource must both be from the same origin).
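
The same-origin requirement can be sketched as a check a client might run before advertising a dictionary for a request. This is an illustration under stated assumptions, not the specified matching algorithm: the `*` handling is a plain wildcard, only the path is compared, default ports are not normalised, and all function names are hypothetical.

```python
import re
from typing import Tuple
from urllib.parse import urlsplit


def origin_of(url: str) -> Tuple[str, str]:
    """Return a simplified (scheme, host[:port]) origin for `url`."""
    parts = urlsplit(url)
    return (parts.scheme.lower(), parts.netloc.lower())


def wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a pattern where `*` matches any run of characters."""
    return re.compile(
        "".join(".*" if char == "*" else re.escape(char) for char in pattern)
    )


def dictionary_applies(dictionary_url: str, match: str, request_url: str) -> bool:
    """True if the request is same-origin with the dictionary and its path matches."""
    if origin_of(dictionary_url) != origin_of(request_url):
        return False
    path = urlsplit(request_url).path
    return wildcard_to_regex(match).fullmatch(path) is not None


print(dictionary_applies(
    "https://static.example.com/app/main.js/123",
    "/app/main.js*",
    "https://static.example.com/app/main.js/125",
))
```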

For dictionaries and resources that are same-origin as the document, no additional requirements exist as both are CORS-readable from the document context. For navigation requests, their resource is by definition same-origin as the document their response will eventually commit. As a result, the dictionaries that match their URL pattern are similarly same-origin.

Dictionaries and resources served from a different origin than the document must be CORS-readable from the document origin (e.g. Access-Control-Allow-Origin: <document origin or *>). This means that any cross-origin content that is fetched in no-cors mode by default must enable CORS-fetching (usually with the crossorigin attribute).

When sending a CORS request with an available dictionary, a browser should only include the Available-Dictionary: header if it is also sending the sec-fetch-mode: header so a CORS-readable decision can be made on the server before responding.

In order to prevent sending dictionary-compressed responses that the client will not be able to process, when a server receives a request with sec-fetch-mode: cors as well as an Available-Dictionary: header, it should only use the dictionary if the response includes an Access-Control-Allow-Origin: response header that covers the origin of the page the request was made from, either because Access-Control-Allow-Origin: * covers all origins or because Access-Control-Allow-Origin: names the origin given in the origin: or referer: request header. If there is no origin: or referer: request header and Access-Control-Allow-Origin: is not *, the dictionary should not be used.
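
The server-side decision for the sec-fetch-mode: cors case could be sketched as follows. The function names are hypothetical, the sketch assumes Access-Control-Allow-Origin: carries either * or a single origin, and it derives an origin from referer: by keeping only the scheme and host.

```python
from typing import Optional
from urllib.parse import urlsplit


def requesting_origin(origin: Optional[str], referer: Optional[str]) -> Optional[str]:
    """Pick the requester's origin from `origin:`, falling back to `referer:`."""
    if origin:
        return origin.strip()
    if referer:
        parts = urlsplit(referer.strip())
        if parts.scheme and parts.netloc:
            return f"{parts.scheme}://{parts.netloc}"
    return None


def cors_allows_dictionary(
    allow_origin: Optional[str],
    origin: Optional[str] = None,
    referer: Optional[str] = None,
) -> bool:
    """Decide whether a dictionary may be used for a `sec-fetch-mode: cors` request."""
    if allow_origin is None:
        return False  # the response would not be CORS-readable at all
    allowed = allow_origin.strip()
    if allowed == "*":
        return True  # every origin can read the response
    requester = requesting_origin(origin, referer)
    return requester is not None and requester == allowed


print(cors_allows_dictionary("https://www.example.com", origin="https://www.example.com"))
```

When the function returns False, the server would fall back to a response that does not depend on the dictionary.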

To discourage encoding user-specific private information into the dictionaries, any out-of-band dictionaries fetched using a <link> will be uncredentialed fetches.

These protections against compressing opaque resources make CORB and ORB considerations unnecessary as they are specific to protecting opaque resources.

Fingerprinting

The existence of a dictionary is effectively a cookie for any requests that match it and should be treated as such:

The existence of support for dictionary-based Accept-Encoding: has the potential to leak client state information if not applied consistently. If the browser supports dictionary-based compression algorithms, then that support should always be advertised, independent of the current state of the feature. Specifically, this means that in any private browsing mode (Incognito in Chrome), dictionary-based algorithm support should still be advertised even if the dictionaries will not persist, so that the state of the private browsing mode is not exposed.

Triggering dictionary fetches

The explicit fetching of a dictionary through a <link rel=dictionary> tag or Link: header is functionally equivalent to <link rel=preload> with different priority and should be treated as such. This means that the Link: header is only effective for document navigation responses and cannot be used for subresource loads.

This prevents passive resources, like images, from using the dictionary fetch as a side-channel for sending information.

Cache/CDN considerations

Any caches between the server and the client will need to be able to support Vary on both Accept-Encoding and Available-Dictionary, otherwise the responses will be either corrupt (in the case of serving a dictionary-compressed resource with the wrong dictionary) or ineffective (serving a non-dictionary-compressed resource when dictionary compression was possible).
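
One way to picture the Vary requirement is as a cache key that includes the value of every request header the response varies on. The sketch below is a simplified model of that idea with a hypothetical `cache_key` helper; real caches normalise header values in more elaborate ways.

```python
from typing import Mapping, Tuple


def cache_key(
    url: str, request_headers: Mapping[str, str], vary: str
) -> Tuple[str, Tuple[Tuple[str, str], ...]]:
    """Build a cache key from the URL plus each request header named in Vary."""
    lowered = {name.lower(): value.strip() for name, value in request_headers.items()}
    varied_names = sorted(
        name.strip().lower() for name in vary.split(",") if name.strip()
    )
    return (url, tuple((name, lowered.get(name, "")) for name in varied_names))


vary = "Accept-Encoding,Available-Dictionary"
url = "https://static.example.com/app/main.js/125"

# A client that advertises a dictionary and one that does not.
with_dictionary = cache_key(
    url,
    {
        "Accept-Encoding": "br-d,br,gzip",
        "Available-Dictionary": ":pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:",
    },
    vary,
)
without_dictionary = cache_key(url, {"Accept-Encoding": "br,gzip"}, vary)

print(with_dictionary != without_dictionary)
```

Because the two requests map to different keys, a cache that honours Vary stores the dictionary-compressed and the regular response separately and never serves one in place of the other.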

Any middle-boxes in the request flow will also need to support the dictionary-compressed content-encoding, either by passing it through unmodified or by managing the appropriate dictionaries and compressed resources.

Examples

Bundled JavaScript on separate origin

In this example, www.example.com will use a bundle of application JavaScript that they serve from a separate static domain (static.example.com). The JavaScript files are versioned and have a long cache time, with the URL changing when a new version of the code is shipped.

On the initial visit to the site:

sequenceDiagram
Browser->>www.example.com: GET /
www.example.com->>Browser: ...<script src="//static.example.com/app/main.js/123" crossorigin>...
Browser->>static.example.com: GET /app/main.js/123<br/>Accept-Encoding: br,gzip
static.example.com->>Browser: Use-As-Dictionary: match="/app/main.js*"<br/>Access-Control-Allow-Origin: https://www.example.com<br/>Vary: Accept-Encoding,Available-Dictionary

At build time, the site developer creates delta-compressed versions of main.js using previous builds as dictionaries, storing the delta-compressed version along with the SHA-256 hash of the dictionary used (e.g. as main.js.<hash>.br-d).
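
A build step along these lines needs the SHA-256 hash of each dictionary in two forms: as part of the stored file name and as the header value exchanged on the wire. The sketch below computes both. The helper names are hypothetical, using hex in the file name is an assumption (the text above only says `<hash>`), and the actual Brotli compression step is left out.

```python
import base64
import hashlib


def dictionary_digest(dictionary: bytes) -> bytes:
    """Raw SHA-256 digest of the dictionary contents."""
    return hashlib.sha256(dictionary).digest()


def dictionary_header_value(dictionary: bytes) -> str:
    """Format the digest as base64 between colons, as in the header examples here."""
    encoded = base64.b64encode(dictionary_digest(dictionary)).decode("ascii")
    return f":{encoded}:"


def delta_file_name(resource_name: str, dictionary: bytes) -> str:
    """Name for a pre-built delta; the hex encoding is an assumption of this sketch."""
    return f"{resource_name}.{dictionary_digest(dictionary).hex()}.br-d"


# Hypothetical contents of a previous build of main.js used as the dictionary.
previous_build = b"console.log('main.js build 123');\n"

print(dictionary_header_value(previous_build))
print(delta_file_name("main.js", previous_build))
```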

On a future visit to the site after the application code has changed:

The response could also have included a new Use-As-Dictionary: match="/app/main.js*" response header so that the new version of the file replaces the old one as the dictionary for future requests for the path, but that is not a requirement for the existing dictionary to have been used.

sequenceDiagram
Browser->>www.example.com: GET /
www.example.com->>Browser: ...<script src="//static.example.com/app/main.js/125" crossorigin>...
Browser->>static.example.com: GET /app/main.js/125<br/>Accept-Encoding: br-d,br,gzip<br/>sec-fetch-mode: cors<br/>Available-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:
static.example.com->>Browser: Content-Encoding: br-d<br/>Content-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:<br/>Access-Control-Allow-Origin: https://www.example.com<br/>Vary: Accept-Encoding,Available-Dictionary
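
The server side of this exchange can be sketched as a lookup of the advertised dictionary hash among the deltas built ahead of time. This is a minimal illustration, not a complete server: q-values in Accept-Encoding: are ignored, the CORS check described earlier is omitted, and the function names and the `prebuilt_deltas` mapping (header value to stored file path) are hypothetical.

```python
from typing import Dict, List, Mapping


def accepted_encodings(accept_encoding: str) -> List[str]:
    """Split Accept-Encoding into lowercase coding names, ignoring q-values."""
    return [
        item.split(";")[0].strip().lower()
        for item in accept_encoding.split(",")
        if item.strip()
    ]


def choose_response_headers(
    request_headers: Mapping[str, str], prebuilt_deltas: Mapping[str, str]
) -> Dict[str, str]:
    """Pick a content coding, preferring a pre-built dictionary-compressed delta."""
    headers = {name.lower(): value.strip() for name, value in request_headers.items()}
    encodings = accepted_encodings(headers.get("accept-encoding", ""))
    advertised = headers.get("available-dictionary")

    response = {"Vary": "Accept-Encoding,Available-Dictionary"}
    if "br-d" in encodings and advertised in prebuilt_deltas:
        # A delta built against exactly this dictionary exists, so serve it.
        response["Content-Encoding"] = "br-d"
        response["Content-Dictionary"] = advertised
    elif "br" in encodings:
        response["Content-Encoding"] = "br"
    elif "gzip" in encodings:
        response["Content-Encoding"] = "gzip"
    return response


dictionary_hash = ":pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:"
deltas = {dictionary_hash: "main.js.delta-for-build-123.br-d"}  # hypothetical path

print(choose_response_headers(
    {"Accept-Encoding": "br-d,br,gzip", "Available-Dictionary": dictionary_hash},
    deltas,
))
```

If the client advertises a dictionary the server has no delta for, the sketch falls back to regular Brotli or gzip, so the response is still usable, only larger.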

Site-specific dictionary used for all document navigations in a part of the site

In this example, www.example.com has a custom-built dictionary that should be used for all navigation requests to /product.

On the initial visit to the site:

sequenceDiagram
Browser->>www.example.com: GET /
www.example.com->>Browser: ...<link rel=dictionary href="/dictionaries/product_v1.dat">...
Browser->>www.example.com: GET /dictionaries/product_v1.dat<br/>Accept-Encoding: br,gzip
www.example.com->>Browser: use-as-dictionary: match="/product/*", match-dest=("document"), id="product_v1"

At some point after the dictionary has been fetched, the user clicks on a link to https://www.example.com/product/myproduct:

sequenceDiagram
Browser->>www.example.com: GET /product/myproduct<br/>Accept-Encoding: br-d,br,gzip<br/>Available-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:<br/>Dictionary-ID: "product_v1"
www.example.com->>Browser: Content-Encoding: br-d<br/>Content-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:

Changelog

These are the changes that have been made to the specification as it has progressed through various standards organizations and based on developer feedback during browser experiments.

Feb 2023