facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

HTTP custom dictionary auto discovery #2853

Open boenrobot opened 2 years ago

boenrobot commented 2 years ago

Is your feature request related to a problem? Please describe.

Zstd would be awesome if implemented in browsers with standard dictionaries, and from what I've read, there are efforts to craft such standard dictionaries for common web formats, which would provide better compression in typical cases.

However, I think nothing would beat a custom dictionary trained on a specific site. For example, if I have a statically generated site, I could include zstd dictionary training as part of the build process to generate an optimal dictionary file once, and then serve the smallest download sizes possible, smaller even than gzip achieves.

To this end, there is currently a problem at the specification level (not to mention the implementation level...): there is no way for a client and server to automatically coordinate on a dictionary file. Using a dictionary is currently only possible if both ends know and set the dictionary in advance, which effectively makes dictionaries unusable in practice.

Describe the solution you'd like

I think the best way would be to register a /.well-known/ location with IANA containing a dictionary ID map... let's say /.well-known/zstd-dict-ids.json. As the name suggests, it would be a JSON file. Why JSON and not an even more compact custom format? To enable easier auto-generation via existing tooling.

It might, for example, have the following form:

[
  {
    "p": "/blog",
    "d": {
      "": "/url/of/default/dictionary",
      "32768": "/url/of/dictionary/for/id/32768",
      "32769": "/url/of/dictionary/for/id/32769"
    }
  }
]

That is: an array, each member being an object with a dictionary ID map (the key d) for a specific path prefix (the key p), with the path prefix defaulting to "/", i.e. all files.

I'm not strongly attached to this exact JSON schema, but I think using JSON, and allowing different paths to reuse the same dictionary ID for different dictionaries, would be valuable: JSON for easier generation, as already stated, and reusing IDs across paths would be useful in environments where files pre-generated from different sources are being aggregated. I realize that in a typical scenario a user will have just one set for all of their files (just different IDs per file type). Anyway...
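To illustrate what I mean, here's a rough TypeScript sketch of how a user agent might resolve a dictionary URL from such a map, given a resource path and the dictionary ID found in the response's zstd frame header. The schema, the longest-prefix-match rule, and the helper name are all just my assumptions, not anything specified:

// Sketch only: assumes the hypothetical /.well-known/zstd-dict-ids.json schema above.
type DictMapEntry = {
  p?: string;                 // path prefix; defaults to "/" (all files)
  d: Record<string, string>;  // dictionary ID (or "" for the default) -> dictionary URL
};

// Resolve the dictionary URL for a resource path and the dictionary ID read
// from the zstd frame header (null when the frame carries no ID).
function resolveDictionaryUrl(
  map: DictMapEntry[],
  resourcePath: string,
  dictId: number | null,
): string | undefined {
  const key = dictId === null ? "" : String(dictId);
  // Assumption: when several entries match, the longest path prefix wins.
  const matching = map
    .filter((entry) => resourcePath.startsWith(entry.p ?? "/"))
    .sort((a, b) => (b.p ?? "/").length - (a.p ?? "/").length);
  for (const entry of matching) {
    const url = entry.d[key];
    if (url !== undefined) return url;
  }
  return undefined; // no dictionary known; decode without one
}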

User agents would be expected to access that location for each origin on which they find a zstd response. Once they have the map, they can download the dictionary needed for the resource, and also pre-fetch other dictionaries in the background before requests for such files are even made. The map and each dictionary can be cached according to their respective caching headers, like any other resource.

If a user agent doesn't have a caching mechanism, or caching is otherwise hindered or disabled, it can still perform this procedure, though that does mean 3 HTTP requests (the compressed file, the map, the dictionary) per resource instead of just 1 or 2. This is well worth it for user agents that do have caching, where they make only N (compressed files) + 1 (map) + M (dictionaries; typically far fewer than the resources, at worst equal to their number) requests.
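As a rough illustration of that flow, here is how the per-resource sequence might look, written as if the user agent's internals were ordinary TypeScript (a page script could not do this, since fetch() decodes content encodings transparently). It reuses the DictMapEntry type and resolveDictionaryUrl helper sketched above; zstdDecompress() and readFrameDictId() are hypothetical helpers, and the well-known path is of course not registered anywhere yet:

// Hypothetical helpers: a zstd decoder and a reader for the frame header's
// optional dictionary ID field.
declare function zstdDecompress(data: ArrayBuffer, dictionary?: ArrayBuffer): ArrayBuffer;
declare function readFrameDictId(data: ArrayBuffer): number | null;

async function fetchWithDictionary(url: string): Promise<ArrayBuffer> {
  // Request 1: the resource itself.
  const response = await fetch(url);
  const body = await response.arrayBuffer();
  if (response.headers.get("content-encoding") !== "zstd") {
    return body; // not zstd-compressed; nothing more to do
  }

  const { origin, pathname } = new URL(url);

  // Request 2 (cacheable like any resource): the per-origin dictionary ID map.
  const mapResponse = await fetch(origin + "/.well-known/zstd-dict-ids.json");
  const map: DictMapEntry[] = mapResponse.ok ? await mapResponse.json() : [];

  const dictId = readFrameDictId(body);
  const dictUrl = resolveDictionaryUrl(map, pathname, dictId);
  if (dictUrl === undefined) {
    return zstdDecompress(body); // no dictionary known for this resource
  }

  // Request 3 (cacheable): the dictionary itself, resolved against the origin.
  const dictResponse = await fetch(new URL(dictUrl, origin));
  return zstdDecompress(body, await dictResponse.arrayBuffer());
}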

Describe alternatives you've considered

I thought about adding an option to the Transfer-Encoding HTTP header, alongside the format, e.g. Transfer-Encoding: zstd;d=/url/to/dictionary. But depending on the number of requests, the size of each resource, and the length of the dictionary URL, the overhead of including the dictionary file path in every response may outweigh the gains from a custom dictionary compared to a standardized one.

And that's not counting issues like the ones discussed in https://github.com/facebook/zstd/issues/2713 with modifying the HTTP protocol to support this.

This extra header parameter is not mutually exclusive with a dictionary ID map, of course... If both were supported and provided, I suppose the header would just take precedence.

Additional context

The above are all just ideas on how to solve this, but I believe some sort of solution to the custom dictionary problem will be needed before wide adoption in browsers and non-browser HTTP user agents, as zstd's unique selling point over gzip/brotli is hardly realized without dictionaries in place.

felixhandte commented 2 years ago

Hi @boenrobot,

I'm glad to hear this is something you're passionate about! I too really want to see this happen.

Unfortunately, there are non-trivial challenges that have hindered progress.

So we are pursuing an incremental strategy:

Step 1 is to get dictionary-less zstd into browsers. There's been some recent activity on this front in Chrome and at the W3C TPAC, so I'm hoping we see movement on this in the near future.

Step 2 is to ship a set of static dictionaries and standardize a means of using them. I hope to investigate this soon.

Step 3 will be to pursue dynamic/custom dictionaries. While we've deployed a custom scheme at Facebook (you can see some discussion here: mitmproxy/mitmproxy#4394), much work is required to turn it into a viable protocol for the open internet.

But I will take all the help I can get! If you'd like to pitch in, there are a lot of different ways to do that. Probably the simplest is to make your voice heard in these forums (HTTPWG, etc.) and let folks know that this is something you want to see.

boenrobot commented 2 years ago

> Consensus: Driving consensus in the internet is like herding cats. :)

Historically speaking, from what I've observed as a user watching standards committees' public transcripts, the easiest way to drive consensus seems to be a draft spec with a reference implementation for the most popular open source client and servers, one that is also backwards compatible with current behavior. This enables big players (in this case big CDNs like Akamai or Cloudflare) to install the reference server on their infrastructure, which in turn drives power users to try it with the reference client, which in turn drives adoption up even more.

But of course, there's the risk of the reference implementation and the associated draft spec changing drastically by the time they reach official status, especially if there are complexities in the interactions... The author of the reference implementations needs to be ready for that and to drop them when things change.

To this end, a feature branch in Chromium supporting this, plus a patch for the nginx module, would be all the reference implementations needed... I wish I were good enough with C++ to contribute those, but alas, I'm just a web developer.

> Complexity: The mechanisms, especially on the client side, are potentially complex. To name one issue, there are lots of complicated cache interactions.

... And this is part 1 of why there isn't such an implementation. Part 2 is that there isn't even a clear draft spec that an implementer can point to, so that other implementers can raise their concerns against it.

> Step 2 is to ship a set of static dictionaries and standardize a means of using them. I hope to investigate this soon.
>
> Step 3 will be to pursue dynamic/custom dictionaries.

I believe that if these two steps were swapped, adoption could happen more quickly. Having support for custom dictionaries lets users and implementers alike see the benefits of dictionaries in action, and lets problems be mitigated early. It would also allow standard dictionaries to start out as ordinary custom dictionaries that are eventually pre-shipped in browsers, so that a request for them is not even needed, further decreasing overall bandwidth in typical cases. And if problems with the standard dictionaries are found, everyone can fall back on custom dictionaries that fix them.

> Security: Dictionary-based compression opens a whole can of worms in terms of security. I've heard from pretty much all of the relevant parties that this is a blocking issue. I've been slowly working on an RFC to get some clarity on the problem and hopefully get agreement on what would make such a scheme compatible with the internet's security goals.

I'll admit security is not something I thought much about when writing my initial message... But I think my proposal above would not create a new security risk if the dictionaries pointed to by the map (or header param) are resources subject to the same-origin policy (and/or requiring CORS headers). That, plus only allowing servers to determine dictionaries, and allowing clients to optionally reuse the map-provided ones in requests (EDIT: only if the map somehow declares this as allowed), but never to ask the server to fetch an external dictionary.
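As a minimal sketch of the restriction I have in mind (an assumption on my part, not anything specified), a user agent would simply refuse any dictionary URL from the map or header parameter that does not resolve to the same origin as the compressed resource:

// Sketch of the same-origin restriction described above; exact CORS semantics
// are deliberately left out.
function isDictionaryUrlAllowed(resourceUrl: string, dictUrl: string): boolean {
  const resourceOrigin = new URL(resourceUrl).origin;
  // Relative dictionary URLs resolve against the resource, so they are
  // same-origin by construction.
  const dictOrigin = new URL(dictUrl, resourceUrl).origin;
  return dictOrigin === resourceOrigin;
}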

I read through the issues the RFC points out, and I think that alone covers them.

Though one extra point came up as I was going through it... I've been thinking of dictionaries as a given, but since they're really an extra feature, the client needs some way to advertise this support in general, as well as its preference on whether to use it. In a perfect world it would even advertise its decompression capabilities, so that the server can pick a dictionary most likely to result in a successful decompression.

My full conclusions when evaluating my proposal above against the checklist...

tomByrer commented 1 year ago

> Security: Dictionary-based compression opens a whole can of worms in terms of security.

I'm curious how brotli got accepted into browsers? While brotli's dictionary seems to have been pre-set for a few years, it accounts for roughly a third of the compression work at higher levels.

devicenull commented 3 months ago

Seems like Chrome added support for this a while ago: https://use-as-dictionary.com/