facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

HTTP custom dictionary auto discovery #2853

Open boenrobot opened 2 years ago

boenrobot commented 2 years ago

Is your feature request related to a problem? Please describe.

Zstd would be awesome if implemented in browsers with standard dictionaries, and from what I've read, there are efforts to craft such standard dictionaries for common web formats, which would provide better compression in typical cases.

However, I think nothing would beat a custom dictionary trained on a specific site. For example, if I have a statically generated site, I could include zstd dictionary training as part of the build process to generate an optimal dictionary file once, and then serve the smallest download sizes possible, smaller even than gzip achieves.

To this end, there is currently a problem at the specification level (not to mention the implementation level...): there is no way for a client and server to automatically coordinate on a dictionary file. Using a dictionary is currently only possible if both ends know and set the dictionary in advance, which effectively makes dictionaries unusable in practice.

Describe the solution you'd like

I think the best way would be to register a /.well-known/ location with IANA containing a dictionary ID map... let's say /.well-known/zstd-dict-ids.json. As the name suggests, it would be a JSON file. Why JSON and not an even more compact custom format? To enable easier auto-generation via existing tooling.

It might, for example, have the following form:

[
  {
    "p": "/blog",
    "d": {
      "": "/url/of/default/dictionary",
      "32768": "/url/of/dictionary/for/id/32768",
      "32769": "/url/of/dictionary/for/id/32769"
    }
  }
]

That is: an array, each member being an object with a dictionary ID map (the key d) for a specific path prefix (the key p), with the path prefix defaulting to "/", i.e. all files.

I'm not strongly attached to this exact JSON schema, but I think using JSON, and allowing different paths to reuse the same dictionary ID for different dictionaries, would be valuable: JSON for easier generation, as already stated, and reusing IDs across paths would be useful in environments where files pre-generated from different sources are being aggregated. I realize that in a typical scenario a user will have just one set for all of their files (just different IDs per file type). Anyway...
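To illustrate what I mean, here's a rough TypeScript sketch of how a user agent might resolve a dictionary URL from such a map, given a resource path and the dictionary ID found in the response's zstd frame header. The schema, the longest-prefix-match rule, and the helper name are all just my assumptions, not anything specified:

// Sketch only: assumes the hypothetical /.well-known/zstd-dict-ids.json schema above.
type DictMapEntry = {
  p?: string;                 // path prefix; defaults to "/" (all files)
  d: Record<string, string>;  // dictionary ID (or "" for the default) -> dictionary URL
};

// Resolve the dictionary URL for a resource path and the dictionary ID read
// from the zstd frame header (null when the frame carries no ID).
function resolveDictionaryUrl(
  map: DictMapEntry[],
  resourcePath: string,
  dictId: number | null,
): string | undefined {
  const key = dictId === null ? "" : String(dictId);
  // Assumption: when several entries match, the longest path prefix wins.
  const matching = map
    .filter((entry) => resourcePath.startsWith(entry.p ?? "/"))
    .sort((a, b) => (b.p ?? "/").length - (a.p ?? "/").length);
  for (const entry of matching) {
    const url = entry.d[key];
    if (url !== undefined) return url;
  }
  return undefined; // no dictionary known; decode without one
}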

User agents would be expected to access that location for each origin on which they find a zstd response. Once they have the map, they can download the dictionary needed for the resource, and also pre-fetch other dictionaries in the background before requests for such files are even made. The map and each dictionary can be cached according to their respective caching headers, like any other resource.

If a user agent doesn't have a caching mechanism, or caching is otherwise hindered or disabled, it can still perform this procedure, though that does mean 3 HTTP requests (the compressed file, the map, the dictionary) per resource instead of just 1 or 2. This is well worth it for user agents that do have caching, where they make only N (compressed files) + 1 (map) + M (dictionaries; typically far fewer than the resources, at worst equal to their number) requests.
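As a rough illustration of that flow, here is how the per-resource sequence might look, written as if the user agent's internals were ordinary TypeScript (a page script could not do this, since fetch() decodes content encodings transparently). It reuses the DictMapEntry type and resolveDictionaryUrl helper sketched above; zstdDecompress() and readFrameDictId() are hypothetical helpers, and the well-known path is of course not registered anywhere yet:

// Hypothetical helpers: a zstd decoder and a reader for the frame header's
// optional dictionary ID field.
declare function zstdDecompress(data: ArrayBuffer, dictionary?: ArrayBuffer): ArrayBuffer;
declare function readFrameDictId(data: ArrayBuffer): number | null;

async function fetchWithDictionary(url: string): Promise<ArrayBuffer> {
  // Request 1: the resource itself.
  const response = await fetch(url);
  const body = await response.arrayBuffer();
  if (response.headers.get("content-encoding") !== "zstd") {
    return body; // not zstd-compressed; nothing more to do
  }

  const { origin, pathname } = new URL(url);

  // Request 2 (cacheable like any resource): the per-origin dictionary ID map.
  const mapResponse = await fetch(origin + "/.well-known/zstd-dict-ids.json");
  const map: DictMapEntry[] = mapResponse.ok ? await mapResponse.json() : [];

  const dictId = readFrameDictId(body);
  const dictUrl = resolveDictionaryUrl(map, pathname, dictId);
  if (dictUrl === undefined) {
    return zstdDecompress(body); // no dictionary known for this resource
  }

  // Request 3 (cacheable): the dictionary itself, resolved against the origin.
  const dictResponse = await fetch(new URL(dictUrl, origin));
  return zstdDecompress(body, await dictResponse.arrayBuffer());
}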

Describe alternatives you've considered

I thought about adding an option to the Transfer-Encoding HTTP header, alongside the format, e.g. Transfer-Encoding: zstd;d=/url/to/dictionary. But depending on the number of requests, the size of each resource, and the length of the dictionary URL, the overhead of including the dictionary file path in every response may outweigh the gains from a custom dictionary compared to a standardized one.

And that's not counting issues like the ones discussed in https://github.com/facebook/zstd/issues/2713 with modifying the HTTP protocol to support this.

This extra header parameter is not mutually exclusive with a dictionary ID map, of course... If both were supported and provided, I suppose the header would just take precedence.

Additional context

The above are all just ideas on how to solve this, but I believe some sort of solution to the custom dictionary problem will be needed before wide adoption in browsers and non-browser HTTP user agents, as zstd's unique selling point over gzip/brotli is hardly realized without dictionaries in place.

felixhandte commented 2 years ago

Hi @boenrobot,

I'm glad to hear this is something you're passionate about! I too really want to see this happen.

Unfortunately, there are non-trivial challenges that have hindered progress.

So we are pursuing an incremental strategy:

Step 1 is to get dictionary-less zstd into browsers. There's been some recent activity on this front in Chrome and at the W3C TPAC, so I'm hoping we see movement on this in the near future.

Step 2 is to ship a set of static dictionaries and standardize a means of using them. I hope to investigate this soon.

Step 3 will be to pursue dynamic/custom dictionaries. While we've deployed a custom scheme at Facebook (you can see some discussion here: mitmproxy/mitmproxy#4394), much work is required to turn it into a viable protocol for the open internet.

But I will take all the help I can get! If you'd like to pitch in, there are a lot of different ways to do that. Probably the simplest is to make your voice heard in these forums (HTTPWG, etc.) and let folks know that this is something you want to see.

boenrobot commented 2 years ago

> Consensus: Driving consensus in the internet is like herding cats. :)

Historically speaking, from what I've observed as a user watching standards committees' public transcripts, the easiest way to drive consensus seems to be a draft spec with a reference implementation for the most popular open source client and servers, one that is also backwards compatible with current behavior. This enables big players (in this case big CDNs like Akamai or Cloudflare) to install the reference server on their infrastructure, which in turn drives power users to try it with the reference client, which in turn drives adoption up even more.

But of course, there's the risk of the reference implementation and the associated draft spec changing drastically by the time they reach official status, especially if there are complexities in the interactions... The author of the reference implementations needs to be ready for that and to drop them when things change.

To this end, a feature branch in Chromium supporting this, plus a patch for the nginx module, would be all the reference implementations needed... I wish I were good enough with C++ to contribute those, but alas, I'm just a web developer.

> Complexity: The mechanisms, especially on the client side, are potentially complex. To name one issue, there are lots of complicated cache interactions.

... And this is part 1 of why there isn't such an implementation. Part 2 is that there isn't even a clear draft spec that an implementer can point to, so that other implementers can raise their concerns against it.

> Step 2 is to ship a set of static dictionaries and standardize a means of using them. I hope to investigate this soon.
>
> Step 3 will be to pursue dynamic/custom dictionaries.

I believe that if these two steps were swapped, adoption could happen more quickly. Having support for custom dictionaries lets users and implementers alike see the benefits of dictionaries in action, and lets problems be mitigated early. It would also allow standard dictionaries to start out as ordinary custom dictionaries that are eventually pre-shipped in browsers, so that a request for them is not even needed, further decreasing overall bandwidth in typical cases. And if problems with the standard dictionaries are found, everyone can fall back on custom dictionaries that fix them.

> Security: Dictionary-based compression opens a whole can of worms in terms of security. I've heard from pretty much all of the relevant parties that this is a blocking issue. I've been slowly working on an RFC to get some clarity on the problem and hopefully get agreement on what would make such a scheme compatible with the internet's security goals.

I'll admit security is not something I thought much about when writing my initial message... But I think my proposal above would not create a new security risk if the dictionaries pointed to by the map (or header param) are resources subject to the same-origin policy (and/or requiring CORS headers). That, plus only allowing servers to determine dictionaries, and allowing clients to optionally reuse the map-provided ones in requests (EDIT: only if the map somehow declares this as allowed), but never to ask the server to fetch an external dictionary.
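As a minimal sketch of the restriction I have in mind (an assumption on my part, not anything specified), a user agent would simply refuse any dictionary URL from the map or header parameter that does not resolve to the same origin as the compressed resource:

// Sketch of the same-origin restriction described above; exact CORS semantics
// are deliberately left out.
function isDictionaryUrlAllowed(resourceUrl: string, dictUrl: string): boolean {
  const resourceOrigin = new URL(resourceUrl).origin;
  // Relative dictionary URLs resolve against the resource, so they are
  // same-origin by construction.
  const dictOrigin = new URL(dictUrl, resourceUrl).origin;
  return dictOrigin === resourceOrigin;
}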

I read through the issues the RFC points out, and I think that alone covers them.

Though one extra point came up as I was going through it... I've been thinking of dictionaries as a given, but since they're really an extra feature, the client needs some way to advertise this support in general, as well as its preference on whether to use it. In a perfect world it would even advertise its decompression capabilities, so that the server can pick a dictionary most likely to result in a successful decompression.

My full conclusions when evaluating my proposal above against the checklist...

tomByrer commented 1 year ago

> Security: Dictionary-based compression opens a whole can of worms in terms of security.

I'm curious how brotli got accepted into browsers? While brotli's dictionary seems to have been pre-set for a few years, it accounts for roughly a third of the compression work at higher levels.

devicenull commented 3 months ago

Seems like Chrome added support for this a while ago: https://use-as-dictionary.com/