Zstandard Interpretation of Dictionary

felixhandte commented 11 months ago

Zstandard can use both structured and raw content dictionaries (RFC8878 sec. 5). When a buffer is presented to Zstandard to be used as a dictionary it must be instructed how to interpret it. (If a properly formatted dictionary is used as a structured dictionary by the compressor and as a raw content dictionary by the decompressor, or vice-a-versa, the reconstructed output will likely differ from the original content.)

The three options provided by zstd are:

Auto: See if the leading bytes match the magic for a Zstandard-formatted dictionary. If so, interpret it as a formatted dictionary. Otherwise, interpret it as a raw dictionary.
Raw: Ignore the header even if it looks like a structured dictionary. Use the buffer as raw content.
Full: Interpret the dictionary as a structured dictionary. Fail if it doesn't conform.

One option is to use the MIME type of the resource being used as a dictionary (as discussed in #44) to signal how it should be interpreted. But simpler might just be to use the auto-interpretation mechanism.

Whatever we choose, the description of the zstd-d content-encoding should be updated to be explicit about this.

felixhandte commented 11 months ago

It is slightly tricky because, now that I think about it more, the auto-detection mechanism has one downside. At least in the zstd implementation, dictionary materialization will fail if the dictionary starts with the right magic but the rest of the header is malformed (rather than fall back to treating it as a raw dictionary).

It's easy to imagine that this opens a window for mischief if an attacker can engineer content to hit this case. In practice I don't think it actually does, since the compression side will fail in the same way. Curious to hear your thoughts on how much of a dealbreaker that is.

pmeenan commented 11 months ago

I'd be inclined to only support raw dictionaries. That is required for the same-resource version update case anyway and the same dictionary can be used independent of encoding.

It would take some significant benefits of the encoded dictionaries to justify the added negotiation complexity.

On Mon, Jul 31, 2023 at 3:12 PM Felix Handte @.***> wrote:

It is slightly tricky because, now that I think about it more, the auto-detection mechanism has one downside. At least in the zstd implementation https://github.com/facebook/zstd/blob/v1.5.5/lib/decompress/zstd_ddict.c#L112-L114, dictionary materialization will fail if the dictionary starts with the right magic but the rest of the header is malformed (rather than fall back to treating it as a raw dictionary).

It's easy to imagine that this opens a window for mischief if an attacker can engineer content to hit this case. In practice I don't think it actually does, since the compression side will fail in the same way. Curious to hear your thoughts on how much of a dealbreaker that is.

— Reply to this email directly, view it on GitHub https://github.com/WICG/compression-dictionary-transport/issues/45#issuecomment-1658985198, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADMOBJRYFC7JGTAV6WXT5LXS77QXANCNFSM6AAAAAA26SUSBQ . You are receiving this because you are subscribed to this thread.Message ID: @.*** com>

horo-t commented 11 months ago

Brotli library also supports serialized format.

I think it is reasonable to use content-type response header for the negotiation for the dedicated dictionary case.

pmeenan commented 11 months ago

I'm still a bit concerned about the complexity of carrying encoding-specific dictionaries. I don't mind adding support for it to the spec but the clients would need to know to look for specific dictionary types and not advertise dictionary support if the dictionary they got did not match a known type.

That might be better negotiated through the use-as-dictionary params with a "type" param that defaults to "raw" or something like that and require any clients that receive an unknown type to ignore the dictionary. It would also require the client to only advertise the appropriate accept-encoding that matches the dictionary (where raw dictionaries could be used in either case).

If the added savings is worth the complexity for the stand-alone dictionary case, now is the time to add it so that we don't end up with clients trying to use dictionaries that they can't support.

pmeenan commented 11 months ago

@felixhandte @horo-t could you look at the PR and see if that looks like it would do what you'd like without adding too much complexity?

It adds a type param to the use-as-dictionary negotiation that allows for format-specific dictionaries and requires that clients ignore dictionary types that they don't support.

The main downside that comes to mind is there is no way to negotiate from the client side what types of dictionaries it would support for the server to pick an appropriate one. If some browsers ship support for only raw and others ship support for the serialized formats, a server would need to either support the lowest common denominator (with raw) or only support the browsers that support the structured format.

We could add a negotiation of some kind to the link rel=dictionary fetch path (maybe on the accept: side of things) but it's already starting to feel a bit messy.

horo-t commented 11 months ago

I think if we will introduce MIME types for dictionaries, we can simply use the Accept header.

For example, when <link rel=dictionary> triggers a dictionary fetch, the request header should have the Accept header.

Accept: application/x-raw-dict, application/x-brotli-dict

And the server should return the dictionary with Content-Type header which indicates the type of the dictionary

Content-Type: application/x-brotli-dict

If the browser supports the Zstandard-formatted dictionary, the the Accept header should be:

Accept: application/x-raw-dict, application/x-zstd-dict

If we have those new MIME types, I don't think we need to have type field in use-as-dictionary header

pmeenan commented 11 months ago

I don't think we can rely on the mime types.

Any arbitrary response can be used as a dictionary (in the normal fetch case for versioned resources).
We don't differentiate the link rel=dictionary fetches (intentionally). It's just a trigger for an out-of-band fetch.
The client needs to know to ignore any dictionaries that the server would expect to be used as a specific format that the client doesn't understand.

That would mean the client would need to look for application/x-*-dict and ignore just the types it doesn't understand.

The support for structured/serialized dictionaries feels a bit risky in general because the same file hash would work as a raw dictionary but produce invalid results (or fail the decode) if the server thought the client supported the extended formats.

With type we could at least force all clients to ignore types they don't understand and leave it for future expansion without the clients needing to know the specifics now. Even then it still makes me a little nervous that the same client-advertised hash doesn't convey "how" the dictionary would be used since it would still be perfectly valid as a raw dictionary. It would be up to the origin to ensure that that dictionary ALWAYS carried the type param in the response when it is being stored and at the time of sec-available-dictionary we assume that the client and server have already agreed on how a given dictionary hash is to be used (which works for pre-loading dictionaries in clients as well).

horo-t commented 11 months ago

Ah, I agree that we need to introduce type in the use-as-dictionary header to properly handle (block) unsupported formats.

But I think that the MIME types in the Accept header are still useful for negotiation. Without it, the server cannot know what formats the browser supports.

Do you think it is too complicated to use both the new MIME types and the new type in the use-as-dictionary header? We need to support the serialized format only when fetching dictionaries using <link rel=dictionary>. So we need to set Accept: application/x-brotli-dict only when fetching dictionaries using <link rel=dictionary>.

horo-t commented 11 months ago

+CC: @nidhijaju who is working on Shared Zstd support on Chromium. https://chromium-review.googlesource.com/c/chromium/src/+/4664191 In this CL, ZstdSourceStream is calling ZSTD_DCtx_loadDictionary() which uses ZSTD_dct_auto.

pmeenan commented 11 months ago

I tweaked the PR to ONLY support raw currently but allow for future expansion if there is a case for it. I don't think we want to block current work on supporting protocol-specific dictionaries. My guess is that the added benefits from the serialized dictionaries aren't going to be worth the added complexity but this will at least allow for backward-compatibility and adding it in the future without breaking existing clients.

In most of the cases where a format-specific dictionary would be of use, my guess is you could get most of the benefit with a slightly larger raw dictionary or language-specific dictionaries. Since the dictionaries are downloaded async and we're not limited to a single dictionary for all content, we have that kind of flexibility without needing to extract every byte of savings.

horo-t commented 11 months ago

OK. It sounds reasonable to me to support only raw format now and allow allow for future expansion.

LGTMed the PR.

WICG / compression-dictionary-transport

Zstandard Interpretation of Dictionary #45