WICG / compression-dictionary-transport


Hashes, algorithm agility, and overlap with HTTP digests. #9

Closed LPardue closed 1 year ago

LPardue commented 1 year ago

The explainer describes that the client and server generate SHA-256 hashes and then use those to coordinate. Is there a specific reason why algorithm agility is not built into the protocol? In simple terms, the ability to migrate to other algorithms as the security environment evolves.

The more I look at this aspect, the more it gets me thinking about whether the design has some overlap with the HTTP digests specification https://httpwg.org/http-extensions/draft-ietf-httpbis-digest-headers.html

The explainer hints at wanting to constrain the size of the sec-bikeshed-available-dictionary field value via:

> SHA-256 hashes are long. Their hex representation would be 64 bytes, and we can base64 them to be ~42 (I think). We can't afford to send many hashes for both performance and privacy reasons.

but I wonder how much this really matters in practice.
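As a quick check on those numbers: the hex form of a 32-byte SHA-256 digest is 64 characters, and the padded base64 form is 44. A minimal Python sketch (the payload is illustrative):

```python
import base64
import hashlib

digest = hashlib.sha256(b"example dictionary contents").digest()

print(len(digest))                             # 32 raw bytes
print(len(digest.hex()))                       # 64 hex characters
print(len(base64.b64encode(digest).decode()))  # 44 base64 characters, with padding
```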

If we adopted an approach similar to the one digests use, you could make sec-bikeshed-available-dictionary a Structured Fields dictionary that conveys one or more hash values alongside their algorithm, e.g.:

sec-bikeshed-available-dictionary:
  sha-256=:d435Qo+nKZ+gLcUHn7GQtQ72hiBVAgqoLsZnZPiTGPk=:,
  sha-512=:YMAam51Jz/jOATT6/zvHrLVgOYTGFy1d6GJiOHTohq4yP+pgk4vf2aCs
  yRZOtw8MjkM7iw7yZ/WkppmM44T3qg==:

Even if you restrict this to a single hash, you still benefit from agility by sending the algorithm alongside the value.
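For illustration, here is a rough sketch of producing such a field value, keeping the sec-bikeshed-available-dictionary placeholder name from this thread and using the Structured Fields byte-sequence (:base64:) syntax; the content is made up:

```python
import base64
import hashlib

def sf_byte_sequence(digest: bytes) -> str:
    # Structured Fields serialize byte sequences as colon-delimited base64.
    return f":{base64.b64encode(digest).decode()}:"

content = b"example dictionary contents"
members = {
    "sha-256": hashlib.sha256(content).digest(),
    "sha-512": hashlib.sha512(content).digest(),
}

# An SF dictionary keyed by algorithm: multiple hashes of the *same* content,
# one per algorithm.
value = ", ".join(f"{alg}={sf_byte_sequence(d)}" for alg, d in members.items())
print(f"sec-bikeshed-available-dictionary: {value}")
```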

LPardue commented 1 year ago

Also, the server could send an equivalent of an integrity preference field (https://httpwg.org/http-extensions/draft-ietf-httpbis-digest-headers.html#section-4) to signal to the client which hash algorithms it uses, which would help in picking the most suitable compatible algorithm.
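A hedged sketch of how a client might use such a preference field, borrowing the weighted-dictionary shape of the digest spec's Want-* fields; the field name, weights, and algorithm sets here are hypothetical:

```python
# Parsed from a hypothetical server field such as
# "sec-bikeshed-dictionary-hashes: sha-512=10, sha-256=3" (higher = preferred),
# in the style of Want-Content-Digest from the digest spec.
server_prefs = {"sha-512": 10, "sha-256": 3}
client_supported = {"sha-256", "sha-384"}

# Pick the mutually supported algorithm the server prefers most.
candidates = [(weight, alg) for alg, weight in server_prefs.items() if alg in client_supported]
chosen = max(candidates)[1] if candidates else None
print(chosen)  # "sha-256": the only algorithm both sides support here
```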

pmeenan commented 1 year ago

Allowing different algorithms makes sense. I assume the hash algorithm negotiation would happen at the time of setting the dictionary as available so that could also be done cleanly.

I'm worried about allowing multiple values, though, and the impact on Vary: cardinality. If the delta-compressed asset is stored in an edge cache and varied on the requesting available-dictionary, the combinations could get out of hand unless there is some way to signal the specific value of the request header that the response matched for the Vary.
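To put a number on that concern: if clients may advertise any non-empty subset of n algorithms, a cache that varies on the raw header value can see up to 2^n - 1 distinct values per dictionary, before even accounting for ordering. A quick illustration (the algorithm names are examples):

```python
from itertools import combinations

algorithms = ["sha-256", "sha-384", "sha-512"]

# Under a Vary on the available-dictionary request header, each distinct
# serialized header value becomes a separate cache variant at the edge.
variants = [
    ", ".join(combo)
    for r in range(1, len(algorithms) + 1)
    for combo in combinations(algorithms, r)
]
print(len(variants))  # 7 distinct subsets of 3 algorithms (2**3 - 1)
```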

LPardue commented 1 year ago

The caching aspects seem like a valid consideration; it might be worth covering them in the explainer to explain why only a single algorithm is currently picked.

Then we can work in parallel to figure out whether agility can be implemented and whether it is worth the tradeoffs.

yoavweiss commented 1 year ago

> Allowing different algorithms makes sense. I assume the hash algorithm negotiation would happen at the time of setting the dictionary as available so that could also be done cleanly.

+1 to that.

> Is there a specific reason why algorithm agility is not built into the protocol? In simple terms, the ability to migrate to other algorithms as the security environment evolves.

I thought it wasn't strictly necessary: we're not relying on the hash for cryptographic purposes, so we aren't really concerned with collisions. With that said, the cost of negotiating an algorithm seems low enough.

> I'm worried about allowing multiple values, though, and the impact on Vary: cardinality. If the delta-compressed asset is stored in an edge cache and varied on the requesting available-dictionary, the combinations could get out of hand unless there is some way to signal the specific value of the request header that the response matched for the Vary.

Yeah, I won't be supportive of multiple values. It would add a lot of complexity, for no apparent reason.

LPardue commented 1 year ago

Just to double-check my understanding and to clarify things: when I said multiple values, I meant multiple hashes of the same content using strictly different algorithms. I didn't mean sending hashes of different content using the same algorithm (HTTP digests doesn't permit that, by virtue of the SF dictionary type).

SRI does allow both of these modes, but I don't see a strong reason for the latter.

And while I mention SRI, there are potentially some things we could borrow. Note that it prohibits MD5 and other weak algorithms, while requiring user agents to support sha-256, sha-384, and sha-512.
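For comparison, SRI integrity metadata is just "<alg>-<base64 digest>"; a quick sketch with an illustrative payload:

```python
import base64
import hashlib

content = b"console.log('hello');"
digest = hashlib.sha384(content).digest()

# The sha384-prefixed form used in <script integrity="..."> attributes.
print(f"sha384-{base64.b64encode(digest).decode()}")
```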

yoavweiss commented 1 year ago

I think this is fundamentally different from SRI.

With SRI, we are trying to cryptographically protect a resource, so cryptographic strength matters, and collisions put users at risk (attackers can switch files on them, replacing an innocuous payload with a malicious one).

Here, collisions are significantly less likely (the collision space is the set of dictionaries on that particular origin/scope), and if they happen, the result would most likely be a corrupted resource rather than a malicious one. More importantly, no one would be incentivized to find and "exploit" such collisions (if you pwned the delivery server, there are easier ways to DoS the victim site).

So I don't think we'd have a constant need to upgrade the hash strength as new ways to manufacture collisions arise.

LPardue commented 1 year ago

Thanks for the explanation. I tend to agree. It might be good to capture some of this threat modelling in the doc so others can assess it.

The hypothetical threat I was thinking of is one where a collision occurs and can somehow manipulate the outcome of the decompression by affecting what was retrieved. But yeah, the origin scoping probably makes this fine, because at the point a server can be manipulated like that, there are even more trivial attacks available.