WICG / nav-speculation

Proposal to enable privacy-enhanced preloading
https://wicg.github.io/nav-speculation/
Other
154 stars 35 forks source link

No-Vary-Search producing multiple matches #210

Closed ricea closed 1 month ago

ricea commented 2 years ago

Consider the following sequence:

  1. Cache is empty
    • Request for https://a/?b=1&c=2
    • A network request is made.
    • Response 1 with No-Vary-Search: params=("b")
    • Request for https://a/?b=2&c=2
    • This doesn't match Response 1 because the value of param b is different, so a new network request is made.
    • Response 2 with No-Vary-Search: params=("c")
    • Request for https://a/?c=2&b=1
    • This matches both Response 1 and Response 2.

What happens in this case?

jeremyroman commented 2 years ago

In principle either response is valid.

The same situation can occur with the existing Vary semantics. HTTP 9111 says on that subject:

If multiple stored responses match, the cache will need to choose one to use. When a nominated request header field has a known mechanism for ranking preference (e.g., qvalues on Accept and similar request header fields), that mechanism MAY be used to choose a preferred response. If such a mechanism is not available, or leads to equally preferred responses, the most recent response (as determined by the Date header field) is chosen, as per Section 4.

Specifically HTTP 9111 determines that age using a combination of Age and Date headers, the current time, the time of the request, and the time the response was received.

It seems reasonable to do the same if No-Vary-Search were implemented in an HTTP cache, and to do something either the same or similar in spirit (possibly as simple as "time the response was received") in other caches, like the prefetch cache.

jeremyroman commented 2 years ago

In principle we could try to infer a ranking preference from the No-Vary-Search, but it's not obvious to me that it would add much value.

ricea commented 2 years ago

The practical difference for browsers is that we only keep one version of a resource and fetch a new one on Vary header mismatch. Whereas for No-Vary-Search we'd have to actually implement the tie-break algorithm.

valenting commented 2 months ago

Could the spec say that the No-Vary-Search header should generally have the same value for all responses, regardless of the value of the params? And if there's a difference, the browser should remember the last value of the header? That way an implementation can save the value of the header under https://a/, instead of needing to check the header for all resources with all query combinations.

jeremyroman commented 2 months ago

Do you mean for all responses from the origin https://a, or all responses for a specific request URI, excluding query (and fragment)?

Having a per-origin policy seems unlikely to be flexible enough in practice.

A per-path one seems more viable, but less flexible (perhaps the "b" param is meaningful only when the "a" param has a specific value, for instance). And it complicates the conceptual model to require a separate index of NVS values per path, rather than having an index be an optimization.

How much does it simplify implementations to be able to store such a value keyed by path only? Browsers could instead store an index of {query, NVS} pairs instead without limiting functionality in a way that would be surprising to me. That is more complicated (especially wrt expiry/eviction), but some of those complications exist with any kind of per-path (rather than per-cached-response) metadata.

valenting commented 2 months ago

Do you mean for all responses from the origin https://a, or all responses for a specific request URI, excluding query (and fragment)?

Having a per-origin policy seems unlikely to be flexible enough in practice.

It should be per path https://a/path not per origin. Sorry for the confusion.

My main concern with the following is that the order in which you load the URL changes if it's a cache hit or miss.

  1. https://a/path?a=2&b=3 No-Vary-Search: params=("b")
  2. https://a/path?a=1 No-Vary-Search: params=("a", "b")

When loaded [1,2] then 2 is a cache miss because 1 says it should vary on a When loaded [2,1] then 1 would be a cache hit, because 2 says no-vary on a and b.

In my opinion a response SHOULD return the same No-Vary-Search regardless what params were in the initial request (for a specific path of course - different paths should be allowed to have different behaviour).

jeremyroman commented 2 months ago

Such cases are possible, but it's also possible for it to sensibly vary on some params depending on other params, or on request headers, etc. (e.g., "the b parameter is meaningful only if a=1" is a coherent policy that won't result in such a thing).

Such cases are also already possible with Vary and request headers. For a contrived example, perhaps two versions of a resource are available, one statically in English (the most common language for the site) and one which builds the page dynamically and can support any language (including English, if need be):

GET /foo HTTP/1.1
Accept-Language: en

200 OK
Content-Language: en
Vary: Accept-Language

Web page in English!
GET /foo HTTP/1.1
Accept-Language: fr

200 OK

<script async>
fetch(`https://localize.example/?page=foo&hl=${navigator.language}`).then(...);
</script>

These have the same property that the order of the requests would affect whether two requests are needed or only one. Not typical, but could be semantically correct in principle.

Some (all?) browser caches somewhat sidestep this by only storing one response per request URI today, but that's a choice of convenience.

The concept of a path (rather than a resource) having its own characteristics would be new to HTTP afaik (except for things like ServiceWorker scopes, I suppose), so I'd want to be sure we're gaining enough to justify such a change.

valenting commented 2 months ago

Some (all?) browser caches somewhat sidestep this by only storing one response per request URI today, but that's a choice of convenience.

But this is the opposite of that, right? It's using one response for multiple request URIs.

I think No-Vary-Search would be a good feature for browsers to implement, but if that requires consulting multiple cache entries in order to find a match, I think it will have a difficult path in being adopted. The algorithm to determine if the browser can use the cache to serve a request should be constant time. The reason I say this is that in Firefox we discovered a while ago that fetching resources from the network was often faster than using the disk cache. Things have improved a bit since SSDs became more common, but it still happens. If the browser needs to scan all of the cache files for all the param combinations, that would make it more difficult to implement this in a way that actually improves performance.

jeremyroman commented 2 months ago

Some (all?) browser caches somewhat sidestep this by only storing one response per request URI today, but that's a choice of convenience.

But this is the opposite of that, right? It's using one response for multiple request URIs.

This is comparable to Vary in principle. I agree that the way browsers implement Vary isn't suitable here (because browsers just take advantage of the fact that, for a single user profile, it's unusual for two responses to be both useful to store, as users don't frequently change their language, their UA's supported content encodings, etc).

I think No-Vary-Search would be a good feature for browsers to implement, but if that requires consulting multiple cache entries in order to find a match, I think it will have a difficult path in being adopted. The algorithm to determine if the browser can use the cache to serve a request should be constant time. The reason I say this is that in Firefox we discovered a while ago that fetching resources from the network was often faster than using the disk cache. Things have improved a bit since SSDs became more common, but it still happens. If the browser needs to scan all of the cache files for all the param combinations, that would make it more difficult to implement this in a way that actually improves performance.

That's fair. I'm just trying to see if there aren't approaches implementations can take to make the most natural (to the author) semantics work. Having a compact index wouldn't be O(1) (unless you bound its size), but would involve less seeking than opening several different files on disk to read their headers.

If we can't, then maybe it is okay to say that authors should supply the same NVS value for the same path, where we still double-check that it's true on the response before we actually serve it as a cache hit -- and that failing to do so would simply result in a cache miss where the response would have been usable. I think I'm better understanding your proposal here as we discuss it.

(This is less of an issue with the navigational prefetch cache in Chromium, which is both small and in memory -- but for the disk cache I agree that implementations will need an adequately fast solution.)

jeremyroman commented 1 month ago

Checked with one property which is using NVS for prefetch today (and I'm hoping their experience would generalize to users in the HTTP cache). When they do, they today send difference NVS header values (params, except=("a" "z1") for requests with z1; params, except=("a" "z2" "z3") for requests with z2 or z3), but combining (params, except=("a" "z1" "z2" "z3")) them would be reasonable and correct. They don't currently send the header at all on other responses, though, and it wouldn't surprise me if that were to become common.

So would this be appropriate, in your mind?

I think this would enable a shortcut implementation, for cases like a browser HTTP cache where one is desirable, like:

  1. Access cache[URL]. If it is a usable response (Vary matches, is fresh, etc), use it and return.
  2. Access cache["LastNoVarySearchHeader:" + StripQuery(URL)]. If absent, return. Otherwise, let lastNVS be the value.
  3. Canonicalize URL by removing query parameters ignored in lastNVS, and stable sorting the query by key, if key order should be ignored. Let nvsCanonURL be the result.
  4. Access cache[nvsCanonURL]. If it is a usable response (Vary matches, NVS matches, is fresh, etc), use it and return.

This is a constant number of accesses to a hashtable (or filesystem directory, database index, etc) that usually does what is intended assuming servers do send only one non-empty value. (Clients could store more than one, or query all possible cache entries still, if desired.)

valenting commented 1 month ago

@jeremyroman Yes, this looks great IMO.

Thank you!

jeremyroman commented 1 month ago

Migrated to https://github.com/httpwg/http-extensions/issues/2908.