When to fallback redirect?

irori commented 5 years ago

In the current loading spec, redirect to the fallback URL happens only when the signed exchange version is not supported.

However, Chromium implementation performs fallback redirect in the following cases (corresponding to the error codes here):

SXG was served from non-secure origin.
Unsupported version of SXG (could extract fallback URL).
SXG parse error (could extract fallback URL).
Network error occurred while loading SXG header.
Failed to fetch certificate chain.
Failed to parse certificate chain.
Signature verification failed.
Cert verification failed.
CT verification failed.
OCSP check failed.
Certificate Requirements aren't met.
SXG was served without X-Content-Type-Options: nosniff header.

Chromium doesn't fallbeck on the following errors:

SXG parse error (couldn't extract fallback URL).
Merkle integrity error.

We should clarify that which error should cause a fallback redirect, and spec them.

kinu commented 5 years ago

/cc @twifkak for consumer side feedback

twifkak commented 5 years ago

Thanks for opening the issue. Sorry for the delay; I've been sick and this slipped my radar.

As a consumer & distributor, my product preference is to make as many things as possible a fallback. That said, I understand there may be reasons not to prefer that, and I would like to know more about them:

Debugging -- as a developer, I won't even notice that my SXG is invalid unless I have DevTools Network tab open and I scroll all the way to the top. That said, it seems like the UA should prioritize the typical user's experience over developer experience.
What others?

From your list above, I think there are some that I would feel safe removing fallback for today:

SXG was served from non-secure origin.
Network error occurred while loading SXG header. (I assume this is a network error on the outer response?)
SXG was served without X-Content-Type-Options: nosniff header.

Essentially these are all the "outer exchange" things that are well-trodden and I feel confident we'll do correctly; also, they require that we generate the correct response rather than that we validate (but don't modify) a given response.

I'm also okay with not falling back for Merkle integrity (this one was simple enough to check) or parse error (fallback seems dangerous).

For a distributor with the privacy-preserving-prefetch use case, the downsides of having a network error page are:

The user sees an error page instead of simply reduced page speed & extra network bytes.
The error page is unbranded. (In the early days of AMP, there used to be more branded error pages, but my understanding is that these have significantly reduced in frequency due to improvements in search infra.).

The potential reasons that our cache may serve an invalid SXG include:

A known limitation in our implementation (i.e. we haven't implemented validation X yet).
An unknown difference (i.e. a bug in either our cache's or the UA's implementation of SXG validation).
An intrinsically unsolvable difference (e.g. it's impossible to know the UA's trusted roots, or UA clock jitter [skew changing significantly from one request to another]).

At this point, I'm mostly worried about the first two. But I'm mostly guessing. After #374 (and after we implement server-side monitoring of those reports), I'll be able to respond more usefully.

twifkak commented 5 years ago

(I'm presuming that the fallback for SXG version mismatch happens before the network error for MICE decode failure. Otherwise, future Chromium versions will need to continue supporting old MICE versions.)

jyasskin commented 5 years ago

Thank you for collating all the error cases so we can just go through them. My suggestion is:

Situation	Behavior	Notes
SXG was served from non-secure origin.	🛑Error
Unsupported version of SXG (could extract fallback URL).	↪️Fallback
SXG parse error (could extract fallback URL).	🛑Error
Network error while streaming SXG header or body.	🛑Error
Failed to fetch certificate chain.	🛑Error
Failed to parse certificate chain.	🛑Error
Signature parsing failed (This includes fetching and parsing the certificate)	🛑Error
Signature isn't valid	↪️Fallback	This could error instead on the last step, which actually runs the validation algorithm?
Certificate doesn't have the CanSignHttpExchanges OID	🛑Error	Maybe we shouldn't make this exception from the next step.
Certificate chain does not have a trusted leaf for the origin for another reason	↪️Fallback	This includes cert path building, OCSP checking, and CT checking.
Certificate chain does not establish cross-origin trust for another reason	🛑Error	The non-certificate reasons can be checked on the distributor.
SXG was served without X-Content-Type-Options: nosniff header.	🛑Error
SXG parse error (couldn't extract fallback URL).	🛑Error
Merkle integrity error.	🛑Error

I basically want anything the distributor could catch to be an Error, while anything that clients might vary on (mostly time & crypto validity) should fall back. @sleevi, how do you feel about this list?

sleevi commented 5 years ago

I basically want anything the distributor could catch to be an Error, while anything that clients might vary on (mostly time & crypto validity) should fall back.

At first glance, that sounds like a great principle and I think your analysis is reasonable and spot-on. Signature isn’t valid seems a reasonable error?

jyasskin commented 5 years ago

@sleevi https://wicg.github.io/webpackage/loading.html#validating-signature includes the check against the client's clock (which I think should definitely fall back) and the check that it's a secp256r1 key (where I want to have clients fall back when they remove broken crypto, but maybe we should do that with a version increment instead).

We should also make sure that the reporting API distinguishes between the cases that error vs fall back. Right now there are distinctions in the table above that aren't captured in the reporting API results.

sleevi commented 5 years ago

Yeah, the approach for changing crypto primitives was definitely seen as a version-bumpable thing. Perhaps it's worth splitting out "validate a signature" and "validate the timestamp" as two separate things, the former being an error, the latter being a fallback?

twifkak commented 5 years ago

Initial NEL analysis from AMP Cache is available to Googlers in http://b/126787670. Overall, error rate is relatively low, but I'd say not low enough to serve network errors in all cases.

The two most prevalent errors are cert_fetch_error & mi_error. My understanding is that these can be due to a bad connection, which is beyond the control of the distributor (e.g. may be due to ISP issues or spotty cell coverage). I'm thinking that these cases should continue to fall back. Otherwise, spotty connnections would disproportionately affect SXG, because browsers are willing to render unsigned HTML that was truncated or had failed subresources. Are there downsides to falling back in these cases?

sleevi commented 5 years ago

Cert Fetch Errors are more similar to failures to fetch AIA (which causes hardfail) or OCSP (for which behaviour varies between browsers), right?

As to whether it’s safe, the analysis would be:

Identifying all the information an adversary could learn by observing the fallback occurred
Identifying whether and how that information might already be available
Identifying how that information could be abused
Identifying what the process would be if we “got it wrong” in the analysis (in this case, what the implications would be if a browser had to suddenly switch to block)

irori commented 5 years ago

Reg. mi_error:

From the implementation's POV, browser sends SXG's inner response header to the renderer as soon as the SXG header is validated. Merkle integrity error is detected after that, so it's too late to chagne the response to 303 redirect at that pont.

jyasskin commented 5 years ago

cert_fetch_error: My instinct is that a failure to fetch the certificate from the same origin that's already transferring the signed exchange is similarly likely to a failure to fetch the signed exchange itself or the content from the publisher. Since we don't auto-reload on a network error, we shouldn't auto-redirect on one either.

mi_error:

It seems confusing to have a network error that truncates the SXG be expressed as mi_error. I think I'd want to see one of https://w3c.github.io/network-error-logging/#transmission-of-request-and-response-errors instead? @irori/@horo-t Is that plausible?
I hope the browser can render the parts of the content that have already been verified, even if the connection is broken which truncates and invalidates the last chunk.

irori commented 5 years ago

mi_error:

It seems confusing to have a network error that truncates the SXG be expressed as mi_error. I think I'd want to see one of https://w3c.github.io/network-error-logging/#transmission-of-request-and-response-errors instead? @irori/@horo-t Is that plausible?

IIUC, currently both a mi_error and a transmission error are reported. I think it's possible to stop sending mi_error for network error cases, but I'll defer to @horo-t.

I hope the browser can render the parts of the content that have already been verified, even if the connection is broken which truncates and invalidates the last chunk.

Chrome works in this way, although it's not guaranteed that response is loaded to the last validated chunk (internal buffer is discarded upon network error).

horo-t commented 5 years ago

mi_error:

It seems confusing to have a network error that truncates the SXG be expressed as mi_error. I think I'd want to see one of https://w3c.github.io/network-error-logging/#transmission-of-request-and-response-errors instead? @irori/@horo-t Is that plausible?

IIUC, currently both a mi_error and a transmission error are reported. I think it's possible to stop sending mi_error for network error cases, but I'll defer to @horo-t.

I agree. We should not send mi_error when caused by a transmission error. We need to update both the loading spec and Chromium implementation.

twifkak commented 5 years ago

Thanks for the clarifications; looks like I misunderstood what mi_error meant for the user. Re: cert_fetch_error; I'll defer to you all, unless I learn that these errors are disproportionately large and can't be mitigated at implementation side, and I can come up with a good reason they're different from AIA fetch. (The only difference I can think of is that a secure fallback seems possible in this case.)

WICG / webpackage

When to fallback redirect? #397