Purge mechanisms - Githubissues

yoavweiss commented 5 years ago

While talking to @bratell-at-opera about signed exchanges, he raised concerns about invalid content continuing to circulate after its publisher realized it is invalid.

Thinking about that problem, it's very similar to the one for which CDNs offer purge mechanisms:

Articles may include typos, embarrassing mistakes or even false information that can expose the publisher from a legal perspective. Once that is discovered, publishers are likely to want to take down that content immediately.
E-commerce sites may include pricing mistakes, which they typically want to correct ASAP, as that content can cause them to lose money.
Court order compliance may also be a reason a content's publisher may want it to no longer be considered valid.

Talking to @jyasskin, the validityUrl can be used in order to verify that the content is still valid from the publisher's perspective. Although it was meant for intermediaries, we can use the same value for browser-side validation, to make sure the browser doesn't display invalid content.

The browser would fetch the validity information when navigating to an SXG page. (assuming the browser is online) If the SXG content is invalid, the browser would force its reload. Assuming that caches validate content regularly, very few users would actually witness those reloads, in the already rare case that content needed to be purged.

This won't solve purge for offline access, but that's similar to any offline content (e.g. in a PWA or a native app)

Thoughts?

sleevi commented 5 years ago

I think there's a tension here; the more work put on repudiation (effectively what this is), the less useful this becomes to actually enabling a distributed, decentralized, or more robust Internet, which is at least where some of the interesting use cases are. Similarly, this might be mistaken as enabling a DRM-like solution, which is not a goal.

Considering that these same issues already exist if, for example, a site sets a long cache lifetime, I think it may be worthwhile exploring more about the "what if we didn't offer a purge mechanism", as it seems like it comes with very, very high tradeoffs.

twifkak commented 5 years ago

In some sense, repudiation is already possible, by signing <script>fetchUpdatedContent()</script>. This would be soft repudiation if the original content is signed and then replaced at runtime, and hard repudiation if only the script is signed (in which case, none of those benefits of SXG are realized).

bratell-at-opera commented 5 years ago

One reason I pointed out the risks above is that information and misinformation is currently getting weaponized on the Internet, and for information there are two important parts: The data and the source. If you can replay a historic (may it be 1 day historic) mistake or transient data as current information from a reputable source, that can be used to either mislead people or damage the reputation of a source that is trying to be accurate. Or both.

Client caches don't have this particular weakness since you cannot transfer a cache to someone that you want to mislead.

jyasskin commented 5 years ago

The proposal would compromise the anti-censorship use cases if the censor can compel the origin to respond to validity-url requests without the signature field, or when the censor can MitM requests to the origin. If not, the fact that validity-url requests fail open prevents the censor from using that to censor. The proposal would compromise anti-surveillance use cases in general. It wouldn't compromise privacy-preserving-prefetch as long as we're careful to only fetch the validity-url after the navigation.

As @twifkak says, publishers can build a repudiation mechanism using Javascript even if we don't build this into the browser, so the main question is whether we want to build it in by default.

A goal to be able to repudiate inaccurate content conflicts, although maybe not fatally, with @ericlaw1979's desire to make this work for Archival.

I agree with @bratell-at-opera that SXGs are more of a risk than just long cache lifetimes, because they can be used maliciously, and if #300 goes toward trusting the SXG over the cache, that's not even persistently fixed by hitting reload, unlike the cache itself.

jyasskin commented 5 years ago

Fetching the validity-url also makes a stolen private key much more discoverable: the attacker either has to be on-path, or their attack gets exposed to the origin server as a fetch to some URL that can include whatever reporting we think is reasonable.

yoavweiss commented 5 years ago

Good points regarding the tension of such validity mechanisms between keeping access to the content secret and making sure the content is still valid.

From the publisher's side: Maybe it is worthwhile to add a new caching directive (e.g. "cache-but-validate") which would tell the UA if it should or shouldn't validate the content from the publisher's perspective. That way, a publisher that knows it provides data which is restricted in some parts of the world will avoid the validation phase, at a risk of persisting typos, and publishers that consider their content not sensitive will turn on validation to get better assurances against persistent typos or mistakes.

From the user's side: Maybe in some browser modes (e.g. "Privacy" or "incognito" modes), UAs can ignore above directives and skip validation altogether, as an intervention. (at the risk of serving potentially-invalid content)

sleevi commented 5 years ago

Are there other use cases besides supporting censorship and geoblocking? It seems developers already have the tools to accomplish that without introducing additional SXG features, so I’m wondering what’s missing?

bratell-at-opera commented 5 years ago

With SXG I can put on the face of someone else and replay something they have said/published. The problem I'm thinking of isn't the replaying operation (I think it's perfectly fine to archive and show old web pages), but that it will be attributed to the source even if the source has changed their mind.

There has been numerous cases from the last few years where an initial news report turned out to be incorrect and was later changed, updated or corrected. As news sources have come under attack from decision makers it has become so much more important that information is up to date. If a decision maker or hostile power can replay a news source' mistakes or early best-effort reporting as "current reporting" that can be used to undermine the credibility of reputable news organizations, which is dangerous in so many ways.

It might be that this feature would be too dangerous for a news source to use and they should stay away from it, but I don't trust everyone to fully understand the implications of using SXG, especially if it will be heavily promoted and use encouraged by portal sites (like those run by Google and Microsoft).

Maybe there are ways to handle this scenario that I'm missing but with what I understand it seems like SGX comes packaged with a big foot gun.

sleevi commented 5 years ago

It seems like that threat model is “Something was published without including subsequent corrections” - but that use case already seems accounted for in the design. That is, as has been pointed out, “responsible” organizations (which seems to be the presumption, given they subsequently issue corrections) can, for example, use JS to ping to see if there have been corrections. This doesn’t seem dissimilar from the same root cause as publishing a vulnerable (e.g. XSS) SXG.

However, I think the tension here - between privacy on one hand and having the “latest” edition, and censorship and being able to repudiate SXGs - is somewhat intentional. SXG opts for more privacy and censorship resistance, and that admittedly does come with trade offs.

bratell-at-opera commented 5 years ago

An alternative to not displaying the resource (what you categorize as censorship) would be to stop labeling it with the original source.

So instead of "foonews.com" it would be "foonews.com 3 June as per portalsite.com", but that seems to be opposite of what "portalsite.com" wants and I don't really see what UI design would want or allow such a complicated explanation. Or an interstitial "You are about to visit a revoked page, do you want to load the new version instead?". Might also not be the choice of a UI designer, but I don't see "refuse to display the page" (what you call censorship) as the only way to handle a page a site no longer wants to spread.

Then if the goal is that only portalsite.com should know who reads the pages (and have it be secret from foonews.com), then maybe portalsite.com can have some kind of proxy for checking whether a page has been revoked. Again, I'm not saying that would be the solution, but that there may be ways to fix the foot gun without losing any features.

sleevi commented 5 years ago

I think there’s a first step, which is agreeing whether or not it represents a problem. We should be very careful in designing functionality that is inherently anti-privacy (functionally, the scheme just reinvented online revocation checking, which most UAs find problematic). We’ve also identified multiple alternatives that don’t require a new primitive and achieve the same result. The argument for the new primitive seems to be that it will be a “foot gun” unless it can be actively blocked at will, and I’m not sure that’s a shared perspective. Have I missed why the multiple options publishers have, including not publishing an SXG, aren’t sufficient?

jyasskin commented 5 years ago

I like the idea of including the signing date in the UI somewhere. I'm not sure it will fit, but we should ask our UI folks. Given the signing date, maybe with an oldness threshold defined by the publisher, I don't think there's any need to also list the distributor.
"some kind of proxy for checking whether a page has been revoked" is exactly the validity-url. It merely gives you a fresher signing timestamp, not TLS's full liveness property.

ithinkihaveacat commented 5 years ago

If a decision maker or hostile power can replay a news source' mistakes or early best-effort reporting as "current reporting" that can be used to undermine the credibility of reputable news organizations, which is dangerous in so many ways. — @bratell-at-opera

To build on this thought a little bit: in a part of the world where all of foonews.com traffic is routed through a central point, SXG gives portalsite.com the ability to selectively downgrade part of foonews.com without disrupting access to the rest of it. So /sport might be downgraded, but /weather stays fresh. Since some pages are fresh, users may not notice that /sport is actually outdated.

(This "attack" is most effective if the CDN delivers SXG in which all links go back to the CDN itself (not the origin), but the CDN could require this for performance or availability reasons.)

Without SXG, a central point cannot disrupt access to a part of foonews.com--it can only block all of it, since it can't see what's being requested, or the content of individual URLs. (Similarly, portalsite.com could provide access to all of foonews.com with the exception of all articles that mention "bananas", which get a 404.)

It might be that this feature would be too dangerous for a news source to use and they should stay away from it but I don't trust everyone to fully understand the implications of using SXG…

Cautious origins will probably need to either not publish SXGs, or only publish to CDNs it trusts. Allowing any CDN to cache content seems risky (even though allowing anyone to copy content is usually a helpful anti-censorship technique).

jyasskin commented 5 years ago

@ithinkihaveacat Note that there's no way for the CDN to deliver an SXG such that all links go back to the CDN itself. Even after bundling ships, the CDN would have to convince users to manually fetch content via the CDN, perhaps by blocking direct connections.

ithinkihaveacat commented 5 years ago

@jyasskin There's not? I was thinking it would be possible, though via the mechanism of CDN policy, rather than a technical fix. e.g. a CDN will only cache content if all sub-resources are delivered from the CDN, and all links go back to the CDN. This would be faster for users, but would also lead users to think they're navigating around https://foonews.com, when they're actually on https://portalsite.com/s/foonews.com.

youennf commented 5 years ago

The WebKit team has similar concerns. Always getting back to the publisher to validate that the content is still valid seems a good approach in the context of a browser. This validation should not require any user credentials, which is an improvement over navigating to the publisher. If user is wary of doing such validation directly, user can ask its browser to do it through means like trusted third-party, Tor...

KenjiBaheux commented 5 years ago

It seems that there is no definitely correct behavior for this. There is a significant amount of users who would prefer one behavior over the other to fulfill their needs (e.g. stronger privacy).

This suggests that:

It'd be best resolved through a user preference or equivalent.
Different browser vendors may have different opinions about what constitutes a good default.
The default value could be the same for all users or could be segmented, i.e. markets where surveillance is a significant user concern vs. not.

(edit: I removed the reference to Issue 388 since the root concern is still being explored).

wimleers commented 5 years ago

Yesterday, it was announced that Google Chrome will be shipping this: https://webmasters.googleblog.com/2019/04/instant-loading-amp-pages-from-your-own.html. AFAICT this has not yet been addressed.

On the same day, Cloudflare announced support for Signed Exchanges: https://blog.cloudflare.com/announcing-amp-real-url/.

WICG / webpackage

Purge mechanisms #376