WICG / scroll-to-text-fragment

Proposal to allow specifying a text snippet in a URL fragment

Please consider an alternative mitigation mechanism to word segmentation #251

Open hsivonen opened 8 months ago

hsivonen commented 8 months ago

Whether a text fragment matches depends on how the text of the target page segments into words. This is framed as a mitigation to limit cross-origin information leakage.

While I’ve seen the intent not to change this, it seems problematic to make addressing on the Web dependent on an operation that is ambiguous and known to differ between implementations. As a spec-level matter, “a more sophisticated algorithm should be used based on the locale” is not well-defined, and as a practical matter, segmentation of the Thai, Khmer, Lao, and Burmese scripts is an area of active development. While ICU4C uses dictionary-based segmentation for these scripts, ICU4X already uses LSTM models for segmenting them, and there’s development activity in this direction on the ICU4C side as well. Moreover, it seems that Chromium uses different Khmer and Lao dictionaries than the ICU4C upstream.

Different results between different implementations seem acceptable for use cases such as double-clicking to select a word or CSS line breaking (there are font and viewport size differences anyway), but one would hope that Web addressing would work more consistently across implementations.
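To illustrate the kind of check involved, here is a minimal sketch of a word-boundedness test built on Intl.Segmenter. This is illustrative only, not the spec algorithm and not any engine’s implementation:

```ts
// Illustrative sketch only: a word-boundedness check in the style of the
// spec's restriction, built on Intl.Segmenter.
function isWordBounded(
  text: string,
  start: number,
  end: number,
  locale: string
): boolean {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  // Collect every segment start; since segments partition the string,
  // adding 0 and text.length covers all boundaries.
  const boundaries = new Set<number>([0, text.length]);
  for (const seg of segmenter.segment(text)) {
    boundaries.add(seg.index);
  }
  return boundaries.has(start) && boundaries.has(end);
}

// For a Thai string (no spaces between words), whether a given substring is
// "word bounded" depends entirely on the segmentation data the engine ships,
// so the same URL could match in one browser and fail in another.
const thai = "ปัญญาประดิษฐ์";
console.log(isWordBounded(thai, 0, 5, "th"));
```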

The spec doesn’t characterize how well word segmentation works as an information leakage mitigation, so it’s unclear how the upsides weigh against the downside of potentially unreliable matching with scripts that don’t use spaces between words.

Could a sufficient mitigation be achieved by requiring the matchable parts of the fragment to have a minimum length? (With some accommodation for scripts that have a particularly high information density per Scalar Value. See below.)

Is this mitigation layer really necessary if User Activation is required for cross-origin text fragments to be honored? Wouldn’t that make automated scanning so rate-limited as to be infeasible?

More info:

hsivonen commented 8 months ago

CC @fantasai

sffc commented 8 months ago

> I’ve been told that there are some breaks that a dictionary misses but LSTM finds and vice versa. Unfortunately, I don’t have concrete examples at hand.

I posted some examples here: https://github.com/unicode-org/lstm_word_segmentation/issues/25

bokand commented 8 months ago

Thanks for the detailed explanations!

> Is this mitigation layer really necessary if User Activation is required for cross-origin text fragments to be honored? Wouldn’t that make automated scanning so rate-limited as to be infeasible?

Most of the mitigations we added were added in the spirit of defense-in-depth. I believe the Chrome Security team felt there was sufficient risk here that having multiple redundant mitigations was warranted, i.e. finding a user activation bypass (which has happened in the past) or a bug in the scroll-to-text implementation shouldn't lead to a general XS-Search capability. @arturjanc @shhnjk could probably provide a better perspective from the security side of things. Curious also whether there are security folks from other vendors who could weigh in?

> Could a sufficient mitigation be achieved by requiring the matchable parts of the fragment to have a minimum length? (With some accommodation for scripts that have a particularly high information density per Scalar Value. See below.)

I'm not sure this would sufficiently address the brute-forcing risk as secret information often appears beside known text, e.g.

Your one time password: SECRET

An attacker could use known text to pad their search term to an arbitrary length.
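To make the brute-forcing concern concrete, here's a hypothetical sketch; the host name, page text, and side channel are made up, and the encoding is simplified:

```ts
// Hypothetical illustration: the attacker pads every guess with the known
// surrounding text, so any fixed minimum length on the fragment is always
// satisfied while the unknown part is probed one character at a time.
const knownPrefix = "Your one time password: ";
const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

const probes = [...alphabet].map(
  (guess) =>
    "https://victim.example/account#:~:text=" +
    encodeURIComponent(knownPrefix + guess)
);
// Navigate a frame/window to each probe URL and observe some side channel to
// learn which guess matched, then extend the known prefix and repeat.
```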

I do appreciate that the existing behavior is unfortunate from both a practical and a spec standpoint, so I'd be happy to find something better...maybe there are some additional properties that would get us there? e.g. typically a setup like the above would place the SECRET into a separate element:

<span>Your one time password:<strong>SECRET</strong></span>

Perhaps requiring each match+context term to have a minimum length within each node? Maybe something like that could work?
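A rough sketch of what such a per-node check could look like; the threshold and traversal details are made up and this isn't a worked-out proposal:

```ts
// Hypothetical "minimum length within each node" rule: for a candidate match
// (or context term) represented as a Range, require that the part contributed
// by each text node meets some minimum length.
const MIN_PER_NODE = 3; // illustrative threshold

function matchMeetsPerNodeMinimum(range: Range): boolean {
  const walker = document.createTreeWalker(
    range.commonAncestorContainer,
    NodeFilter.SHOW_TEXT
  );
  for (let node = walker.nextNode(); node; node = walker.nextNode()) {
    if (!range.intersectsNode(node)) continue;
    const textNode = node as Text;
    const start = node === range.startContainer ? range.startOffset : 0;
    const end = node === range.endContainer ? range.endOffset : textNode.length;
    const contributed = end - start;
    // A match that only clips a character or two out of <strong>SECRET</strong>
    // would fail here, while ordinary multi-word quotes would pass.
    if (contributed > 0 && contributed < MIN_PER_NODE) return false;
  }
  return true;
}
```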

bokand commented 8 months ago

> While I’ve seen the intent not to change this...

FWIW - I don't have a strong inclination to keep word boundaries - my intent on that issue was mainly that we can't simply remove it without a replacement, and it wasn't clear to me that there were any suggested alternatives...

As far as alternatives go, another option: given this is about preventing repeated character-by-character exfiltration, perhaps usage limits on a per-document or per-domain basis might be a sufficient replacement? e.g. limiting invocations to ~10 per minute? That should mostly not affect real usage while making attacks less feasible, I think?

I vaguely recall considering this when we were first looking into it, but I don't remember whether there was an issue with limits like this or the word boundary just seemed like a better choice at the time...
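For the sake of discussion, a rough sketch of the shape such a limit could take; the numbers and bookkeeping are illustrative, not a proposal:

```ts
// Hypothetical per-document invocation limit for cross-origin text fragment
// navigations: allow at most MAX_INVOCATIONS per rolling WINDOW_MS.
const MAX_INVOCATIONS = 10;
const WINDOW_MS = 60_000;
const invocationTimes: number[] = []; // tracked per document by the UA

function allowTextFragmentInvocation(now: number = Date.now()): boolean {
  // Drop timestamps that have aged out of the rolling window.
  while (invocationTimes.length > 0 && now - invocationTimes[0] > WINDOW_MS) {
    invocationTimes.shift();
  }
  if (invocationTimes.length >= MAX_INVOCATIONS) {
    return false; // ignore the text directive for this navigation
  }
  invocationTimes.push(now);
  return true;
}
```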

arturjanc commented 8 months ago

I do think this is a legitimate issue, but I'm worried that it will be difficult to find a solution that addresses the segmentation concern and upholds the security properties of STTF.

The problem is that we know it's possible for STTF to leak data from cross-origin documents under certain conditions, without relying on the existence of additional browser bugs. For example, if a page with sensitive data has an iframe that can be controlled by an attacker (e.g. an ad, etc.), it will be possible for the attacker to abuse STTF to learn information about text rendered close to the iframe (through the use of an Intersection Observer, which will tell the attacker how much of the frame has been rendered, revealing whether the scroll to the chosen text happened or not). This is just one example -- there are other situations where a cross-origin attacker might learn whether STTF triggered, revealing information about the presence of a string chosen by the attacker on the target page.
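To make the signal concrete, a conceptual sketch of what the attacker-controlled iframe might run; the threshold and what the attacker does with the result are illustrative:

```ts
// Conceptual sketch, as seen from inside an attacker-controlled iframe
// embedded near the sensitive text in the victim page.
let frameBecameVisible = false;

const observer = new IntersectionObserver(
  (entries) => {
    for (const entry of entries) {
      if (entry.isIntersecting) {
        // The iframe was scrolled into the embedder's viewport, which
        // correlates with the guessed #:~:text= string having matched nearby.
        frameBecameVisible = true;
      }
    }
  },
  { threshold: 0.5 }
);

// Observing the iframe's own root element reports visibility relative to the
// top-level viewport, so the attacker learns whether the scroll happened.
observer.observe(document.documentElement);
```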

Because of this, matching on word boundaries serves the important purpose of mitigating character-by-character exfiltration of sensitive strings (e.g. access tokens, account recovery codes, etc.).

It seems difficult to draw a line that would allow matching at finer-than-word granularity and still prevent this type of XS-Leak: sensitive tokens can be quite short (sometimes 4 or 6 characters), which makes rate limiting largely ineffective. Per-document and per-domain limits are easy to bypass because after triggering STTF the attacker can simply navigate to a different document under their control, possibly hosted on a different domain (but looking the same as the original document, which will make this difficult for the user to spot). As @bokand pointed out, minimum-length restrictions won't help if there is a known prefix before a sensitive string; and the approach from https://github.com/WICG/scroll-to-text-fragment/issues/251#issuecomment-1858558783 seems too optimistic to me because there will be cases where a sensitive string is present in the page without markup that would separate it from the preceding data.

Basically, my guess is that, unfortunately, none of the ideas we're discussing here will actually significantly mitigate the exfiltration concerns which spurred the introduction of the word segmentation restriction in the first place. It seems reasonable to think of alternative solutions, but in the absence of a robust mitigation I'd be quite wary of removing the word boundary restriction.