WICG / scroll-to-text-fragment

Proposal to allow specifying a text snippet in a URL fragment

Please consider an alternative mitigation mechanism to word segmentation #251

Open hsivonen opened 8 months ago

hsivonen commented 8 months ago

Whether a text fragment matches depends on how the text of the target page segments into words. This is framed as a mitigation to limit cross-origin information leakage.

While I’ve seen the intent not to change this, it seems problematic to make addressing on the Web dependent on an operation that is ambiguous and known to differ between implementations. As a spec-level matter, “a more sophisticated algorithm should be used based on the locale” is not well-defined, and as a practical matter, segmentation of the Thai, Khmer, Lao, and Burmese scripts is an area of active development. While ICU4C uses dictionary-based segmentation for these scripts, ICU4X already uses LSTM models for segmenting them, and there’s development activity in this direction on the ICU4C side as well. Moreover, it seems that Chromium uses different Khmer and Lao dictionaries than the ICU4C upstream.

Different results between different implementations seem acceptable for use cases such as double-clicking to select a word or CSS line breaking (there are font and viewport size differences anyway), but one would hope that Web addressing would work more consistently across implementations.
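To illustrate the kind of check involved, here is a minimal sketch of a word-boundedness test built on Intl.Segmenter. This is illustrative only, not the spec algorithm and not any engine’s implementation:

```ts
// Illustrative sketch only: a word-boundedness check in the style of the
// spec's restriction, built on Intl.Segmenter.
function isWordBounded(
  text: string,
  start: number,
  end: number,
  locale: string
): boolean {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  // Collect every segment start; since segments partition the string,
  // adding 0 and text.length covers all boundaries.
  const boundaries = new Set<number>([0, text.length]);
  for (const seg of segmenter.segment(text)) {
    boundaries.add(seg.index);
  }
  return boundaries.has(start) && boundaries.has(end);
}

// For a Thai string (no spaces between words), whether a given substring is
// "word bounded" depends entirely on the segmentation data the engine ships,
// so the same URL could match in one browser and fail in another.
const thai = "ปัญญาประดิษฐ์";
console.log(isWordBounded(thai, 0, 5, "th"));
```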

The spec doesn’t characterize how well word segmentation works as an information leakage mitigation, so it’s unclear how the upsides weigh against the downside of potentially unreliable matching with scripts that don’t use spaces between words.

Could a sufficient mitigation be achieved by requiring the matchable parts of the fragment to have a minimum length? (With some accommodation for scripts that have a particularly high information density per Scalar Value. See below.)

Is this mitigation layer really necessary if User Activation is required for cross-origin text fragments to be honored? Wouldn’t that make automated scanning so rate-limited as to be infeasible?

More info:

hsivonen commented 8 months ago

CC @fantasai

sffc commented 8 months ago

> I’ve been told that there are some breaks that a dictionary misses but LSTM finds and vice versa. Unfortunately, I don’t have concrete examples at hand.

I posted some examples here: https://github.com/unicode-org/lstm_word_segmentation/issues/25

bokand commented 8 months ago

Thanks for the detailed explanations!

> Is this mitigation layer really necessary if User Activation is required for cross-origin text fragments to be honored? Wouldn’t that make automated scanning so rate-limited as to be infeasible?

Most of the mitigations we added were added in the spirit of defense-in-depth. I believe the Chrome Security team felt there was sufficient risk here that having multiple redundant mitigations was warranted, i.e. finding a user activation bypass (which has happened in the past) or a bug in the scroll-to-text implementation shouldn't lead to a general XS-Search capability. @arturjanc @shhnjk could probably provide a better perspective from the security side of things. Curious also whether there are security folks from other vendors who could weigh in?

> Could a sufficient mitigation be achieved by requiring the matchable parts of the fragment to have a minimum length? (With some accommodation for scripts that have a particularly high information density per Scalar Value. See below.)

I'm not sure this would sufficiently address the brute-forcing risk as secret information often appears beside known text, e.g.

Your one time password: SECRET

An attacker could use known text to pad their search term to an arbitrary length.
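To make the brute-forcing concern concrete, here's a hypothetical sketch; the host name, page text, and side channel are made up, and the encoding is simplified:

```ts
// Hypothetical illustration: the attacker pads every guess with the known
// surrounding text, so any fixed minimum length on the fragment is always
// satisfied while the unknown part is probed one character at a time.
const knownPrefix = "Your one time password: ";
const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

const probes = [...alphabet].map(
  (guess) =>
    "https://victim.example/account#:~:text=" +
    encodeURIComponent(knownPrefix + guess)
);
// Navigate a frame/window to each probe URL and observe some side channel to
// learn which guess matched, then extend the known prefix and repeat.
```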

I do appreciate that the existing behavior is unfortunate from both a practical and a spec standpoint, so I'd be happy to find something better...maybe there are some additional properties that would get us there? e.g. typically a setup like the above would place the SECRET into a separate element:

<span>Your one time password:<strong>SECRET</strong></span>

Perhaps requiring each match+context term to have a minimum length within each node? Maybe something like that could work?
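A rough sketch of what such a per-node check could look like; the threshold and traversal details are made up and this isn't a worked-out proposal:

```ts
// Hypothetical "minimum length within each node" rule: for a candidate match
// (or context term) represented as a Range, require that the part contributed
// by each text node meets some minimum length.
const MIN_PER_NODE = 3; // illustrative threshold

function matchMeetsPerNodeMinimum(range: Range): boolean {
  const walker = document.createTreeWalker(
    range.commonAncestorContainer,
    NodeFilter.SHOW_TEXT
  );
  for (let node = walker.nextNode(); node; node = walker.nextNode()) {
    if (!range.intersectsNode(node)) continue;
    const textNode = node as Text;
    const start = node === range.startContainer ? range.startOffset : 0;
    const end = node === range.endContainer ? range.endOffset : textNode.length;
    const contributed = end - start;
    // A match that only clips a character or two out of <strong>SECRET</strong>
    // would fail here, while ordinary multi-word quotes would pass.
    if (contributed > 0 && contributed < MIN_PER_NODE) return false;
  }
  return true;
}
```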

bokand commented 8 months ago

> While I’ve seen the intent not to change this...

FWIW - I don't have a strong inclination to keep word boundaries - my intent on that issue was mainly that we can't simply remove it without a replacement, and it wasn't clear to me that there were any suggested alternatives...

As far as alternatives go, another option: given this is about preventing repeated character-by-character exfiltration, perhaps usage limits on a per-document or per-domain basis might be a sufficient replacement? e.g. limiting invocations to ~10 per minute? That should mostly not affect real usage while making attacks less feasible, I think?

I vaguely recall considering this when we were first looking into it, but I don't remember whether there was an issue with limits like this or the word boundary just seemed like a better choice at the time...
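For the sake of discussion, a rough sketch of the shape such a limit could take; the numbers and bookkeeping are illustrative, not a proposal:

```ts
// Hypothetical per-document invocation limit for cross-origin text fragment
// navigations: allow at most MAX_INVOCATIONS per rolling WINDOW_MS.
const MAX_INVOCATIONS = 10;
const WINDOW_MS = 60_000;
const invocationTimes: number[] = []; // tracked per document by the UA

function allowTextFragmentInvocation(now: number = Date.now()): boolean {
  // Drop timestamps that have aged out of the rolling window.
  while (invocationTimes.length > 0 && now - invocationTimes[0] > WINDOW_MS) {
    invocationTimes.shift();
  }
  if (invocationTimes.length >= MAX_INVOCATIONS) {
    return false; // ignore the text directive for this navigation
  }
  invocationTimes.push(now);
  return true;
}
```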

arturjanc commented 8 months ago

I do think this is a legitimate issue, but I'm worried that it will be difficult to find a solution that addresses the segmentation concern and upholds the security properties of STTF.

The problem is that we know it's possible for STTF to leak data from cross-origin documents under certain conditions, without relying on the existence of additional browser bugs. For example, if a page with sensitive data has an iframe that can be controlled by an attacker (e.g. an ad, etc.), it will be possible for the attacker to abuse STTF to learn information about text rendered close to the iframe (through the use of an Intersection Observer, which will tell the attacker how much of the frame has been rendered, revealing whether the scroll to the chosen text happened or not). This is just one example -- there are other situations where a cross-origin attacker might learn whether STTF triggered, revealing information about the presence of a string chosen by the attacker on the target page.
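To make the signal concrete, a conceptual sketch of what the attacker-controlled iframe might run; the threshold and what the attacker does with the result are illustrative:

```ts
// Conceptual sketch, as seen from inside an attacker-controlled iframe
// embedded near the sensitive text in the victim page.
let frameBecameVisible = false;

const observer = new IntersectionObserver(
  (entries) => {
    for (const entry of entries) {
      if (entry.isIntersecting) {
        // The iframe was scrolled into the embedder's viewport, which
        // correlates with the guessed #:~:text= string having matched nearby.
        frameBecameVisible = true;
      }
    }
  },
  { threshold: 0.5 }
);

// Observing the iframe's own root element reports visibility relative to the
// top-level viewport, so the attacker learns whether the scroll happened.
observer.observe(document.documentElement);
```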

Because of this, matching on word boundaries serves the important purpose of mitigating character-by-character exfiltration of sensitive strings (e.g. access tokens, account recovery codes, etc.).

It seems difficult to draw a line that would allow matching at finer-than-word granularity and still prevent this type of XS-Leak: sensitive tokens can be quite short (sometimes 4 or 6 characters), which makes rate limiting largely ineffective. Per-document and per-domain limits are easy to bypass because after triggering STTF the attacker can simply navigate to a different document under their control, possibly hosted on a different domain (but looking the same as the original document, which will make this difficult for the user to spot). As @bokand pointed out, minimum-length restrictions won't help if there is a known prefix before a sensitive string; and the approach from https://github.com/WICG/scroll-to-text-fragment/issues/251#issuecomment-1858558783 seems too optimistic to me because there will be cases where a sensitive string is present in the page without markup that would separate it from the preceding data.

Basically, my guess is that, unfortunately, none of the ideas we're discussing here will actually significantly mitigate the exfiltration concerns which spurred the introduction of the word segmentation restriction in the first place. It seems reasonable to think of alternative solutions, but in the absence of a robust mitigation I'd be quite wary of removing the word boundary restriction.