Why only match at word boundaries?

Treora commented 4 years ago

The current draft specification only allows text matches to occur at word boundaries (see section §3.5.3). As clarified in example 12:

The substring "mountain range" is word bounded within the string "An impressive mountain range" but not within "An impressive mountain ranger".
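
To make the quoted example concrete, here is a minimal sketch of such a word-boundary check. It is not the spec's algorithm: the spec relies on locale-aware segmentation, whereas this sketch assumes that letters, digits and underscores form words. The function names are hypothetical.

```python
def is_word_char(ch: str) -> bool:
    # Simplifying assumption: letters, digits and "_" are word characters.
    return ch.isalnum() or ch == "_"


def is_word_bounded(text: str, start: int, end: int) -> bool:
    """True if the non-empty slice text[start:end] neither starts nor ends mid-word."""
    before_ok = start == 0 or not (
        is_word_char(text[start - 1]) and is_word_char(text[start])
    )
    after_ok = end == len(text) or not (
        is_word_char(text[end - 1]) and is_word_char(text[end])
    )
    return before_ok and after_ok


def find_word_bounded(text: str, query: str) -> int:
    """Index of the first word-bounded occurrence of query, or -1 if there is none."""
    if not query:
        return -1
    pos = text.find(query)
    while pos != -1:
        if is_word_bounded(text, pos, pos + len(query)):
            return pos
        pos = text.find(query, pos + 1)
    return -1


print(find_word_bounded("An impressive mountain range", "mountain range"))
print(find_word_bounded("An impressive mountain ranger", "mountain range"))
```

The first call finds the match; the second returns -1 because the match would end in the middle of "ranger".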

As for the reason, the only hint in the spec is at the top of the section:

Limiting matching to word boundaries is one of the mitigations to limit cross-origin information leakage.

Given the various other mitigations against cross-origin information leakage (in section §3.4) that appear to rule out automated, repeated pattern searches, are there still scenarios where this is an important issue? And if so, would limiting text to word boundaries really make a significant difference?

Even if both questions would be answered with yes, may it be possible and desirable to somewhat separate the concerns of expressivity and information leakage — perhaps having one algorithm that defines what fragment is being pointed at, while another defines whether this is a permissible target to scroll to? This would give flexibility and leave browsers free to not apply the limitation in situations where it can determine that there is no risk of an attack (e.g. in documents without external resources).

Are there other reasons for imposing this limitation?

Downsides of word boundaries

I came up with at least a few reasons against this limitation, besides the obvious aspect of complicating the algorithm:

Language specificity: As acknowledged in the spec, not every language/script has an equally clear concept of word boundaries. Even if Intl.Segmenter is standardised and makes a definition easier, one could ask whether the reasoning behind the word boundary limitation is equally valid in every language (that depends on what exactly the reasoning is).
The limitation inhibits many valid use cases. I imagine a web-based spell checker service may want to point you at an error within a word. And in the many documents that are not prose but other kinds of strings, there could be various reasons to point at a fragment of a string, e.g. a sequence inside a chromosome.
Browsers and/or other applications could (should?) enable users to create links to selections, but the user experience of link creation would be confusing if a user can target some selections but not others, or if, alternatively, the selection were changed automagically to end at word boundaries.
As most other selection/targeting/annotation tools do not seem to limit selections this way, the limitation could hamper conversions between formats (for example, converting a highlight in a PDF file into a link to its equivalent HTML page). I also imagine that other formats will some day consider adopting links to text fragments without the limitation, because they have no issues of information leakage, thereby creating a subtle inconsistency that may puzzle both developers and users.

Drop word boundary limitation after prefix and before suffix?

Update: this has been done in #148

Even if the word boundary limitation would be retained for information leakage, I fail to see why one would require the prefix to end at a word boundary, or the suffix to start at a boundary. In case a prefix is given, I would expect only the concatenation prefix + textStart to be word bounded, rather than both individually, because the combination is what is being matched. And likewise for (textEnd ?? textStart) + suffix. This way, one could still link to e.g. a typo in a word by giving the rest of the word as its context (see the example in the demo I made: :~:text=poi-,i,-nt).
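
As an illustration of this proposal, the following sketch parses a directive of the form [prefix-,]textStart[,textEnd][,-suffix] and requires only the concatenation prefix + textStart + suffix to be word bounded. It is a simplified assumption-laden sketch, not the spec's algorithm: it ignores textEnd and optional whitespace when matching, treats letters, digits and underscores as word characters, and all function names and the sample sentence are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple
from urllib.parse import unquote


class TextDirective(NamedTuple):
    prefix: Optional[str]
    text_start: str
    text_end: Optional[str]
    suffix: Optional[str]


def parse_text_directive(value: str) -> TextDirective:
    """Parse the part after 'text=': [prefix-,]textStart[,textEnd][,-suffix]."""
    parts = value.split(",")
    prefix = suffix = None
    if len(parts) > 1 and parts[0].endswith("-"):
        prefix = unquote(parts.pop(0)[:-1])
    if len(parts) > 1 and parts[-1].startswith("-"):
        suffix = unquote(parts.pop()[1:])
    if not 1 <= len(parts) <= 2 or not parts[0]:
        raise ValueError("malformed text directive: " + value)
    text_end = unquote(parts[1]) if len(parts) == 2 else None
    return TextDirective(prefix, unquote(parts[0]), text_end, suffix)


def is_word_char(ch: str) -> bool:
    # Simplifying assumption: letters, digits and "_" are word characters.
    return ch.isalnum() or ch == "_"


def is_word_bounded(text: str, start: int, end: int) -> bool:
    before_ok = start == 0 or not (
        is_word_char(text[start - 1]) and is_word_char(text[start])
    )
    after_ok = end == len(text) or not (
        is_word_char(text[end - 1]) and is_word_char(text[end])
    )
    return before_ok and after_ok


def find_match(text: str, d: TextDirective) -> Optional[Tuple[int, int]]:
    """Range of text_start where prefix + text_start + suffix is word bounded."""
    prefix, suffix = d.prefix or "", d.suffix or ""
    needle = prefix + d.text_start + suffix
    pos = text.find(needle)
    while pos != -1:
        if is_word_bounded(text, pos, pos + len(needle)):
            start = pos + len(prefix)
            return (start, start + len(d.text_start))
        pos = text.find(needle, pos + 1)
    return None


directive = parse_text_directive("poi-,i,-nt")
sentence = "There is a typo at this poiint."  # hypothetical sample text
print(find_match(sentence, directive))
```

Here the returned range covers only the superfluous "i", even though that character on its own sits in the middle of a word.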

Using the prefix/suffix strings to allow arbitrary within-a-word selection, instead of only to disambiguate between multiple occurrences, would greatly increase the expressivity of the text fragment directive, and thereby remove the link creation UX issue as one can always generate a directive that describes the user’s selection.

Note that one could still choose to allow (just not require) whitespace after the prefix and before the suffix, as is currently the case. And it would be helpful to still allow the prefix/suffix to be in the previous/next block element, as is required in another example in my demo (one that selects a block element’s whole content).

…and/or between textStart and textEnd?

Yet another example in my demo shows a scenario where it would be helpful if one can point at a long uninterrupted string without having to quote the whole string. Even if boundaries around the match are still required, it could be considered to allow textStart to end and textEnd to begin without a boundary when both are present.

… or altogether?

(apologies for this lengthy polemic; among the many aspects of the spec that seem well thought-through, the reasoning behind this one just kept puzzling me while implementing it!)

Treora commented 4 years ago

…am I the only person with an opinion on this? :)

After opening this issue, I was reminded that even our Latin script was written without word boundaries for many centuries. And it appears that many languages are still written in such scriptio continua, or only started separating words relatively recently. Do all these languages have well-defined, deterministic ways to detect where words start and end? And if they do not, as the notes in TR29 suggest, will these languages either not support text fragments smaller than perhaps a whole sentence, or allow selecting any characters and thereby forgo the intended (security?) benefit that the word boundary requirement provides? I may be biased and would gladly be shown wrong, but it makes me wonder if this word boundary requirement could end up being another one of those short-sighted design decisions made by well-meaning westerners for what is supposed to be a world-wide web.

bokand commented 4 years ago

Sorry for the delay, was AFK for most of this month.

...are there still scenarios where this is an important issue? And if so, would limiting text to word boundaries really make a significant difference?

We've done what we can to prevent cross-origin leakage in general; however, given very specific circumstances it may be possible for some page to leak a single bit. Though repeated attempts should be blocked by the user-gesture requirement, it's always possible there might be browser bugs (or the user might be tricked into repeatedly providing gestures).

The word boundary is meant to protect against exfiltrating unknown tokens, e.g. a password or a bank balance, by brute-forcing a search one character at a time. The thing we were worried about is an attacker finding some way to tell that a navigation successfully invoked the fragment on a sensitive page. If they can find a way to repeat the navigation on, e.g., a "show PIN" page, they could, in a few tries, read out a sensitive token.
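
The concern can be simulated on a plain string. The sketch below assumes a hypothetical one-bit oracle, matches(), that tells the caller whether a query would have matched; in a real browser such a signal would have to come from some side channel. It also assumes the simplified rule that letters, digits and underscores are word characters. The page text and function names are made up for illustration.

```python
import string
from typing import Tuple


def is_word_char(ch: str) -> bool:
    # Simplifying assumption: letters, digits and "_" are word characters.
    return ch.isalnum() or ch == "_"


def is_word_bounded(text: str, start: int, end: int) -> bool:
    before_ok = start == 0 or not (
        is_word_char(text[start - 1]) and is_word_char(text[start])
    )
    after_ok = end == len(text) or not (
        is_word_char(text[end - 1]) and is_word_char(text[end])
    )
    return before_ok and after_ok


def matches(page: str, query: str, word_bounded_only: bool) -> bool:
    """The single leaked bit: would this query match anywhere on the page?"""
    pos = page.find(query)
    while pos != -1:
        if not word_bounded_only or is_word_bounded(page, pos, pos + len(query)):
            return True
        pos = page.find(query, pos + 1)
    return False


def brute_force(
    page: str, known_prefix: str, length: int, word_bounded_only: bool
) -> Tuple[str, int]:
    """Guess up to `length` digits after known_prefix, one character at a time."""
    recovered, probes = "", 0
    for _ in range(length):
        for digit in string.digits:
            probes += 1
            if matches(page, known_prefix + recovered + digit, word_bounded_only):
                recovered += digit
                break
        else:
            break  # no digit matched, so nothing further can be learned this way
    return recovered, probes


page = "Your PIN: 4821 (keep it secret)"  # hypothetical sensitive page
print(brute_force(page, "PIN: ", 4, word_bounded_only=False))
print(brute_force(page, "PIN: ", 4, word_bounded_only=True))
```

Without the restriction, the four digits are recovered in at most ten probes per digit. With it, every partial guess ends mid-word and fails, so an attacker would have to guess the whole token in one query.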

I agree it seems excessive but we wanted to make sure we start by erring towards caution. It's much easier to loosen restrictions than to tighten them. I think this is something that we could look at lifting in the future but I think it's still a bit early (i.e. will security researchers find lots of ways to exfiltrate tokens? defeat user-gesture requirements?). I also don't want to move too quickly without any other browser vendors having expressed implementation interest.

may it be possible and desirable to somewhat separate the concerns of expressivity and information leakage — perhaps having one algorithm that defines what fragment is being pointed at, while another defines whether this is a permissible target to scroll to? This would give flexibility and leave browsers free to not apply the limitation in situations where it can determine that there is no risk of an attack (e.g. in documents without external resources).

This sounds like a positive change to me, though I'm unlikely to have the bandwidth in the near term to make it. I'd be happy to review a PR though.

Even if the word boundary limitation would be retained for information leakage, I fail to see why one would require the prefix to end at a word boundary, or the suffix to start at a boundary...one could still link to e.g. a typo in a word by giving the rest of the word as its context (see the example in the demo I made: :~:text=poi-,i,-nt).

This is quite clever! Yes, I think this is just an oversight on our part - if the prefix+match+suffix is word bounded then the match need not be.

it could be considered to allow textStart to end and textEnd to begin without a boundary when both are present.

Potentially...though this would allow brute-forcing a token if you know what follows it. e.g. "Balance: 3,141,592 USD" could be searched in this case by using "USD" as textEnd and varying textStart.
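
A small sketch of this objection, under the same simplified assumption that letters, digits and underscores are word characters (so the commas in the number count as boundaries here, which a real segmenter may treat differently). range_matches() and its require_inner_boundaries flag are hypothetical names, not part of the spec.

```python
def is_word_char(ch: str) -> bool:
    # Simplifying assumption: letters, digits and "_" are word characters.
    return ch.isalnum() or ch == "_"


def boundary_at(text: str, i: int) -> bool:
    """True if position i (between text[i-1] and text[i]) is not inside a word."""
    if i == 0 or i == len(text):
        return True
    return not (is_word_char(text[i - 1]) and is_word_char(text[i]))


def range_matches(
    page: str, text_start: str, text_end: str, require_inner_boundaries: bool
) -> bool:
    """Would a textStart,textEnd range match somewhere on the page?"""
    s = page.find(text_start)
    while s != -1:
        s_end = s + len(text_start)
        e = page.find(text_end, s_end)
        while e != -1:
            e_end = e + len(text_end)
            # The outer edges of the whole range must always be word bounded.
            outer_ok = boundary_at(page, s) and boundary_at(page, e_end)
            # The inner edges are what the relaxation would stop checking.
            inner_ok = boundary_at(page, s_end) and boundary_at(page, e)
            if outer_ok and (inner_ok or not require_inner_boundaries):
                return True
            e = page.find(text_end, e + 1)
        s = page.find(text_start, s + 1)
    return False


page = "Balance: 3,141,592 USD"
# A correct partial guess that ends in the middle of the digit group "141":
print(range_matches(page, "Balance: 3,14", "USD", require_inner_boundaries=True))
print(range_matches(page, "Balance: 3,14", "USD", require_inner_boundaries=False))
# A wrong guess fails either way:
print(range_matches(page, "Balance: 3,15", "USD", require_inner_boundaries=False))
```

With the inner boundaries relaxed, the correct partial guess matches and the wrong one does not, which is exactly the one-bit signal that allows extending textStart character by character.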

Treora commented 3 years ago

Update: with #148 merged, this word boundary issue has become less severe. At least one can point at some characters inside a word now. I will update the polyfill I made, and hope Chromium will reflect the change soon too(?).

For clarity: given that several of the above concerns still hold, I would nevertheless still advocate dropping the word boundary restriction altogether, or at least turning it into an optional restriction for risky cases (but I’d think that the prevention of automated repeated attempts would already suffice in those cases).

bokand commented 3 years ago

No guarantee but I'll see if I can land a change in Chromium for the M88 branch point.

bokand commented 3 years ago

Bah, I missed the M88 branch point by a few hours. Anyway, I made the changes to Blink to make it match the updated spec; it'll go out in M89.

bokand commented 10 months ago

Cleaning up old issues...

I don't think there's anything actionable left here. I think the word boundary restriction is a fundamental security protection, so I don't have any plans to change that. That said, this could be taken up in a broader standards venue once this makes its way into the HTML spec.

WICG / scroll-to-text-fragment