WICG / scroll-to-text-fragment

Proposal to allow specifying a text snippet in a URL fragment

Please reconsider the use of collator-based search #233

Open hsivonen opened 11 months ago

hsivonen commented 11 months ago

Step 6.1 of find a range from a node list has this sentence: “The string search must be performed using a base character comparison, or the primary level, as defined in [UTS10].” It is followed by the note: “Intuitively, this is a case-insensitive search also ignoring accents, umlauts, and other marks.”

There are multiple problems with this:

In Chrome, the meaning of links depends on the UI language

Consider the link https://hsivonen.com/test/moz/text-fragment-language.html#:~:text=H%C3%A4n . Test this in Chrome with the UI language set to English, German, and Finnish. A different word (in a Finnish part of the text) is highlighted in the three different UI languages.

Also test the link https://hsivonen.com/test/moz/text-fragment-language.html#:~:text=ILIK in Chrome with the UI language set to English and Turkish. A different word is highlighted depending on the UI language.

Obviously, it’s very bad for the meaning of links to depend on the UI language!
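The locale dependence is easy to demonstrate with Intl.Collator, which exposes the same CLDR search tailorings for whole-string comparison:

```js
// Primary-level ("base") comparison with CLDR search collations, per locale.
const opts = { usage: "search", sensitivity: "base" };
new Intl.Collator("en", opts).compare("hän", "han"); // 0: 'ä' matches 'a'
new Intl.Collator("de", opts).compare("hän", "han"); // 0: the umlaut is only a secondary difference
new Intl.Collator("fi", opts).compare("hän", "han"); // non-zero: 'ä' is a distinct base letter in Finnish
```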

The current situation is not interoperable

Safari (tested 16.6 on Ventura) appears to perform a case-insensitive search that doesn’t ignore diacritics and does not appear to depend on either the UI language or the language of the content (i.e. the case mapping is wrong for languages that use Turkish i).

The use case isn’t well-motivated

Why should the search ignore certain aspects of the text? This seems unnecessary when the text fragments are generated by copying concrete text from the page (by plain copy-paste, by using Chrome’s context menu item “Copy link to highlight”, or by a server-side system having loaded the target page for server-side processing). It seems to mainly cater to typing text fragments by hand when the text input method in use isn’t well matched to the page text (e.g. using a U.S. English keyboard to try to match English text that has French-style accents like naïve or café). What use case justifies the resulting complexity?

The spec doesn’t actually say what Chrome does

The definition is too vague to actually use collator-based search. The spec doesn’t say how to choose what collation to use. The referenced spec, UTS 10, specifies DUCET, but no browser carries an implementation of raw DUCET. All browsers use CLDR collations instead. Chrome seems to be using the CLDR search collation for the UI language of the browser (as with cmd/ctrl-f).

Distinction from cmd/ctrl-f

Using the search collation for the UI language makes sense for cmd/ctrl-f, since the text is user-entered, so it’s reasonable to align locale-dependent behavior to the locale of the user as opposed to the locale of the page. However, links should mean the same thing when shared between people who use different UI languages.

Speccing the root collation doesn’t make sense

The whole point of collator-based search is to make the meaning of accent-insensitive and case-insensitive language-dependent by reusing collation data. The meaning of case-insensitive changes for Turkish i. A bunch of European Latin-script languages treat a letter with what English analyzes as an accent as a base letter in its own right.

Saying that the CLDR search root collation shall always be used would make things always wrong for the languages that were supposed to be the beneficiaries of collator-based search.

Collator-based search doesn’t deal with multiple language spans in the haystack

Collator-based search maps the text for both the needle and the haystack into collation elements, performs the search over collation elements, and then tries to correlate the result back to the original text.

This requires the use of the same mapping from text to collation elements for both the needle and the haystack. In practice, the allocation of tailored primary weights in a language-specific tailoring is scoped to that tailoring, so collator-based search isn’t suited for handling spans of different languages in the haystack.
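To make the “same mapping for needle and haystack” requirement concrete, here is a deliberately simplified sketch; the primary weights are invented for illustration and are not real CLDR data:

```js
// Toy model: map characters to primary weights under two tailorings,
// then search over the weight sequences (as collator-based search does).
const rootPrimaries    = { h: 1, a: 2, n: 3, "ä": 2, " ": 0 }; // 'ä' folds to 'a'
const finnishPrimaries = { h: 1, a: 2, n: 3, "ä": 4, " ": 0 }; // 'ä' is its own base letter

const toPrimaries = (text, table) => [...text.toLowerCase()].map((c) => table[c]);

function findByCollation(haystack, needle, table) {
  const h = toPrimaries(haystack, table);
  const n = toPrimaries(needle, table);
  for (let i = 0; i + n.length <= h.length; i++) {
    if (n.every((w, j) => w === h[i + j])) return i; // index maps back into the original text
  }
  return -1;
}

findByCollation("han hän", "hän", rootPrimaries);    // 0: matches "han", the wrong word for Finnish text
findByCollation("han hän", "hän", finnishPrimaries); // 4: matches only "hän"
```

The needle and haystack have to go through the same table; mixing tailorings per language span in the haystack would make the weights incomparable.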

Creating a Web-exposed dependency on a complex operation

The spec as written brings collator-based search into the scope of the Web Platform when previously collator-based search hasn’t been a standardized part of the Web Platform. (WebKit and Blink implement their cmd/ctrl-f functionality via collator-based search but Firefox doesn’t. window.find exposes this to the Web, but the API remains unstandardized.)

Notably, while Intl.Collator exposes search collation data, it does so only in the context of providing a sorting API. Intl.Collator doesn’t provide a search API, so it can only be used for testing full-string matches–not substring or (properly) even prefix matches. (I think it’s a design error for Intl.Collator to expose search data.)
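For illustration (standard Intl.Collator usage, nothing beyond what’s shipped today):

```js
const search = new Intl.Collator("en", { usage: "search", sensitivity: "base" });
search.compare("café", "cafe"); // 0: whole-string match at the primary level
// There is no substring API: answering "does 'cafe' occur inside
// 'a café in Paris'?" would require reimplementing the search over
// collation elements yourself.
```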

Collator-based search means shipping ICU4C or undertaking major implementation effort

Before implementing the collator for ICU4X, I surveyed the use cases of collation in Firefox and determined that certain capabilities of ICU4C’s collator aren’t exposed to the Web or otherwise needed by Firefox. (These include: collator-based search, generating search keys that a database can store after up-front computation and then use multiple times with a computationally lighter comparison operation, and generating an alphabetical index as seen in the Contacts apps on mobile phones with items grouped under exemplar characters–typically the recitable alphabet of the locale.)

I wish to keep the option open for Firefox to migrate from ICU4C to ICU4X without bringing collator-based search into scope as a blocker feature to implement.

There’s a scope document that contains the rationale for excluding collator-based search from ICU4X.


I suggest revisiting case-insensitivity and accent-insensitivity in light of use cases, and reassessing whether the use cases that would call for them are really worth the complexity.

I am not advocating in favor of case-insensitivity, but I note that implementing case-insensitivity on top of the Unicode database’s notion of fold case would involve much less complexity than involving a collator in any way. Unlike in Safari, it could even be sensitive to Turkish i based on the language declared for each node in text content without making things infeasible. Still, even that level of complexity would need to be justified by use cases.
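As a rough sketch of what that could look like (my illustration, not a proposal for spec text; toLocaleLowerCase stands in here for proper Unicode case folding, and declaredLang is a hypothetical helper):

```js
// Sketch: case-insensitive matching against a text node, keyed off the
// node's declared language so that Turkish dotted/dotless i behaves correctly.
function declaredLang(textNode) {
  const el = textNode.parentElement?.closest("[lang]");
  return el?.getAttribute("lang") ?? undefined;
}

function containsCaseInsensitive(textNode, needle) {
  const lang = declaredLang(textNode);
  return textNode.data.toLocaleLowerCase(lang).includes(needle.toLocaleLowerCase(lang));
}

// "ILIK".toLocaleLowerCase("tr") is "ılık"; "ILIK".toLocaleLowerCase("en") is "ilik"
```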

Accent-insensitivity is a big enough deal to do well that the use cases would need to be extraordinarily compelling.

Collator-based search is conceptually insensitive to the Unicode Normalization Form. However, in practice the Web is in Normalization Form C, and in any case generating the URL fragments by copying whatever text the page happens to have doesn't seem to justify making the comparison insensitive to the Unicode Normalization Form.
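For reference, the only difference normalization-insensitivity would paper over is this kind of thing:

```js
const nfc = "café";               // 'é' as one precomposed code point (U+00E9)
const nfd = nfc.normalize("NFD"); // 'e' followed by U+0301 COMBINING ACUTE ACCENT
nfc === nfd;                                   // false: a code-point-level search would miss it
nfc.normalize("NFC") === nfd.normalize("NFC"); // true
```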

bokand commented 11 months ago

@hsivonen: This is very thorough and convincing, thank you for writing it up!

I'll confess that I'm not an expert in this area so I didn't appreciate some of the nuances back when I was considering these issues. IIRC our primary motivation here was reusing the find-in-page machinery and so we worked backwards to try and specify (loosely, it turns out) what that was doing. But your point about why Ctrl-F find should work differently from text-directives makes sense to me. Definitely agree that having UI language change the meaning of a link is wrong - I hadn't realized the collator depended on the UI language!

I'll double-check with consumers of this API internally to make sure I'm not missing anything but I believe we'd be ok removing the collator-based search.

As for case-sensitivity...I'll have a think on it but as long as it doesn't make links more brittle I don't feel strongly about it. Given that Safari also does this I wonder if they have an opinion here. +@annevk

annevk commented 11 months ago

I think we'd be okay with changing that. As Henri notes, it's not important for the use cases.

fantasai commented 11 months ago

As for case-sensitivity...I'll have a think on it but as long as it doesn't make links more brittle I don't feel strongly about it.

The important thing is to reference the source DOM and not take text-transform into account. ;) However, some implementations copy-paste the result of text-transform, which is perhaps inconvenient here... Note: spec, test

bokand commented 11 months ago

text-transform is actually the edge-case I had in mind about why case insensitivity might be useful: if a user wants to create a text directive it'd be confusing if they copy/paste some text from the page and it doesn't work.

That said, I assume hand-rolling links is pretty rare (I do it sometimes but I'm probably atypical here :) and hand-rolling them over text-transformed content (or other source<-->rendered transforms) is rarer still. It also isn't an issue when links are generated programmatically (if done correctly by the UA), so it's perhaps not worth worrying about.

bokand commented 9 months ago

I think everyone's in agreement that we should remove collator-based search. We're also ok to make the search case-sensitive so I'll start working on spec-changes in this direction.

However, some implementations copy-paste the result of text-transform, which is perhaps inconvenient here...

@fantasai - the copy-paste example I'm more worried about right now is whitespace, since DOM nodes will include whitespace that's collapsed in the rendered output and not included in copy/paste. I'm worried even programmatic tools will get this wrong, and server-side systems are almost certain to do some basic normalization; I'd guess whitespace collapsing would be a common one.
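For example (just illustrating the mismatch, not proposing anything yet):

```js
const p = document.createElement("p");
p.innerHTML = "some\n      wrapped text";
document.body.append(p);
p.firstChild.data; // "some\n      wrapped text" (raw DOM CharacterData)
p.innerText;       // "some wrapped text" (whitespace collapsed, as rendered)
```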

I need to familiarize myself a bit more with the relevant specs first (any pointers would be appreciated) so I don't have great ideas yet, but I wonder if it'd make sense to perform the search over some processed form of the text rather than the source DOM (e.g. what's seen via innerText, maybe?)

annevk commented 9 months ago

I suspect we want to essentially use window.find() still, but toggle some aspects of it.

Having said that, I haven't double checked how it's implemented today, but my preference would be that this would not be a completely novel code path.

annevk commented 9 months ago

I see, the specification already has the whitespace issue: #98. And it seems like window.find() has that "issue" as well, at least in Chromium and WebKit.

So what has Chromium implemented if it's not in the specification?

zcorpan commented 9 months ago

text-transform is actually the edge-case I had in mind about why case insensitivity might be useful: if a user wants to create a text directive it'd be confusing if they copy/paste some text from the page and it doesn't work.

Chromium seems to use the transformed casing rather than the source casing when creating a text fragment URL from the context menu. Demo: https://software.hixie.ch/utilities/js/live-dom-viewer/#:~:text=LIVE%20DOM%20VIEWER,-Markup%20to%20test

That seems like an issue if we move to case-sensitive matching.

Since styling can change for various reasons (e.g. viewport size), it seems more robust to match against the non-transformed text. That also seems in line with the following text-transform spec text:

This property transforms text for styling purposes. It has no effect on the underlying content, and must not affect the content of a plain text copy & paste operation.

https://drafts.csswg.org/css-text/#text-transform

bokand commented 9 months ago

Chrome performs the search on the text as rendered (i.e. post text-transform, whitespace collapsing, and any other transforms that may be performed). I wasn't sure how to refer to this in spec-land so I left a note and filed #98 at the time.

I agree it's bad for the result to depend on style, so ideally we'd match on the pre-text-transform text. But I'm worried about whitespace handling if we were to simply match the DOM node's CharacterData.

WDYT of searching across post-whitespace-collapsed text but before text-transform is applied (are there any other ways input text gets changed to be rendered?). Is there some way in spec to refer to this state? Alternatively we could spec the search to perform some string trimming...

That also seems in line with the following text-transform spec text:

This property transforms text for styling purposes. It has no effect on the underlying content, and must not affect the content of a plain text copy & paste operation.

Interesting that innerText (at least as implemented) includes the result of the text-transform...wonder if that's an intentional choice?
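For example (behavior as implemented today, at least):

```js
const p = document.createElement("p");
p.style.textTransform = "uppercase";
p.textContent = "live dom viewer";
document.body.append(p);
p.firstChild.data; // "live dom viewer" (source DOM text is untouched)
p.innerText;       // "LIVE DOM VIEWER" (reflects text-transform, as rendered)
```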

zcorpan commented 9 months ago

Re innerText, that's defined in https://html.spec.whatwg.org/multipage/dom.html#rendered-text-collection-steps

I think text-transform was specified because that's how it was implemented. https://github.com/rocallahan/innerText-spec/issues/2 is the only issue I found which touches on text-transform but only in relation to ::first-line.

In any case, the innerText algorithm seems to me close enough to what's needed here that we could reuse it (with a flag for text-transform if needed). It doesn't pierce into shadow trees as specified, though.

hsivonen commented 8 months ago

I suspect we want to essentially use window.find() still, but toggle some aspects of it.

That makes sense in the context of how window.find() works in Gecko (a character-to-character transform based on the Unicode database) but not in the context of window.find() in Blink and WebKit. Blink and WebKit use collator-based search (with collation data determined by the UI locale) for window.find(), so delegating to window.find() in Blink and WebKit would be contrary to what this issue is asking for. (See the comment I left on the issue about window.find() at the time when I filed this issue.)

(I don't know what pre-processing window.find() does in Blink and WebKit before showing the data to the collator. Reusing that preprocessing could make sense.)

bokand commented 8 months ago

In any case, the innerText algorithm seems to me close enough to what's needed here that we could reuse it (with a flag for text-transform if needed). It doesn't pierce into shadow trees as specified, though.

Re: shadow trees, there's a question in #190 about whether the search should pierce. I think it should, but I'm currently measuring how often this happens in practice; I'd appreciate additional opinions there. The search is currently specified as a tree walk that collects a search string from text nodes. We could keep the shadow-piercing property by keeping the tree walk and collecting the innerText of leaf elements.

One issue that we'd have to resolve is how to map a match in the innerText string back to a DOM range.
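Roughly the kind of bookkeeping I have in mind (a sketch only, with the caveats noted in the comments):

```js
// Rough sketch (my illustration, not the spec algorithm): concatenate text
// node data while remembering where each node's text starts, so an offset in
// the combined string can be mapped back to a text node + offset for a Range.
// It ignores visibility/whitespace handling, doesn't pierce shadow trees, and
// assumes the match falls within a single text node.
function findRangeForText(root, needle) {
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  const chunks = []; // { node, start }: start offset of the node's text in `combined`
  let combined = "";
  for (let node = walker.nextNode(); node; node = walker.nextNode()) {
    chunks.push({ node, start: combined.length });
    combined += node.data;
  }
  const i = combined.indexOf(needle);
  if (i === -1 || chunks.length === 0) return null;
  let chunk = chunks[0];
  for (const c of chunks) if (c.start <= i) chunk = c; // last chunk starting at or before the match
  const range = document.createRange();
  range.setStart(chunk.node, i - chunk.start);
  range.setEnd(chunk.node, i - chunk.start + needle.length); // single-node assumption
  return range;
}
```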

so delegating to window.find() in Blink and WebKit would be contrary to what this issue is asking for.

Would it make sense to spec a shared "find text" algorithm with the knobs needed to differentiate behavior among its callers? This algorithm would be a common place that defines how the search text is collected, what's considered "visible", etc. (i.e. much of what this spec does). E.g. the text directive search would call "find text" with collator-based set to false, while window.find could call "find text" with collator-based set to true (if we decide that's how it should work, or if it's needed for web compat).

@annevk is this essentially what you meant?

annevk commented 8 months ago

@bokand yeah that is what I meant.

I think it has to pierce shadow trees from a UX perspective. We have to be careful, though, that if we ever add some kind of API, it doesn't result in shadow trees being exposed.

srl295 commented 6 months ago

Whether 'accents, umlauts, and other marks' are significant depends on the language. Perhaps what should be considered is only the normalization of the text (which I didn't see mentioned), since, as was noted, the spec is citing existing text on a page, not possible text on the page.