bluesky-social / atproto

Social networking technology created by Bluesky

Post TextSlice string indexing is UTF-16 (JS default). What should it be in Lexicon refactor? #621

Closed: bnewbold closed this 1 year ago

bnewbold commented 1 year ago

The entity text slice in posts is currently counted Typescript/Javascript style, which means UTF-16 code units. We should consider counting by codepoints or even graphemes. The issue with counting in UTF-8 (or UTF-16, or technically even codepoints) is that, when rendering, the string could get sliced illegally, resulting in bugs or mangled output. By "illegally" I mean creating an invalid UTF-8 sub-string, or splitting a multi-codepoint grapheme (eg, a compound emoji) into something new (eg, two separate emoji) at the formatting boundary.
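For illustration, here is a minimal TypeScript sketch of both failure modes (the string is made up; the comments show actual JS behavior):

    // One visible "family" emoji is several codepoints joined by zero-width
    // joiners (ZWJ); in UTF-16, each emoji codepoint is a surrogate pair.
    const post = "hi 👨‍👩‍👧";
    console.log(post.length); // 11 UTF-16 code units for 4 visible characters

    // Slicing mid-surrogate produces an invalid lone surrogate (mangled output):
    console.log(post.slice(0, 4)); // "hi \ud83d" -- renders as "hi �"

    // Slicing at a codepoint boundary inside the cluster splits the grapheme,
    // turning the single family emoji into a lone person emoji:
    console.log(post.slice(0, 5)); // "hi 👨"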



Here is a concrete test case:

[screenshot: post rendered with entity highlights offset by a few characters]

This is Rust code using the entities on the post to highlight the slices, but the highlights are off by a couple of characters because of smartquotes. The Typescript/Javascript code that generated the post counted in UTF-16, while the Rust code is counting in UTF-8.

The record looks like:

    "record": {
      "$type": "app.bsky.feed.post",
      "createdAt": "2023-03-02T23:48:38.303Z",
      "entities": [
        {
          "index": {
            "end": 40,
            "start": 20
          },
          "type": "mention",
          "value": "did:plc:nsbnrfcsymcsgmtbh7g5yjdy"
        },
        {
          "index": {
            "end": 103,
            "start": 92
          },
          "type": "link",
          "value": "https://bsky.social"
        }
      ],
      "reply": {
        "parent": {
          "cid": "bafyreigzmarz6njvhy2va774eyk7kdjqa6uexca73knyiqv4huywcmyvn4",
          "uri": "at://did:plc:oky5czdrnfjpqslsw2a5iclo/app.bsky.feed.post/3jpyhrfjyek2c"
        },
        "root": {
          "cid": "bafyreiapuidi7rdhrhgy2whcvxlgj2m265ouwow3uakinpv7lspev5dg3a",
          "uri": "at://did:plc:nsbnrfcsymcsgmtbh7g5yjdy/app.bsky.feed.post/3jpyhhpffhc2l"
        }
      },
      "text": "So right now you’re @gabriel.bsky.social — you can keep that name, which we gave you on the bsky.social domain we own. But let’s say you own the domain “gabriel.com”. Then you can become @gabriel.com on here too."
    },

The text field as UTF-8:

b'So right now you\xe2\x80\x99re @gabriel.bsky.social \xe2\x80\x94 you can keep that name, which we gave you on the bsky.social domain we own. But let\xe2\x80\x99s say you own the domain \xe2\x80\x9cgabriel.com\xe2\x80\x9d. Then you can become @gabriel.com on here too.'

Smartquotes!
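To make the off-by-two concrete, here is a small TypeScript sketch; the helper name is invented for illustration:

    // Convert a UTF-16 code unit index to the corresponding UTF-8 byte offset.
    function utf16IndexToUtf8Offset(text: string, utf16Index: number): number {
      return new TextEncoder().encode(text.slice(0, utf16Index)).length;
    }

    const text =
      "So right now you’re @gabriel.bsky.social — you can keep that name, ...";

    // The record's UTF-16 indices are correct in JS:
    console.log(text.slice(20, 40)); // "@gabriel.bsky.social"

    // But the ’ before the mention is 1 UTF-16 code unit yet 3 UTF-8 bytes,
    // so a UTF-8 consumer needs byte offsets 22..42, not 20..40:
    console.log(utf16IndexToUtf8Offset(text, 20)); // 22
    console.log(utf16IndexToUtf8Offset(text, 40)); // 42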

dholms commented 1 year ago

Porting a convo from notion:

@bnewbold:

To be clear, you mean for things like mentions and URLs? I’d really go with graphemes. UTF-16 is almost worse than UTF-8 b/c all the euro languages will “just work” but then stuff like emoji and Bengali will be broken. The breakage would stem from applying an href wrapper or formatting changes at codepoint boundaries and splitting grapheme clusters in the process. I would also question the frontend stuff being UTF-16. Won’t somebody eventually write native iOS and macOS apps in Objective-C or Swift or Dart or whatever? Much of the Android ecosystem is JVM (and thus, possibly, UTF-16), but not all of it.

@dholms:

yup, for mentions & urls. and to clarify what i mean here (tho i think we’re on the same page): just that text slice indexes will be calculated off of utf-16 code units. at the data layer it’ll still be stored as utf8.

“every frontend” was definitely strong wording here. I should say “most frontends”. & I feel more comfortable sticking native devs with the overhead of remembering “utf-16” than web devs who are gonna forget about the library calculateStringSlice fn & just call .slice on the string.

oh interesting, just checked & in swift “string” is “extended grapheme clusters”; in java it’s utf-16 code units. this type of thing is why I’m a bit worried about grapheme clusters for text slices: https://news.ycombinator.com/item?id=21693898
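All three "lengths" under discussion can be compared in one place; a sketch using Intl.Segmenter (available in modern JS runtimes) for the grapheme count:

    const s = "🇺🇸👨‍👩‍👧"; // a flag plus a family emoji: two visible glyphs

    console.log(s.length);                           // 12 -- UTF-16 code units (the Java/JS view)
    console.log([...s].length);                      //  7 -- Unicode codepoints
    console.log(new TextEncoder().encode(s).length); // 26 -- UTF-8 bytes

    // Extended grapheme clusters (the Swift view):
    const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
    console.log([...seg.segment(s)].length);         // 2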

@bnewbold:

oof, yeah, good point. ok, i’m convinced that we should not specify it in graphemes, and should instead say that clients should be careful not to split graphemes when declaring ranges.

some of my concerns would/will also be covered by having nice test vectors and some kind of “protocol test suite” (or even a test instance to point at) to surface this kind of thing early and often.

I do still feel like UTF-8 is the boring, standard, expected thing to specify. IIRC UTF-8 encoding is built in to the JS standard library, and probably any language where UTF-16 is the norm, while in UTF-8 languages it may require hunting down a third-party lib for UTF-16 support. fact check: actually, UTF-16 seems to be at least partially supported in the python, rust, and golang standard libraries. I’m convinced about the grapheme stuff, and UTF-8 vs UTF-16 isn’t a hill I would die on.

@dholms:

Yeah, the grapheme stuff is just a little too flimsy for me. I can be convinced on utf-8 pretty easily. basically my viewpoint is: I agree that utf8 is the “right” choice, but so many web devs are gonna shoot themselves in the foot on this that I kinda just wanna ward it off & say utf-16.

@bnewbold:

will a large fraction of web devs use our lexicon library, and can we add some helper routine to make it easier? I’m almost even tempted to say we should represent the text as a sequence of chunks with optional entity tags on each chunk, to avoid the issue, though it feels like that boat already sailed.
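A hypothetical sketch of that chunked representation (the types are invented here, not a lexicon proposal):

    // Entity ranges become structure instead of indices, so there is
    // nothing to miscount across encodings.
    type EntityTag = { type: "mention" | "link"; value: string };
    type Chunk = { text: string; entity?: EntityTag };

    const post: Chunk[] = [
      { text: "So right now you’re " },
      {
        text: "@gabriel.bsky.social",
        entity: { type: "mention", value: "did:plc:nsbnrfcsymcsgmtbh7g5yjdy" },
      },
      { text: " — you can keep that name, ..." },
    ];

    // Rendering is a simple map; the full text never gets sliced.
    const plainText = post.map((c) => c.text).join("");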

@dholms:

yeah i think that boat has already sailed. honestly it might be illustrative to check out paul’s code & see about wrapping this in a helper method 🤔

one thing to point out is that this is an application layer decision. for stuff that’s directly handled by lex/atproto i feel totally happy using utf8 (max string length, for instance). with something like this, we have to start jamming application layer semantics (rich text) into our lex code.

@bnewbold:

one last thought I have on this: I think it would be basically the one single place we require UTF-16 handling in our entire protocol+application stack, while we do already require UTF-8 handling in a couple places. I don’t want to rabbit hole too deep on this though.

dholms commented 1 year ago

Interested in @pfrazee's thoughts here:

I'm really torn but leaning towards utf8. If we add utf16, we now require support for three different text encodings to support our app.bsky stack 🙃

Utf8 is definitely the "right" choice technically, but I just know for a fact it's gonna lead to so many footguns & mistakes from webdevs who use javascript's built-in string methods, no matter how many helper methods we ship. We can maybe get around it if we ship a more involved "rich text framework" that helps with more than just character counting.
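A hypothetical sketch of that kind of framework surface (names invented here): webdevs compose segments and never touch indices, so the encoding choice stays a library detail:

    class RichTextBuilder {
      private segments: { text: string; entity?: { type: string; value: string } }[] = [];

      text(s: string): this {
        this.segments.push({ text: s });
        return this;
      }

      mention(handle: string, did: string): this {
        this.segments.push({ text: handle, entity: { type: "mention", value: did } });
        return this;
      }

      // The encoding decision lives in one place: swap the length function
      // and every produced record follows.
      build(lengthOf: (s: string) => number) {
        const entities: { index: { start: number; end: number }; type: string; value: string }[] = [];
        let text = "";
        for (const seg of this.segments) {
          const start = lengthOf(text);
          text += seg.text;
          if (seg.entity) {
            entities.push({ index: { start, end: lengthOf(text) }, ...seg.entity });
          }
        }
        return { text, entities };
      }
    }

    // e.g. UTF-8 byte indices:
    const utf8Len = (s: string) => new TextEncoder().encode(s).length;
    const record = new RichTextBuilder()
      .text("So right now you’re ")
      .mention("@gabriel.bsky.social", "did:plc:nsbnrfcsymcsgmtbh7g5yjdy")
      .build(utf8Len);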

dholms commented 1 year ago

The short is:

My preference order is

AaronGoldman commented 1 year ago

https://hsivonen.fi/string-length/

pfrazee commented 1 year ago

Okay here's where I'm at with this:

bnewbold commented 1 year ago

I've had sour experiences with the Unicode delimiter concept in the library world. There was a time when people would use non-printing ASCII characters (maybe it was 0x1E, the record separator, the one you are proposing here?) to split author names for index sorting (eg, sometimes you want to split the "Mc" off a surname for sorting; it's culture/region dependent). These special ASCII characters inevitably leaked through the whole ecosystem and continue to contaminate catalogs and cause problems today. It is just really easy for folks to copy the text around directly without stripping the chars.

pfrazee commented 1 year ago

Yep that makes perfect sense, let's drop that idea.


dholms commented 1 year ago

Closed via https://github.com/bluesky-social/atproto/pull/658