bluesky-social / atproto

Social networking technology created by Bluesky

Post TextSlice string indexing is UTF-16 (JS default). What should it be in Lexicon refactor? #621

Closed: bnewbold closed this 1 year ago

bnewbold commented 1 year ago

The entity text slice in posts is currently counted Typescript/Javascript style, which means UTF-16 code units. We should consider counting by codepoints or even graphemes. The issue with counting in UTF-8 (or UTF-16, or technically even codepoints) is that, when rendering, the string could get sliced illegally, resulting in bugs or mangled output. By "illegally" I mean creating an invalid UTF-8 sub-string, or splitting a multi-codepoint grapheme (eg, a compound emoji) into something new (eg, two separate emoji) at the formatting boundary.
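For illustration, here is a minimal TypeScript sketch of both failure modes (the string is made up; the comments show actual JS behavior):

    // One visible "family" emoji is several codepoints joined by zero-width
    // joiners (ZWJ); in UTF-16, each emoji codepoint is a surrogate pair.
    const post = "hi 👨‍👩‍👧";
    console.log(post.length); // 11 UTF-16 code units for 4 visible characters

    // Slicing mid-surrogate produces an invalid lone surrogate (mangled output):
    console.log(post.slice(0, 4)); // "hi \ud83d" -- renders as "hi �"

    // Slicing at a codepoint boundary inside the cluster splits the grapheme,
    // turning the single family emoji into a lone person emoji:
    console.log(post.slice(0, 5)); // "hi 👨"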



Here is a concrete test case:

[screenshot: post rendered with entity highlights offset by a few characters]

This is Rust code using the entities on the post to highlight the slices, but the highlights are off by a couple of characters because of smartquotes. The Typescript/Javascript code that generated the post counted in UTF-16, while the Rust code is counting in UTF-8.

The record looks like:

    "record": {
      "$type": "app.bsky.feed.post",
      "createdAt": "2023-03-02T23:48:38.303Z",
      "entities": [
        {
          "index": {
            "end": 40,
            "start": 20
          },
          "type": "mention",
          "value": "did:plc:nsbnrfcsymcsgmtbh7g5yjdy"
        },
        {
          "index": {
            "end": 103,
            "start": 92
          },
          "type": "link",
          "value": "https://bsky.social"
        }
      ],
      "reply": {
        "parent": {
          "cid": "bafyreigzmarz6njvhy2va774eyk7kdjqa6uexca73knyiqv4huywcmyvn4",
          "uri": "at://did:plc:oky5czdrnfjpqslsw2a5iclo/app.bsky.feed.post/3jpyhrfjyek2c"
        },
        "root": {
          "cid": "bafyreiapuidi7rdhrhgy2whcvxlgj2m265ouwow3uakinpv7lspev5dg3a",
          "uri": "at://did:plc:nsbnrfcsymcsgmtbh7g5yjdy/app.bsky.feed.post/3jpyhhpffhc2l"
        }
      },
      "text": "So right now you’re @gabriel.bsky.social — you can keep that name, which we gave you on the bsky.social domain we own. But let’s say you own the domain “gabriel.com”. Then you can become @gabriel.com on here too."
    },

The text field as UTF-8:

b'So right now you\xe2\x80\x99re @gabriel.bsky.social \xe2\x80\x94 you can keep that name, which we gave you on the bsky.social domain we own. But let\xe2\x80\x99s say you own the domain \xe2\x80\x9cgabriel.com\xe2\x80\x9d. Then you can become @gabriel.com on here too.'

Smartquotes!
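To make the off-by-two concrete, here is a small TypeScript sketch; the helper name is invented for illustration:

    // Convert a UTF-16 code unit index to the corresponding UTF-8 byte offset.
    function utf16IndexToUtf8Offset(text: string, utf16Index: number): number {
      return new TextEncoder().encode(text.slice(0, utf16Index)).length;
    }

    const text =
      "So right now you’re @gabriel.bsky.social — you can keep that name, ...";

    // The record's UTF-16 indices are correct in JS:
    console.log(text.slice(20, 40)); // "@gabriel.bsky.social"

    // But the ’ before the mention is 1 UTF-16 code unit yet 3 UTF-8 bytes,
    // so a UTF-8 consumer needs byte offsets 22..42, not 20..40:
    console.log(utf16IndexToUtf8Offset(text, 20)); // 22
    console.log(utf16IndexToUtf8Offset(text, 40)); // 42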

dholms commented 1 year ago

Porting a convo from notion:

@bnewbold:

To be clear, you mean for things like mentions and URLs? I’d really go with graphemes. UTF-16 is almost worse than UTF-8 b/c all the euro languages will “just work” but then stuff like emoji and Bengali will be broken. The breakage would stem from applying an href wrapper or formatting changes at codepoint boundaries and splitting grapheme clusters in the process. I would also question the frontend stuff being UTF-16. Won’t somebody eventually write native iOS and macOS apps in Objective-C or Swift or Dart or whatever? Much of the Android ecosystem is JVM (and thus, possibly, UTF-16), but not all of it.

@dholms:

yup, for mentions & urls. and to clarify what i mean here (tho i think we’re on the same page): just that text slice indexes will be calculated off of utf-16 code units. at the data layer it’ll still be stored as utf8.

“every frontend” was definitely strong wording here. I should say “most frontends”. & I feel more comfortable sticking native devs with the overhead of remembering “utf-16” than web devs who are gonna forget about the library calculateStringSlice fn & just call .slice on the string.

oh interesting, just checked & in swift “string” is “extended grapheme clusters”; in java it’s utf-16 code units. this type of thing is why I’m a bit worried about grapheme clusters for text slices: https://news.ycombinator.com/item?id=21693898
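All three "lengths" under discussion can be compared in one place; a sketch using Intl.Segmenter (available in modern JS runtimes) for the grapheme count:

    const s = "🇺🇸👨‍👩‍👧"; // a flag plus a family emoji: two visible glyphs

    console.log(s.length);                           // 12 -- UTF-16 code units (the Java/JS view)
    console.log([...s].length);                      //  7 -- Unicode codepoints
    console.log(new TextEncoder().encode(s).length); // 26 -- UTF-8 bytes

    // Extended grapheme clusters (the Swift view):
    const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
    console.log([...seg.segment(s)].length);         // 2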

@bnewbold:

oof, yeah, good point. ok, i’m convinced that we should not specify it in graphemes, and should instead say that clients should be careful not to split graphemes when declaring ranges.

some of my concerns would/will also be covered by having nice test vectors and some kind of “protocol test suite” (or even a test instance to point at) to surface this kind of thing early and often.

I do still feel like UTF-8 is the boring, standard, expected thing to specify. IIRC UTF-8 encoding is built in to the JS standard library, and probably any language where UTF-16 is the norm, while in UTF-8 languages it may require hunting down a third-party lib for UTF-16 support. fact check: actually, UTF-16 seems to be at least partially supported in the python, rust, and golang standard libraries. I’m convinced about the grapheme stuff, and UTF-8 vs UTF-16 isn’t a hill I would die on.

@dholms:

Yeah, the grapheme stuff is just a little too flimsy for me. I can be convinced on utf-8 pretty easily. basically my viewpoint is: I agree that utf8 is the “right” choice, but so many web devs are gonna shoot themselves in the foot on this that I kinda just wanna ward it off & say utf-16.

@bnewbold:

will a large fraction of web devs use our lexicon library, and can we add some helper routine to make it easier? I’m almost even tempted to say we should represent the text as a sequence of chunks with optional entity tags on each chunk, to avoid the issue, though it feels like that boat already sailed.
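A hypothetical sketch of that chunked representation (the types are invented here, not a lexicon proposal):

    // Entity ranges become structure instead of indices, so there is
    // nothing to miscount across encodings.
    type EntityTag = { type: "mention" | "link"; value: string };
    type Chunk = { text: string; entity?: EntityTag };

    const post: Chunk[] = [
      { text: "So right now you’re " },
      {
        text: "@gabriel.bsky.social",
        entity: { type: "mention", value: "did:plc:nsbnrfcsymcsgmtbh7g5yjdy" },
      },
      { text: " — you can keep that name, ..." },
    ];

    // Rendering is a simple map; the full text never gets sliced.
    const plainText = post.map((c) => c.text).join("");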

@dholms:

yeah i think that boat has already sailed. honestly it might be illustrative to check out paul’s code & see about wrapping this in a helper method 🤔

one thing to point out is that this is an application layer decision. for stuff that’s directly handled by lex/atproto i feel totally happy using utf8 (max string length, for instance). with something like this, we have to start jamming application layer semantics (rich text) into our lex code.

@bnewbold:

one last thought I have on this: I think it would be basically the one single place we require UTF-16 handling in our entire protocol+application stack, while we do already require UTF-8 handling in a couple places. I don’t want to rabbit hole too deep on this though.

dholms commented 1 year ago

Interested in @pfrazee's thoughts here:

I'm really torn but leaning towards utf8. If we add utf16, we now require support for three different text encodings to support our app.bsky stack 🙃

Utf8 is definitely the "right" choice technically, but I just know for a fact it's gonna lead to so many footguns & mistakes from webdevs who use javascript's built-in string methods, no matter how many helper methods we ship. We can maybe get around it if we ship a more involved "rich text framework" that helps with more than just character counting.
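A hypothetical sketch of that kind of framework surface (names invented here): webdevs compose segments and never touch indices, so the encoding choice stays a library detail:

    class RichTextBuilder {
      private segments: { text: string; entity?: { type: string; value: string } }[] = [];

      text(s: string): this {
        this.segments.push({ text: s });
        return this;
      }

      mention(handle: string, did: string): this {
        this.segments.push({ text: handle, entity: { type: "mention", value: did } });
        return this;
      }

      // The encoding decision lives in one place: swap the length function
      // and every produced record follows.
      build(lengthOf: (s: string) => number) {
        const entities: { index: { start: number; end: number }; type: string; value: string }[] = [];
        let text = "";
        for (const seg of this.segments) {
          const start = lengthOf(text);
          text += seg.text;
          if (seg.entity) {
            entities.push({ index: { start, end: lengthOf(text) }, ...seg.entity });
          }
        }
        return { text, entities };
      }
    }

    // e.g. UTF-8 byte indices:
    const utf8Len = (s: string) => new TextEncoder().encode(s).length;
    const record = new RichTextBuilder()
      .text("So right now you’re ")
      .mention("@gabriel.bsky.social", "did:plc:nsbnrfcsymcsgmtbh7g5yjdy")
      .build(utf8Len);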

dholms commented 1 year ago

The short is:

My preference order is

AaronGoldman commented 1 year ago

https://hsivonen.fi/string-length/

pfrazee commented 1 year ago

Okay here's where I'm at with this:

bnewbold commented 1 year ago

I've had sour experiences with the Unicode delimiter concept in the library world. There was a time when people would use non-printing ASCII characters (maybe it was 0x1E, the record separator, the one you are proposing here?) to split author names for index sorting (eg, sometimes you want to split the "Mc" off a surname for sorting; it's culture/region dependent). These special ASCII characters inevitably leaked through the whole ecosystem and continue to contaminate catalogs and cause problems today. It is just really easy for folks to copy the text around directly without stripping the chars.

pfrazee commented 1 year ago

Yep that makes perfect sense, let's drop that idea.


dholms commented 1 year ago

Closed via https://github.com/bluesky-social/atproto/pull/658