Porting a convo from Notion:
@bnewbold:
To be clear, you mean for things like mentions and URLs? I’d really go with graphemes. UTF-16 is almost worse than UTF-8 b/c all the euro languages will “just work” but then stuff like emoji and Bengali will be broken. The breakage would stem from applying an href wrapper or formatting changes at codepoint boundaries, but splitting grapheme clusters. I would also question the frontend stuff being UTF-16. Won’t somebody eventually write native iOS and macOS apps in Objective-C or Swift or Dart or whatever? Much of the Android ecosystem is JVM (and thus, possibly, UTF-16), but not all of it.
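A minimal TypeScript sketch of the breakage described above (the emoji and index are invented for illustration): "👩‍👧" is a single grapheme built from three codepoints, so a formatting boundary that is perfectly valid at the codepoint level still tears it into two separate emoji.

```ts
// "👩‍👧" = WOMAN (U+1F469) + ZWJ (U+200D) + GIRL (U+1F467):
// one grapheme, three codepoints, five UTF-16 code units.
const family = "👩‍👧";
console.log(family.length);      // 5 (UTF-16 code units)
console.log([...family].length); // 3 (codepoints)

// Ending an href wrapper after the first codepoint (code unit index 2)
// lands on a "legal" codepoint boundary, yet splits the grapheme:
const html = `<a>${family.slice(0, 2)}</a>${family.slice(2)}`;
// renders as a woman emoji inside the link and a girl emoji after it
```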
@dholms:
yup, for mentions & urls. And to clarify what I mean here (tho I think we’re on the same page): just that text slice indexes will be calculated off of UTF-16 code unit length; at the data layer it’ll still be stored as UTF-8.

“every frontend” was definitely strong wording here, I should say “most frontends”. & I feel more comfortable sticking native devs with the overhead of remembering “UTF-16” than web devs who are gonna forget about the library’s calculateStringSlice fn & just call .slice on the string.

oh interesting, just checked: in Swift “string” is extended grapheme clusters; in Java it’s UTF-16 code units. This type of thing is why I’m a bit worried about grapheme clusters for text slices: https://news.ycombinator.com/item?id=21693898
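To make the three counting schemes above concrete, here is a rough sketch, assuming a runtime with TextEncoder and Intl.Segmenter (Node 16+, modern browsers); the sample string is arbitrary:

```ts
const s = "héllo 👩‍👧";

s.length;                                    // 11 — UTF-16 code units (JS .length, Java)
new TextEncoder().encode(s).length;          // 18 — UTF-8 bytes (the storage layer)
[...new Intl.Segmenter().segment(s)].length; //  7 — extended grapheme clusters (Swift)
```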
@bnewbold:
oof, yeah, good point. ok, I’m convinced that we should not specify it in graphemes, and should instead say that clients should be careful not to split graphemes when declaring ranges.

Some of my concerns would/will also be covered by having nice test vectors and some kind of “protocol test suite” (or even a test instance to point at) to surface this kind of thing early and often.

I do still feel like UTF-8 is the boring, standard, expected thing to specify. IIRC UTF-8 encoding is built in to the JS standard library, and probably any language where UTF-16 is the norm, while in UTF-8 languages it may require hunting down a third-party lib for UTF-16 support. Fact check: UTF-16 actually is at least partially supported in the Python, Rust, and Golang standard libraries.

I’m convinced about the grapheme stuff, and UTF-8 vs UTF-16 isn’t a hill I would die on.
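As a sketch of what “built in to the JS standard library” buys here: converting a JS (UTF-16 code unit) index into a UTF-8 byte index needs nothing beyond TextEncoder. The helper name is hypothetical, not an API from any atproto library.

```ts
// Hypothetical helper: UTF-8 byte offset of a UTF-16 code unit index.
// Caveat: if codeUnitIndex splits a surrogate pair, TextEncoder encodes
// the lone surrogate as U+FFFD (3 bytes), so callers need sane indices.
function utf8ByteIndex(text: string, codeUnitIndex: number): number {
  return new TextEncoder().encode(text.slice(0, codeUnitIndex)).length;
}

utf8ByteIndex("“Smartquotes!”", 1); // 3 — the curly quote is 1 code unit but 3 bytes
```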
@dholms:
Yeah, the grapheme stuff is just a little too flimsy for me. I can be convinced on UTF-8 pretty easily. Basically my viewpoint is: I agree that UTF-8 is the “right” choice, but so many web devs are gonna shoot themselves in the foot on this that I kinda just wanna ward it off & say UTF-16.
@bnewbold:
will a large fraction of web devs use our lexicon library, and can we add some helper routine to make it easier? I’m almost even tempted to say we should represent the text as a sequence of chunks with optional entity tags on each chunk, to avoid the issue, though it feels like that boat has already sailed.
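One possible shape for such a helper routine (purely hypothetical, not the actual lexicon library API): snap caller-supplied indices outward to grapheme boundaries so a careless range can never split a cluster, assuming Intl.Segmenter is available.

```ts
// Hypothetical helper: widen [start, end) (UTF-16 code unit indices)
// to the nearest enclosing grapheme boundaries.
function snapToGraphemes(text: string, start: number, end: number): [number, number] {
  const boundaries = [0];
  for (const { index, segment } of new Intl.Segmenter().segment(text)) {
    boundaries.push(index + segment.length);
  }
  const safeStart = Math.max(...boundaries.filter((b) => b <= start));
  const safeEnd = Math.min(...boundaries.filter((b) => b >= end));
  return [safeStart, safeEnd];
}

snapToGraphemes("👩‍👧 hi", 2, 6); // [0, 6] — start snapped off the mid-emoji index
```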
@dholms:
yeah, I think that boat has already sailed. Honestly it might be illustrative to check out Paul’s code & see about wrapping this in a helper method 🤔

One thing to point out is that this is an application-layer decision. For stuff that’s directly handled by lex/atproto I feel totally happy using UTF-8 (max string length, for instance). With something like this, we have to start jamming application-layer semantics (rich text) into our lex code.
@bnewbold:
One last thought I have on this: I think it would be basically the one single place we require UTF-16 handling in our entire protocol+application stack, while we do already require UTF-8 handling in a couple of places. I don’t want to rabbit-hole too deep on this though.
Interested in @pfrazee's thoughts here:
I'm really torn but leaning towards utf8. If we add utf16, we now require support for three different text encodings to support our app.bsky stack 🙃
UTF-8 is definitely the “right” choice technically, but I just know for a fact it’s gonna lead to so many footguns & mistakes from webdevs that use Javascript’s built-in string methods, no matter how many helper methods we ship. We can maybe get around it if we ship a more involved “rich text framework” that helps with more than just character counting.
The short is: `string.length` / `string.slice`.

My preference order is
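(Illustrating the `string.length` / `string.slice` footgun, with invented text and offsets: a spec’d UTF-8 byte index handed straight to `.slice`, which counts UTF-16 code units.)

```ts
const text = "“hi” example.com";
// The link starts at UTF-8 byte 9 (each curly quote is 3 bytes),
// but at UTF-16 code unit 5 (each curly quote is 1 code unit).
text.slice(9); // "ple.com"     — byte index misused as code unit index
text.slice(5); // "example.com" — the slice the author intended
```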
Okay here's where I'm at with this:
I've had sour experiences with the Unicode delimiter concept in the library world. There was a time when people would use non-printing ASCII characters (maybe it was 0x1E, the record separator, the one you are proposing here?) to split author names for index sorting (eg, sometimes you want to split the "Mc" off a surname for sorting; culture/region dependent). These special ASCII characters inevitably leaked through the whole ecosystem and continue to contaminate catalogs and cause problems today. It is just really easy for folks to directly copy the text around without stripping the chars.
Yep that makes perfect sense, let's drop that idea.
The entity text slice in posts is currently counted Typescript/Javascript style, which means UTF-16 code units. We should consider counting by codepoints or even graphemes. The issue with counting in UTF-8 (or UTF-16, or technically even codepoints) is that when rendering, the string could get sliced illegally, resulting in bugs or mangled output. By "illegally" I mean creating an invalid UTF-8 sub-string, or splitting a multi-codepoint grapheme (eg, a compound emoji) into something new (eg, two separate emoji) at the formatting boundary.
Here is a concrete test case:
This is Rust code using the entity embeds on the post to apply highlighting, but the highlights are off by a couple of characters because of smartquotes: the Typescript/Javascript code that generated the record counted by UTF-16 code units, while the Rust code is counting by UTF-8 bytes.
The record looks like:
The `text` field as UTF-8: `Smartquotes!`
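For reference, the off-by-a-couple arithmetic, as a sketch (assuming the curly quotes are U+201C/U+201D): each smartquote is one UTF-16 code unit but three UTF-8 bytes, so every smartquote ahead of an entity shifts its UTF-8 index by two relative to the UTF-16 index the Typescript code wrote.

```ts
const quote = "“"; // U+201C
quote.length;                           // 1 UTF-16 code unit
new TextEncoder().encode(quote).length; // 3 UTF-8 bytes — drift of 2 per quote
```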