bluesky-social / atproto

Social networking technology created by Bluesky

Add fast path skipping UTF8 length counting #2819

Open gaearon opened 2 months ago

gaearon commented 2 months ago

Stacked on https://github.com/bluesky-social/atproto/pull/2817

What

Similar to https://github.com/bluesky-social/atproto/pull/2817, I'm trying to avoid calling into new TextEncoder().encode(str).byteLength for every string. After this change, I basically don't hit it in the app at all: the fast path always lets me out early.

The fast path itself is pretty general. The idea is that .length counts UTF-16 code units, and each UTF-16 code unit corresponds to at most 3 bytes in UTF-8 encoding. So we can safely use value.length * 3 as an upper bound on what utf8Len(value) could possibly be. If this upper bound is below the minLength, the same is true for utf8Len. If this upper bound is within maxLength, the same is true for utf8Len.
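A minimal sketch of the idea, not the actual diff in this PR. The names checkUtf8Length and slowPathCalls are hypothetical and exist only for illustration; utf8Len here is a plain TextEncoder wrapper standing in for the real helper.

```typescript
// Hypothetical sketch; names and structure are assumptions, not the atproto code.
const encoder = new TextEncoder()

// Counts how often we fall back to real UTF-8 encoding (illustration only).
let slowPathCalls = 0

function utf8Len(value: string): number {
  slowPathCalls++
  return encoder.encode(value).byteLength
}

function checkUtf8Length(
  value: string,
  minLength?: number,
  maxLength?: number,
): boolean {
  // Each UTF-16 code unit encodes to at most 3 UTF-8 bytes.
  const upperBound = value.length * 3

  // Fast path: even the upper bound is below minLength, so utf8Len is too.
  if (minLength !== undefined && upperBound < minLength) return false

  // Fast path: the upper bound fits within maxLength, so utf8Len does too.
  const maxSatisfied = maxLength === undefined || upperBound <= maxLength
  if (maxSatisfied && minLength === undefined) return true

  // Slow path: the bound is inconclusive, so measure the real byte length.
  const len = utf8Len(value)
  if (minLength !== undefined && len < minLength) return false
  if (maxLength !== undefined && len > maxLength) return false
  return true
}
```

In this sketch, a short string checked against a generous maxLength never reaches the encoder, which is the case the fast path targets.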

Why * 3?

So .length * 3 should always give us a valid upper bound. But this needs a look from an expert.
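As far as I can tell, the reasoning is this: a code point in the Basic Multilingual Plane is one UTF-16 code unit and 1 to 3 UTF-8 bytes, while a code point above U+FFFF is two code units (a surrogate pair) and 4 bytes, which is only 2 bytes per code unit. A lone surrogate is replaced by U+FFFD when encoded with TextEncoder, which is again 3 bytes. The snippet below is a small check of those cases; utf8Bytes and withinBound are hypothetical helper names.

```typescript
// Hypothetical helpers for illustration; not part of the atproto codebase.
const enc = new TextEncoder()

function utf8Bytes(s: string): number {
  return enc.encode(s).byteLength
}

// True when the real UTF-8 byte length does not exceed .length * 3.
function withinBound(s: string): boolean {
  return utf8Bytes(s) <= s.length * 3
}

const samples: Array<[string, string]> = [
  ["ASCII 'a' (U+0061)", "a"],
  ["Latin 'e' with acute (U+00E9)", "\u00e9"],
  ["Euro sign (U+20AC)", "\u20ac"],
  ["Emoji U+1F600 as a surrogate pair", "\uD83D\uDE00"],
  ["Lone high surrogate (becomes U+FFFD)", "\uD800"],
]

for (const [label, s] of samples) {
  console.log(label, "units:", s.length, "bytes:", utf8Bytes(s), "ok:", withinBound(s))
}
```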

I've added some test cases.

bnewbold commented 2 months ago

this seems reasonable, though I should probably re-read more carefully and maybe cook up more corner-cases. I kind of suspect that it won't be as much of a win as the earlier grapheme cluster and utf8 caching patch though? I guess UTF-16 to UTF-8 does cost something though, and this probably does help with the happy path, and we do a lot of these, hrm.