colinhacks / zod

TypeScript-first schema validation with static type inference
https://zod.dev
MIT License
34.17k stars 1.2k forks

`z.string().max()` (and `min()`) don't count unicode characters #3355

Open jrandolf opened 8 months ago

jrandolf commented 8 months ago

This is more of a feature request than a bug. Currently `string().max()` (and other length validators) count the UTF-16 length of a string rather than the number of Unicode characters. The latter can be calculated with `[...value].length`, where `value` is the string in question, so a custom transformer works today, but built-in support would probably benefit most users of this library, since the length people usually care about in validation is the number of Unicode characters (also called Unicode scalar values), not the UTF-16 length.

We could either make this an option, e.g. `.max(5, { char: true })`, or use a separate method, e.g. `.charMax(5)`.
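A minimal sketch of the proposed counting (the helper name `countScalars` is illustrative, not part of Zod's API):

```javascript
// Count Unicode scalar values (code points) instead of UTF-16 code units.
// Spreading a string yields one element per code point, so astral-plane
// characters such as emoji count as 1 instead of 2.
function countScalars(value) {
  return [...value].length;
}

console.log("😀".length);        // 2 (a surrogate pair: two UTF-16 code units)
console.log(countScalars("😀")); // 1 (one Unicode scalar)
```

In today's Zod this could be approximated with `.refine((v) => countScalars(v) <= max)`.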

maurer2 commented 8 months ago

Hello, I think this would be a good idea. The `[...value].length` approach works well for simple Unicode characters, but it doesn't handle more complex emoji correctly, for example 🧑‍🍼, because of the skin-tone modifier characters and ZWJ (zero-width joiner) gender sequences that were added to Unicode later. See here: https://dev.to/ayc0/intlsegmenter-dont-use-stringsplit-nor-stringlength-dh6

const stringLength = [...'🧑‍🍼'].length;
// stringLength = 3

One can, however, use the `Intl.Segmenter` API for this, e.g.

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const stringLength = [...segmenter.segment('🧑‍🍼')].length;
// stringLength = 1

Unfortunately `Intl.Segmenter` isn't supported in Firefox yet, but it seems to be coming to FF soon. 🙌 Once it is available in all browsers, it would probably be a good idea to add it as an option as described above.
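Putting the three measurements side by side for the emoji above makes the differences concrete (`Intl.Segmenter` is available in modern browsers and in Node.js 16+):

```javascript
const family = "🧑‍🍼"; // U+1F9D1, U+200D (ZWJ), U+1F37C

// UTF-16 code units: two surrogate pairs plus the ZWJ.
console.log(family.length); // 5

// Unicode code points: spreading iterates one element per code point.
console.log([...family].length); // 3

// Grapheme clusters: what a user perceives as one character.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
console.log([...segmenter.segment(family)].length); // 1
```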

jrandolf commented 8 months ago

@maurer2 Unicode graphemes are too coarse for this issue. For most applications, the goal is to determine the number of Unicode scalars, because Unicode scalars are the smallest valid Unicode unit type and are understood without any context. Graphemes and other higher-level constructs are just groups of Unicode scalars that create a representation suitable for a specific application.

This doesn't imply measuring graphemes is not useful. In the context of UIs it certainly is, but we should fix the simplest case, which is the one defined in this issue.

(It should also be noted that many RFCs are formulated around Unicode scalars, e.g. https://datatracker.ietf.org/doc/html/rfc6532#section-3.4. Also, see definition 3 of https://www.unicode.org/glossary/#character)

ultrox commented 6 months ago

Hey @jun-sheaf, thanks for opening this discussion. However, I'm potentially blocked because of this issue: I would like to introduce Zod to our code base, and even though it passed internal scanning, a person I work with flagged this issue and, I quote, called it "particularly nasty".

I'm currently doing research, but based on my understanding, the reported length of an input is only affected when special characters take more than one UTF-16 code unit, like emoji for example. And if the need arises, there is a way to write a custom length checker that counts characters. That being said, I have quite literally never had any problems with lengths, working only in the Western-dominated part of the world.

[...'🐟'].length // 1

"😀".length // 2

I got worried for a sec because I thought umlauts were encoded with more than one code unit, but checking it out, precomposed umlauts are a single code unit. I would really appreciate your view on this one.


jrandolf commented 6 months ago

@ultrox An umlaut can also be written as a separate combining character (https://en.wikipedia.org/wiki/Umlaut_(diacritic)), which the browser then renders combined with the preceding base character, so you are not necessarily safe. It becomes one code point only once you perform Unicode normalization.
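To illustrate the point above: the precomposed and decomposed forms of "ö" render identically but report different lengths until normalized:

```javascript
const precomposed = "\u00F6"; // "ö" as a single code point
const decomposed = "o\u0308"; // "o" followed by COMBINING DIAERESIS

console.log(precomposed.length);                 // 1
console.log(decomposed.length);                  // 2
console.log(decomposed.normalize("NFC").length); // 1

// After NFC normalization the two forms compare equal.
console.log(precomposed === decomposed.normalize("NFC")); // true
```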

AlisCode commented 4 months ago

Hello! Yes, this is indeed a problem. The documentation here is misleading:

z.string().min(5, { message: "Must be 5 or more characters long" });
z.string().max(5, { message: "Must be 5 or fewer characters long" });
z.string().length(5, { message: "Must be exactly 5 characters long" });

Note that characters are explicitly mentioned, and I think the sane default everyone expects is that `max(5)` means "5 characters maximum". This means I expect the following test to pass:

const schema = z.string().max(5);
expect(schema.safeParse("abcde").success).toEqual(true);
expect(schema.safeParse("😀😀😀😀😀").success).toEqual(true);

Currently, this is not the case.

What is the next step here? Can I open a PR to fix it? I believe changing this behaviour would be a breaking change.
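For reference, the second assertion fails because `max()` compares against the UTF-16 length:

```javascript
const plain = "abcde";
const emoji = "😀😀😀😀😀";

console.log(plain.length);      // 5: passes max(5)
console.log(emoji.length);      // 10: fails max(5) today, each 😀 is 2 code units
console.log([...emoji].length); // 5: would pass a code-point-based max
```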

subvertallchris commented 2 months ago

Hello, we encountered this today as well. Has anyone found a good workaround?

AlisCode commented 2 months ago

For reference,

I think at least the possibility to specify the encoding would be good, but I don't really think it would solve the confusion.

tats-u commented 1 month ago

We need all of `max()` (UTF-16 code units), `maxCodePoints()` (code points / UTF-32), `maxBytes()` (UTF-8 bytes), and `maxGraphemes()` (grapheme clusters). Valibot has all except for UTF-32.

password: z.string().minGraphemes(8).maxBytes(72),
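None of these methods exist in Zod yet; the four measurements they would need can be sketched as plain helpers (names illustrative) that a `.refine()` could wrap today:

```javascript
// UTF-16 code units: what String.prototype.length already reports.
const utf16Length = (s) => s.length;

// Unicode code points (UTF-32 units).
const codePointLength = (s) => [...s].length;

// UTF-8 bytes, relevant e.g. for bcrypt's 72-byte password limit.
const byteLength = (s) => new TextEncoder().encode(s).length;

// Grapheme clusters: user-perceived characters.
const graphemeLength = (s) => {
  const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
  return [...seg.segment(s)].length;
};

// One string, four different "lengths":
const pwd = "pässword🧑‍🍼";
console.log(utf16Length(pwd), codePointLength(pwd), byteLength(pwd), graphemeLength(pwd));
```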