fabian-hiller / valibot

The modular and type safe schema library for validating structural data 🤖
https://valibot.dev
MIT License

maxCodePoints / minCodePoints (UTF-32 code points) #875

Open tats-u opened 1 month ago

tats-u commented 1 month ago

In some RDBs, the length limit of VARCHAR is counted in Unicode code points (UTF-32 code units). maxLength counts UTF-16 code units, so an emoji or a kanji outside the Basic Multilingual Plane counts as two.
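As a small illustration of that mismatch (a sketch only; the `codePoints` helper is a hypothetical name, not a valibot API), plain JavaScript string length can be compared with a code point count:

```typescript
// String.prototype.length counts UTF-16 code units, not code points.
const emoji = "\u{1F600}"; // U+1F600, an emoji outside the BMP
const kanji = "\u{20BB7}"; // U+20BB7, a kanji outside the BMP
const bmpKanji = "\u5409"; // U+5409, a kanji inside the BMP

// Hypothetical helper: Array.from iterates a string by code point.
const codePoints = (s: string): number => Array.from(s).length;

console.log(emoji.length, codePoints(emoji)); // 2 1
console.log(kanji.length, codePoints(kanji)); // 2 1
console.log(bmpKanji.length, codePoints(bmpKanji)); // 1 1
```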

Password requirements by NIST:

https://pages.nist.gov/800-63-3/sp800-63b.html

Unicode [ISO/ISC 10646] characters SHOULD be accepted as well. To make allowances for likely mistyping, verifiers MAY replace multiple consecutive space characters with a single space character prior to verification, provided that the result is at least 8 characters in length. Truncation of the secret SHALL NOT be performed. For purposes of the above length requirements, each Unicode code point SHALL be counted as a single character.

This requires that we count an emoji (other than compound ones) or any other 4-byte character as one character in a password.

fabian-hiller commented 1 month ago

You can use our new grapheme actions to count emojis that we added in v1.0.0-beta.1: https://github.com/fabian-hiller/valibot/releases/tag/v1.0.0-beta.1

tats-u commented 1 month ago

@fabian-hiller

new grapheme actions

The number of code points (and therefore of UTF-16 code units) in a single grapheme is unbounded, so you should combine maxGraphemes with the proposed maxCodePoints or with maxLength.
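To make the point concrete, here is a rough sketch with two strings that each form a single grapheme cluster under the Unicode segmentation rules (a base letter followed only by combining marks, and a ZWJ emoji sequence). The variable names are illustrative only.

```typescript
// One base letter followed by 50 combining acute accents: a single grapheme.
const stacked = "e" + "\u0301".repeat(50);

// A family emoji: four emoji joined by three zero-width joiners (U+200D).
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

// Array.from iterates by code point; .length counts UTF-16 code units.
console.log(Array.from(stacked).length, stacked.length); // 51 51
console.log(Array.from(family).length, family.length); // 7 11
```

A limit on graphemes alone would therefore accept a string of arbitrary storage size, which is why a second limit on code points or code units is useful.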

https://stackoverflow.com/questions/71011343/maximum-number-of-codepoints-in-a-grapheme-cluster

tats-u commented 1 month ago

If you write your backend in Go or Rust, counting length in code points (UTF-32) is more common than counting UTF-16 code units. (utf8.RuneCountInString(str) or str.chars().count())

fabian-hiller commented 1 month ago

Thank you for your detailed feedback! How would you implement such an action? We also have byte actions like maxBytes, but I'm not sure if this is what you are looking for.

tats-u commented 1 month ago

We can implement it based on the existing maxLength. At each position, compare the result of codePointAt with 0x10000; if it is 0x10000 or greater, the character occupies two UTF-16 code units, so move the cursor forward by one more position.

You can combine maxBytes with the others too. For example, a password can't be longer than 72 bytes if you hash it with bcrypt. That limit is compatible with maxCodePoints or maxLength.
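A minimal sketch of why a byte limit is a separate constraint from a character limit. It assumes the password is encoded as UTF-8 before hashing; `utf8ByteLength`, `BCRYPT_MAX_BYTES`, and `fitsBcrypt` are hypothetical names, not valibot or bcrypt APIs.

```typescript
// Hypothetical helper: UTF-8 byte length of a string, derived from code points.
function utf8ByteLength(input: string): number {
  let bytes = 0;
  for (let i = 0; i < input.length; i++) {
    // i is always in range here, so codePointAt never returns undefined.
    const cp = input.codePointAt(i) as number;
    if (cp < 0x80) bytes += 1;
    else if (cp < 0x800) bytes += 2;
    else if (cp < 0x10000) bytes += 3;
    else {
      bytes += 4;
      i++; // the code point used two UTF-16 code units, skip the second
    }
  }
  return bytes;
}

// The 72-byte limit mentioned above, as an illustrative constant.
const BCRYPT_MAX_BYTES = 72;
const fitsBcrypt = (password: string): boolean =>
  utf8ByteLength(password) <= BCRYPT_MAX_BYTES;

console.log(fitsBcrypt("\u{1F600}".repeat(18))); // true (18 code points, 72 bytes)
console.log(fitsBcrypt("\u{1F600}".repeat(19))); // false (19 code points, 76 bytes)
```

A 19-emoji password is short by any character count yet already exceeds the byte limit, so the two checks do not replace each other.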

fabian-hiller commented 1 month ago

Can you provide a code example for the if-statement to check the maximum code points?

tats-u commented 1 month ago

Do you mean this?

https://github.com/fabian-hiller/valibot/blob/e3e87366158c4d2a3aaebe65056372c416836d1b/library/src/utils/_getCodePointCount/_getCodePointCountCount.ts#L14-L20
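The linked lines are not reproduced here. As a rough sketch of the approach described earlier in the thread (compare codePointAt with 0x10000 and skip one extra position when needed), something like the following could work. `countCodePoints` and `exceedsMaxCodePoints` are hypothetical names chosen for illustration and are not taken from the valibot source.

```typescript
// Hypothetical helper: counts Unicode code points in a string.
function countCodePoints(input: string): number {
  let count = 0;
  for (let index = 0; index < input.length; index++) {
    // index is always in range here, so codePointAt never returns undefined.
    const codePoint = input.codePointAt(index) as number;
    // Code points of 0x10000 and above occupy two UTF-16 code units,
    // so move the cursor forward by one more position.
    if (codePoint >= 0x10000) index++;
    count++;
  }
  return count;
}

// Hypothetical check a maxCodePoints-style action might perform.
function exceedsMaxCodePoints(input: string, requirement: number): boolean {
  return countCodePoints(input) > requirement;
}

console.log(countCodePoints("a\u{1F600}b")); // 3 (while .length is 4)
console.log(exceedsMaxCodePoints("\u{1F600}\u{1F600}", 2)); // false
```

In this sketch an unpaired surrogate is counted as one code point, which matches what codePointAt reports for it.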