Keats / validator

Simple validation for Rust structs

Feature request: Consider a UTF-16 code units length validator #250

Open LeoniePhiline opened 1 year ago

LeoniePhiline commented 1 year ago

The current validator crate provides built-in validators for various use cases, but it lacks a validator for checking the length of a string based on its UTF-16 code units. This feature request proposes the addition of a UTF-16 code units length validator to the crate.

The motivation behind this request stems from the need to match the behavior of the HTML textarea maxlength attribute, which counts UTF-16 code units. To provide better consistency between frontend and backend validation, it would be useful to have a validator that directly checks the length of a string based on its UTF-16 code units.

The new validator could be used as follows:

use validator::Validate;

#[derive(Debug, Validate)]
struct MyStruct {
    #[validate(utf16_length(min = 1, max = N))]
    field: String,
}

Replace N with the desired UTF-16 code unit count. Use the same N for the HTML textarea's maxlength attribute.

In terms of UTF-16 storage, N would be the maximum byte count divided by 2, as each UTF-16 code unit is 2 bytes long.

This new validator would ensure that the code unit count limits are consistent between the HTML textarea and Rust, despite the different character encodings used, and avoid validation outcomes that disagree between frontend and backend.

LeoniePhiline commented 1 year ago

Fixable by #245

Keats commented 1 year ago

I didn't realise it was used for the textarea... At this point I would rather have the length validator take a param for a mode (utf-16, utf-8, bytes etc.) than add one new validator for each, I think.

LeoniePhiline commented 1 year ago

This sounds quite ergonomic to me.

LeoniePhiline commented 1 year ago

Example inconsistency:

A textarea with minimum length of "20" (UTF-16 code units) would be satisfied by "šŸ–šŸ½šŸ–šŸ½šŸ–šŸ½šŸ–šŸ½šŸ–šŸ½":

šŸ–šŸ½ = Two unicode code points:

Per each "šŸ–šŸ½": Two surrogate pairs, 4 code units. (Therefore 64 bytes.)

const emoji = "šŸ–šŸ½";
console.log(emoji.length); // 4

Sending this data as UTF-8 to a Rust backend and validating for the same length of 20 causes validation to fail.

For the Rust UTF-8 based validation to succeed, the emojis need to be doubled to a count of 10 to satisfy the validator: "šŸ–šŸ½šŸ–šŸ½šŸ–šŸ½šŸ–šŸ½šŸ–šŸ½šŸ–šŸ½šŸ–šŸ½šŸ–šŸ½šŸ–šŸ½šŸ–šŸ½".

UTF-8 uses a varying number (1-4) of one-byte code units, depending on the encoded code point's value. Emoji outside the Basic Multilingual Plane, such as these, use 4 UTF-8 code units (4 bytes) per code point.

Therefore, in UTF-8, each "šŸ–šŸ½" uses 8 code units for its two code points. (Therefore 8 bytes.)

As it turns out, the current length validator (using https://doc.rust-lang.org/std/str/struct.Chars.html) seems to count code points (not units), requiring 10 "šŸ–šŸ½", which is 10 x (šŸ– + šŸ½) = 20 code points, to satisfy a minimum length of 20.

Implementation: https://doc.rust-lang.org/src/core/str/count.rs.html

If it were counting UTF-8 code units (i.e. bytes), not code points (as I would have expected, TBH), then the length of "šŸ–šŸ½šŸ–šŸ½šŸ–šŸ½šŸ–šŸ½šŸ–šŸ½šŸ–šŸ½šŸ–šŸ½šŸ–šŸ½šŸ–šŸ½šŸ–šŸ½" would be 80.

Of course it also does not count graphemes; otherwise, 20 "šŸ–šŸ½" would be required.

I had assumed both std::str::Chars (UTF-8) and DOMString (UTF-16) counted code units. As it turns out, DOMString counts UTF-16 code units, while std::str::Chars counts Unicode code points.
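
To make the difference concrete, here is a minimal, runnable Rust sketch; the assertions encode the numbers from the example above:

fn main() {
    // One emoji sequence: U+1F590 followed by U+1F3FD - two code points.
    let emoji = "šŸ–šŸ½";

    // std::str::Chars iterates Unicode code points.
    assert_eq!(emoji.chars().count(), 2);

    // encode_utf16() yields UTF-16 code units; both code points lie
    // outside the BMP, so each takes a surrogate pair (2 units).
    assert_eq!(emoji.encode_utf16().count(), 4);

    // str::len() is the UTF-8 byte count: 4 bytes per code point here.
    assert_eq!(emoji.len(), 8);

    // Five repetitions satisfy a textarea minlength of 20 (code units),
    // but only amount to 10 code points for the current length validator.
    let five = emoji.repeat(5);
    assert_eq!(five.encode_utf16().count(), 20);
    assert_eq!(five.chars().count(), 10);
}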

Implementation

Nevertheless, https://doc.rust-lang.org/std/primitive.str.html#method.encode_utf16 "returns an iterator of u16 over the string encoded as UTF-16." These u16 are obviously code units, which does match the DOMString counting behavior.

The correct implementation for enforcing string lengths consistent with HTML and JavaScript therefore appears to be value.encode_utf16().count(). (PS: As used in https://github.com/Keats/validator/pull/245)
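
Until a built-in validator exists, this can be approximated with the crate's existing custom validator support. A minimal sketch, assuming the 0.16-style custom = "…" attribute; the function name and the 1..=500 limits are made up for illustration:

use validator::{Validate, ValidationError};

#[derive(Debug, Validate)]
struct MyStruct {
    #[validate(custom = "validate_utf16_length")]
    field: String,
}

// Counts UTF-16 code units, matching DOMString / textarea maxlength.
fn validate_utf16_length(value: &str) -> Result<(), ValidationError> {
    let units = value.encode_utf16().count();
    if !(1..=500).contains(&units) {
        return Err(ValidationError::new("utf16_length"));
    }
    Ok(())
}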

Naming

Given that Rust - with glorious superiority - counts chars as code points, length in a validator's name should primarily be read as referring to code points. The UTF-16 variant, which does not (and should not) count code points, should therefore not simply be called length_utf16.

Instead, I would propose dom_string_length, referring to http://devdoc.net/web/developer.mozilla.org/en-US/docs/En/DOM/DOMString.html and http://devdoc.net/web/developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length.html . Otherwise, dom_string_code_units would fit.

LeoniePhiline commented 1 year ago

At this point I would rather have the length validator take a param for a mode (utf-16, utf-8, bytes etc.) than add one new validator for each, I think.

The length calculations differ in more than just the UTF variant:

The length validator counts Unicode code points, while value.encode_utf16().count() (and the HTML form validation) count code units.

A "param for a mode: utf-16, utf-8, bytes etc" would need to distinguish between code units and code points. I.e. you would need modes

Not sure if most of them have any common case for usage.

Therefore:

It might be more straightforward to add a specific validator for validating the length of HTML form input using UTF-16 code units.

Keats commented 1 year ago

It might be more straightforward to add a specific validator for validating the length of HTML form input using UTF-16 code units.

I'm not sure. Having stuff like unicode, utf-16, bytes covers 99.9% of what people need, and you can explain in the documentation what each of them actually counts. Duplicating the validator code that is going to be the same except in one place (the actual impl of the validator) feels bad.
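
For what it's worth, the mode dispatch itself would be small; a hypothetical sketch (names invented here, graphemes left out since they would need an extra crate such as unicode-segmentation):

// Hypothetical modes for a single, parameterized length validator.
enum LengthMode {
    CodePoints,     // current behavior: chars().count()
    Utf8Bytes,      // str::len()
    Utf16CodeUnits, // DOMString / textarea maxlength semantics
}

fn length(value: &str, mode: LengthMode) -> usize {
    match mode {
        LengthMode::CodePoints => value.chars().count(),
        LengthMode::Utf8Bytes => value.len(),
        LengthMode::Utf16CodeUnits => value.encode_utf16().count(),
    }
}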

LeoniePhiline commented 7 months ago

Watchers of this issue might like to learn that the garde validator supports this feature in its recently released version 0.18.

https://github.com/jprochazk/garde/pull/88