`@length` on strings should be Unicode-aware

kubukoz commented 8 months ago

I'll begin by saying that I don't have a dire need for this and it's not causing me any pain, but it's worth remembering about :)

According to the smithy spec, @length on strings should be applied to

The number of Unicode scalar values

whereas we currently just take .length of the string, which can be more than the length in codepoints (e.g. the 💀 emoji is 2 chars but only 1 codepoint).

I believe this should be fixed for maximum correctness: we should use the string.toCodePoints() method and other codepoint-related APIs (available in Java 1.8+, so it's fair game).

Worth noting: the Unicode spec says

Unicode Scalar Value. Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF16 and E00016 to 10FFFF16 inclusive. (See definition D76 in Section 3.9, Unicode Encoding Forms.)

so some extra testing will be necessary, to ensure we don't count these high-surrogate and low-surrogate code points. I'm also new to these codepoint-related APIs so I don't know how tough this will be.

kubukoz commented 8 months ago

As this ticket has very little to do with smithy and smithy4s internals, and mostly requires writing a String => Int function, I think it's friendly to newcomers to the project.

kubukoz commented 8 months ago

here's what ChatGPT says:

val codePoints = input.codePoints()
  .filter(codePoint => codePoint < 0xD800 || codePoint > 0xDFFF)

val scalarValueCount = codePoints.count()

and some test cases from it too:

    Empty String:
    Input: "" (an empty string)
    Expected Output: 0 (no Unicode scalar values)

    String with Only BMP Characters:
    Input: "Hello" (a string with only characters within the BMP)
    Expected Output: 5 (5 Unicode scalar values, one for each character)

    String with High Surrogate Only:
    Input: "\uD83D" (a string with a high surrogate code point)
    Expected Output: 0 (no Unicode scalar values as it's an incomplete surrogate pair)

    String with Low Surrogate Only:
    Input: "\uDC34" (a string with a low surrogate code point)
    Expected Output: 0 (no Unicode scalar values as it's an incomplete surrogate pair)

    String with High and Low Surrogates:
    Input: "\uD83D\uDC68" (a string with both high and low surrogate code points forming a surrogate pair)
    Expected Output: 1 (1 Unicode scalar value representing the pair)

    String with Only Non-BMP Characters:
    Input: "\uD83D\uDE00\uD83D\uDE01\uD83D\uDE02" (a string with characters outside the BMP)
    Expected Output: 3 (3 Unicode scalar values representing each character)

    String with a Mix of BMP and Non-BMP Characters:
    Input: "Hello, 世界" (a string with a mix of BMP and non-BMP characters)
    Expected Output: 7 (7 Unicode scalar values in total)

    String with Random Characters:
    Input: "AΩ\uD835\uDC00@#123" (a string with a mix of characters, including BMP, non-BMP, and special characters)
    Expected Output: 8 (8 Unicode scalar values)

yisraelU commented 8 months ago

@kubukoz lol ChatGPT bot to open PR's coming soon

yisraelU commented 8 months ago

@kubukoz Interesting that they specify this because, Smithy doesn't allow Surrogates at all because the String in the Simple Model is a UTF-8 encoded String, which is all Codepoints. The issue is really with using a JVM String which is UTF-16.

That said, I am a bit unclear because your example and code seem to differ with regard to counting surrogate pairs at all vs filtering them out. I think one would need to convert the UTF-16 string to UTF-8 and then count codepoints

disneystreaming / smithy4s

`@length` on strings should be Unicode-aware #1260