Open kubukoz opened 8 months ago
As this ticket has very little to do with smithy and smithy4s internals, and mostly requires writing a String => Int
function, I think it's friendly to newcomers to the project.
here's what ChatGPT says:
val codePoints = input.codePoints()
.filter(codePoint => codePoint < 0xD800 || codePoint > 0xDFFF)
val scalarValueCount = codePoints.count()
and some test cases from it too:
Empty String:
Input: "" (an empty string)
Expected Output: 0 (no Unicode scalar values)
String with Only BMP Characters:
Input: "Hello" (a string with only characters within the BMP)
Expected Output: 5 (5 Unicode scalar values, one for each character)
String with High Surrogate Only:
Input: "\uD83D" (a string with a high surrogate code point)
Expected Output: 0 (no Unicode scalar values as it's an incomplete surrogate pair)
String with Low Surrogate Only:
Input: "\uDC34" (a string with a low surrogate code point)
Expected Output: 0 (no Unicode scalar values as it's an incomplete surrogate pair)
String with High and Low Surrogates:
Input: "\uD83D\uDC68" (a string with both high and low surrogate code points forming a surrogate pair)
Expected Output: 1 (1 Unicode scalar value representing the pair)
String with Only Non-BMP Characters:
Input: "\uD83D\uDE00\uD83D\uDE01\uD83D\uDE02" (a string with characters outside the BMP)
Expected Output: 3 (3 Unicode scalar values representing each character)
String with a Mix of ASCII and Non-ASCII BMP Characters:
Input: "Hello, 世界" (a string mixing ASCII and CJK characters, all of which are within the BMP)
Expected Output: 9 (9 Unicode scalar values in total, one for each character)
String with Random Characters:
Input: "AΩ\uD835\uDC00@#123" (a string with a mix of characters, including BMP, non-BMP, and special characters)
Expected Output: 8 (8 Unicode scalar values)
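For anyone picking this up, here is a minimal sketch that wraps the snippet above into a `String => Int` function and checks it against some of the listed cases. The names `ScalarValues` and `scalarValueCount` are made up for illustration and are not part of smithy4s.

```scala
object ScalarValues {
  // Hypothetical helper, not a smithy4s API.
  // codePoints() merges each valid surrogate pair into one supplementary
  // code point; unpaired surrogates come through unchanged and are then
  // dropped by the filter, leaving only Unicode scalar values.
  def scalarValueCount(input: String): Int =
    input
      .codePoints()
      .filter(cp => cp < 0xD800 || cp > 0xDFFF)
      .count()
      .toInt

  def main(args: Array[String]): Unit = {
    assert(scalarValueCount("") == 0)
    assert(scalarValueCount("Hello") == 5)
    assert(scalarValueCount("\uD83D") == 0)        // lone high surrogate
    assert(scalarValueCount("\uDC34") == 0)        // lone low surrogate
    assert(scalarValueCount("\uD83D\uDC68") == 1)  // one surrogate pair
    assert(scalarValueCount("\uD83D\uDE00\uD83D\uDE01\uD83D\uDE02") == 3)
    assert(scalarValueCount("AΩ\uD835\uDC00@#123") == 8)
    println("all cases pass")
  }
}
```

Note that `count()` returns a `Long`, so the `.toInt` is only there to match the `String => Int` shape mentioned in the ticket.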
@kubukoz lol ChatGPT bot to open PRs coming soon
@kubukoz Interesting that they specify this, because Smithy doesn't allow surrogates at all: the String in the Simple Model is a UTF-8 encoded String, which is all codepoints. The issue is really with using a JVM String, which is UTF-16.
That said, I am a bit unclear, because your example and code seem to differ on whether surrogate pairs get counted at all or filtered out. I think one would need to convert the UTF-16 string to UTF-8 and then count codepoints.
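A rough sketch of the convert-to-UTF-8-then-count idea, to show how it behaves on the JVM. `Utf8Count` and `countViaUtf8` are hypothetical names. The thing to watch is that `String.getBytes` with a charset replaces an unpaired surrogate with the charset's replacement byte (`?` for UTF-8), so a lone surrogate ends up counted as 1 rather than 0.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object Utf8Count {
  // Hypothetical helper: encode to UTF-8, then count every byte that is not
  // a continuation byte (continuation bytes have the bit pattern 10xxxxxx).
  // Each encoded code point contributes exactly one non-continuation byte.
  def countViaUtf8(input: String): Int =
    input.getBytes(UTF_8).count(b => (b & 0xC0) != 0x80)

  def main(args: Array[String]): Unit = {
    println(countViaUtf8("Hello"))        // 5
    println(countViaUtf8("\uD83D\uDC68")) // 1: the pair encodes as one 4-byte sequence
    println(countViaUtf8("\uD83D"))       // 1: the lone surrogate was replaced by '?'
  }
}
```

So this approach agrees with the codepoint-based count for well-formed strings, but differs from the filtering snippet for strings that contain unpaired surrogates.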
I'll begin by saying that I don't have a dire need for this and it's not causing me any pain, but it's worth remembering about :)
According to the Smithy spec, @length on strings should be applied to the number of Unicode scalar values, whereas we currently just take .length of the string, which can be more than the length in codepoints (e.g. the 💀 emoji is 2 chars but only 1 codepoint).
I believe this should be fixed for maximum correctness: we should use the string.codePoints() method and other codepoint-related APIs (available in Java 1.8+, so it's fair game).
Worth noting: the Unicode spec says a Unicode scalar value is any code point except high-surrogate and low-surrogate code points, so some extra testing will be necessary, to ensure we don't count these high-surrogate and low-surrogate code points. I'm also new to these codepoint-related APIs so I don't know how tough this will be.
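To make the mismatch concrete, here is a small sketch comparing `.length` with a codepoint count for the 💀 example. `LengthDemo` and `codePointLength` are illustrative names only, not existing smithy4s code.

```scala
object LengthDemo {
  // Hypothetical helper: number of code points in the string.
  // Note that codePointCount counts an unpaired surrogate as one code point,
  // so this alone is not yet a scalar value count.
  def codePointLength(s: String): Int = s.codePointCount(0, s.length)

  def main(args: Array[String]): Unit = {
    val skull = "\uD83D\uDC80" // 💀 (U+1F480) as a UTF-16 surrogate pair
    println(skull.length)            // 2 (UTF-16 chars)
    println(codePointLength(skull))  // 1 (code point)
    println(codePointLength("\uD83D")) // 1: a lone surrogate still counts here
  }
}
```

With `.length`, a `@length(max: 1)` constraint would reject a single 💀, while the codepoint-based count would accept it.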