reverse-string: Add unicode tests

senekor commented 8 months ago

BethanyG commented 8 months ago

I like this test case, but feel as though we should also be including cases for scripts/words that do not already have normalized combined forms. This case can be passed in Python without any extra processing, because these two ideographs are a pre-combined form and Python 'sees' them each as a single code point.

猫" --> U+29483 '子' --> U+23376

On the other hand, this string has to be normalized to NFC (forcing the pre-combined forms) or checked for combining marks:

aînée

That being said, I don't have objections to adding the first test case. 😄

senekor commented 8 months ago

@BethanyG I'm not sure I understand your concern. What is a "normalized combined form" and an "ideograph"? The vocabulary around unicode is sometimes confusing to me.

To add some context the test case "子猫" is there to prevent the string from being treated as a byte array, because these characters are multiple bytes in UTF-8. I expect this test case to be easy in any language that has a notion of unicode built-in.

The second test case "Würstchenstand" contains what I usually see referred to as a "grapheme cluster" where two unicode scalar values combine to one human-readable character. Note that ü can be represented as a single scalar value (UTF-8 bytes: [195, 188]) or the regular ascii u followed by a diaeresis ◌̈ (UTF-8 bytes: [117, 204, 136]).

For exmale, a simple solution with [::-1] results in the diaeresis being on the r:

>>> "Würstchenstand"[::-1]
'dnatsnehctsr̈uW'

BethanyG commented 8 months ago

@senekor - apologies, I am probably too far down a personal rabbit hole of Unicode documentation to communicate effectively at the moment! 😆

I used 'ideograph' for lack of a better term. 'Character' felt to loaded to me.

What I meant to convey is that the string contains two 'characters' that are made up of individual strokes, but they have pre-composed equivalents that are represented by single codepoints in Unicode. So in a language like Python (which represents strings as a collection of Unicode codepoints), they behave no differently from plain ASCII -- there are no surprises in reversal, and no conversion or checking needed.

Contrast that with this example in Hangul:

"기운찰만하다"[::-1]
'ᅡ다ᄒᆫᅡᄆᆯᅡᄎᆫᅮ이ᄀ'

However, this can be easily normalized to use the pre-combined Unicode codepoints:

import unicodedata #This is part of Pythons std lib

print(unicodedata.normalize('NFC', "기운찰만하다")[::-1])
다하만찰운기

The second test case "Würstchenstand" contains what I usually see referred to as a "grapheme cluster" where two unicode scalar values combine to one human-readable character. Note that ü can be represented as a single scalar value (UTF-8 bytes: [195, 188]) or the regular ascii u followed by a diaeresis ◌̈ (UTF-8 bytes: [117, 204, 136]).

For exmale, a simple solution with [::-1] results in the diaeresis being on the r:
>>> "Würstchenstand"[::-1]
'dnatsnehctsr̈uW'

Your example can also be normalized into combined codepoints:

import unicodedata

unicodedata.normalize('NFC', "Würstchenstand")[::-1])
dnatsnehctsrüW

However, this word in Thai cannot simply be normalized. This word has to be iterated through and checked for combining marks, and steps then need to be taken to keep the combining marks together as a grapheme cluster:

import unicode data

punicodedata.normalize('NFC', "ผู้เขียนโปรแกรม")[::-1]
มรกแรปโนยีขเู้ผ   #diacritical marks are now off by 1

As does this word in Bangla (which uses zero-width joiners):

import unicode data

unicodedata.normalize('NFC', "শিক্ষা দেওয়া")[::-1]
া়যওেদ াষ্কিশ

So these particular cases would force additional (and probably complex) code .... or the use of Unicode regex or another third-party library that implements a grapheme cluster break algorithm.

...which is all a long-winded way of saying Unicode is complicated, as are graphemes. 😄

senekor commented 8 months ago

@BethanyG thank you, I think I understand it better now :) What is it exaclty you suggest we do? Should I replace Würstchenstand with ผู้เขียนโปรแกรม? That would force people to have logic that cosiders grapheme clusters. (on the Rust track, we encourage the use of a third-party library for this)

BethanyG commented 8 months ago

@senekor Maybe rather than replace, we can add ผู้เขียนโปรแกรม as a third case? That way we've covered at least three major categories:

Single Unicode codepoints that are multi-byte
Unicode codepoints with combining characters that have combined Unicode equivalents
Unicode codepoints that are meant to combine together into extended grapheme clusters without any combined equivalent.

There are more cases (and let us not consider flags or emoji) - but as you have said, to handle everything correctly requires an expert written third-party library. And even then, language-specific knowledge is also required for anything complicated. But I think these cases are enough to prompt students into thinking about the complexities and encouraging them to learn more.

senekor commented 8 months ago

Alright, I added a third case with the Thai word. @BethanyG is the expected version correct?

BethanyG commented 8 months ago

Apologies for the delay, @senekor -- Yes! That is a correct reversal. 😄 Many thanks for your work and your patience.

senekor commented 8 months ago

Thank you! :slightly_smiling_face:

exercism / problem-specifications

reverse-string: Add unicode tests #2367