Micro-blog instructions should explain graphemes and possibly test extended graphemes

ageron commented 1 month ago

The instructions of the micro-blog exercise say:

The trick to this exercise is to use APIs designed around Unicode characters (codepoints) instead of Unicode codeunits.

I understand that we want to keep things simple, but I think this is misleading. For example, in the Roc track the instructions led some people to split the string into codepoints, even though there is a very simple function that splits the string into graphemes instead. The tests pass in both cases because they only include graphemes composed of a single codepoint. Codepoint-based solutions would fail if the tests included flags, characters with multiple diacritics, complex emojis, or any other grapheme composed of multiple codepoints (i.e., an extended grapheme cluster).
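To make the failure mode concrete, here is a minimal sketch in Python (not the Roc track's code). The helper name `truncate_code_points` and the 5-character limit are assumptions for illustration only. It truncates by codepoints, which is roughly what a codepoint-based solution would do, and shows how that can cut a flag emoji in half:

```python
def truncate_code_points(text: str, limit: int = 5) -> str:
    """Hypothetical helper: keep the first `limit` codepoints.

    Python's str is indexed by codepoint, so slicing counts codepoints,
    not graphemes.
    """
    return text[:limit]


def utf16_code_units(text: str) -> int:
    """Count UTF-16 code units (2 bytes each in UTF-16-LE)."""
    return len(text.encode("utf-16-le")) // 2


# One grapheme (a flag) made of two regional-indicator codepoints: F + R.
flag = "\U0001F1EB\U0001F1F7"
# One grapheme made of a base letter plus a combining acute accent.
accented = "e\u0301"

flag_code_points = len(flag)              # 2 codepoints, 1 visible flag
flag_code_units = utf16_code_units(flag)  # 4 UTF-16 code units
accented_code_points = len(accented)      # 2 codepoints, 1 visible letter

# Three flags are six codepoints; keeping five splits the third flag
# and leaves a lone regional indicator at the end.
truncated = truncate_code_points(flag * 3)
```

A grapheme-aware solution would instead keep whole flags; the Python standard library has no grapheme segmentation, so that part is left out here rather than guessed at.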

In short: we shouldn't encourage people to work with codepoints when they can just as easily work with graphemes.

I suggest at least updating the instructions to cover graphemes, and also including some tests with extended grapheme clusters. If we're going to handle Unicode, we should try to handle all possible characters. Handling graphemes might be harder in some languages, but in that case those tracks can just disable the extended grapheme tests.

Edit: I'm happy to submit a PR if there's an agreement on this issue.

Cool-Katt commented 1 month ago

The policy is usually to discuss this first on the Exercism forum, as it is the most active place for such issues and gets the most eyeballs. If you leave it here it'll probably go nowhere, so head on over to the forum and start a new thread.

IsaacG commented 1 month ago

Moved to https://forum.exercism.org/t/micro-blog-exercise-should-cover-graphemes/13315

exercism / problem-specifications

Micro-blog instructions should explain graphemes and possibly test extended graphemes #2483