dcz-self closed 3 years ago
Excellent catch. Let's remove "ł" from the list and use "héllo" instead. Probably say something like this:
In other words, what we would expect to be a single character, such as é or ł, can in practice be multiple codepoints, each represented by potentially multiple bytes. Consider the following:
What do you think? Would you like to send a PR? There are a couple of ways you could count codepoints but, to be honest, I wouldn't focus on that, because the codepoint count is the least used of these measures.
There's still the trouble with "ł", where I couldn't find a way to represent it with multiple codepoints. Regarding code points, I have no idea how to count them. I think they are important for interoperability, because graphemes have their own baggage: popular languages such as Rust and Python don't have a simple way to count them, and on top of that a single grapheme can fall apart into several. E.g. emoji + ZWJ + modifier may or may not get collapsed into a single grapheme, depending on the available font.
I propose:
In other words, what we would expect to be a single character, such as é, can in practice be multiple codepoints (in this case, e and an acute accent), each represented by potentially multiple bytes. Consider the following:
If you show me how to count code points, I'll submit the result.
I meant to remove ł from the text but I forgot. :D
Your suggestion is perfect. length(String.to_charlist(...)) returns the number of codepoints.
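For reference, here is a minimal sketch of the three counts side by side. It assumes "héllo" is written in decomposed form ("e" followed by U+0301 COMBINING ACUTE ACCENT), spelled out with an escape so the result doesn't depend on how an editor normalizes the text. The variable name `decomposed` and the labels are just for illustration, not something from the guide.

```elixir
# "héllo" with a decomposed é: "e" followed by U+0301 COMBINING ACUTE ACCENT.
# The escape makes the decomposed form explicit (an assumption for this sketch).
decomposed = "he\u0301llo"

# Graphemes: what a reader perceives as characters.
IO.inspect(String.length(decomposed), label: "graphemes")

# Bytes: the size of the UTF-8 encoding.
IO.inspect(byte_size(decomposed), label: "bytes")

# Codepoints: a charlist holds one integer per codepoint, so its length is the count.
IO.inspect(length(String.to_charlist(decomposed)), label: "codepoints")

# String.codepoints/1 is another way to get the same count.
IO.inspect(length(String.codepoints(decomposed)), label: "codepoints (String.codepoints/1)")
```

If the é were typed as the single precomposed codepoint U+00E9 instead, the codepoint and byte counts would come out lower, so it's probably worth keeping the escape in any example that is meant to show the difference.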
Cool, thank you. Closing in favor of the PR.
The text mentions Unicode characters (rather unambiguously code points), but the example counts graphemes and bytes, and not characters:
A better string would be "héllo", where the "é" is "e" + "U+0301 COMBINING ACUTE ACCENT" (there's no standalone stroke of the "ł" that I could find, making the "can be multiple characters" assertion doubtful). It would result in:
I think what's missing from the tutorial is a way to count code points.