Example about Unicode characters mixes up bytes and characters

elixir-lang / elixir-lang.github.com

Website for Elixir

elixir-lang.org

Example about Unicode characters mixes up bytes and characters #1553

Closed dcz-self closed 3 years ago

dcz-self commented 3 years ago

The text mentions Unicode characters (rather unambiguously code points), but the example counts graphemes and bytes, and not characters:

In other words, what we would expect to be a single character, such as é or ł, can in practice be multiple characters, each represented by potentially multiple bytes. Consider the following:
iex> string = "hełło"
"hełło"
iex> String.length(string)
5
iex> byte_size(string)
7

A better string would be "héllo", with "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT. (I could not find a standalone combining stroke for "ł", which makes the "can be multiple characters" assertion doubtful for that letter.) It would result in:

5 graphemes
6 code points
7 bytes

I think what's missing from the tutorial is a way to count code points.

josevalim commented 3 years ago

Excellent catch. Let's remove ł from the list and use héllo instead. Probably say something like this:

In other words, what we would expect to be a single character, such as é or ł, can in practice be multiple codepoints, each represented by potentially multiple bytes. Consider the following:

What do you think? Would you like to send a PR? There are a couple ways you could count codepoints but, to be honest, I wouldn't focus on that, because it is the least used of them.

dcz-self commented 3 years ago

There's still the trouble with "ł": I couldn't find a way to represent it with multiple codepoints. As for counting code points, I have no idea how to do it. I think they are important for interoperability, because graphemes have their own baggage: popular languages like Rust and Python don't have a simple way to count them, and on top of that a grapheme can fall apart into several. E.g. emoji + ZWJ + modifier may get collapsed into a single grapheme or not, depending on the available font.
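The difference between "é" and "ł" can be checked with Unicode normalization. The following is only a sketch, not part of the tutorial text under discussion; the variable names are made up for illustration. It relies on String.normalize/2 with the :nfd form, which applies canonical decomposition:

```elixir
# "é" as one precomposed code point (U+00E9) and as "e" + U+0301.
precomposed = "\u00E9"
decomposed = "e\u0301"

# Canonical decomposition (NFD) splits the precomposed form in two.
IO.inspect(String.normalize(precomposed, :nfd) == decomposed)
# prints true

# "ł" (U+0142) has no canonical decomposition, so NFD leaves it alone.
stroke_l = "\u0142"
IO.inspect(String.normalize(stroke_l, :nfd) == stroke_l)
# prints true
IO.inspect(length(String.codepoints(stroke_l)))
# prints 1
```

So "ł" stays a single code point under normalization, while "é" has both a one-code-point and a two-code-point spelling.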

I propose:

In other words, what we would expect to be a single character, such as é, can in practice be multiple codepoints (in this case, e and an acute accent), each represented by potentially multiple bytes. Consider the following:

If you show me how to count code points, I'm going to submit the result.

josevalim commented 3 years ago

I meant to remove ł from the text but I forgot. :D

Your suggestion is perfect. length(String.to_charlist(...)) returns the number of codepoints.
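A minimal sketch of the three counts discussed in this thread, assuming the accent is written as an explicit "\u0301" escape so the string stays decomposed regardless of how the source text was typed or pasted. The variable names are illustrative only:

```elixir
# Spell out the combining accent so the string is decomposed no matter
# how the editor or clipboard normalizes the source text.
string = "he\u0301llo"

graphemes = String.length(string)               # user-perceived characters
codepoints = length(String.to_charlist(string)) # Unicode code points
bytes = byte_size(string)                       # UTF-8 encoded size

IO.inspect({graphemes, codepoints, bytes})
# prints {5, 6, 7}

# String.codepoints/1 gives the same count as the charlist route.
IO.inspect(length(String.codepoints(string)) == codepoints)
# prints true
```

The counts match the 5 graphemes, 6 code points and 7 bytes listed in the original report. With the precomposed "é" (U+00E9), the same calls would give 5, 5 and 6 instead, which is why the escape matters here.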

dcz-self commented 3 years ago

Done in https://github.com/elixir-lang/elixir-lang.github.com/pull/1554

josevalim commented 3 years ago

Cool, thank you. Closing in favor of the PR.