bingoohuang / blog

write blogs with issues

UTF-8 #203

Open bingoohuang opened 3 years ago

Encoding rules

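| 1st Byte | 2nd Byte | 3rd Byte | 4th Byte | Number of Free Bits | Maximum Expressible Unicode Value |
|----------|----------|----------|----------|---------------------|-----------------------------------|
| 0xxxxxxx |          |          |          | 7                   | 007F hex (127)                    |
| 110xxxxx | 10xxxxxx |          |          | (5+6)=11            | 07FF hex (2047)                   |
| 1110xxxx | 10xxxxxx | 10xxxxxx |          | (4+6+6)=16          | FFFF hex (65535)                  |
| 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | (3+6+6+6)=21        | 10FFFF hex (1,114,111)            |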

UTF-8 Encoding

Bear plus snowflake equals polar bear

https://andysalerno.com/posts/weird-emojis/#

👩🏾 + ❤ + 💋 + 👩🏻 = 👩🏾‍❤️‍💋‍👩🏻

๐Ÿป (bear; U+1F43B) + โ„ (snowflake; U+2744) \= ๏ธ๏ธ(polar bear; U+1F43B U+200D U+2744 U+FE0F)

So, as we have learned, a Unicode character can be made of multiple bytes, but it can also be made of multiple other Unicode characters. And they can be quite large: 35 bytes, in the earlier example.

package main

import (
    "fmt"
    "reflect"
)

func main() {
    // Each sample renders as one emoji but may consist of several runes (code points).
    for _, s := range []string{"🙂", "👩🏾‍❤️‍💋‍👩🏻", "👩🏿", "👩‍🚀️"} {
        runes := []rune(s)
        fmt.Println(s, "is this many runes:", len(runes), "printed as strings:", runesAsStrings(runes))
    }

    // Creating a rune
    rune1 := 'B'
    rune2 := 'g'
    rune3 := '\a'

    // Displaying rune and its type (%q avoids emitting the raw bell character for '\a')
    fmt.Printf("Rune 1: %q; %08b Unicode: %U; Type: %s\n", rune1, rune1, rune1, reflect.TypeOf(rune1))
    fmt.Printf("Rune 2: %q; %08b Unicode: %U; Type: %s\n", rune2, rune2, rune2, reflect.TypeOf(rune2))
    fmt.Printf("Rune 3: %q; %08b Unicode: %U; Type: %s\n", rune3, rune3, rune3, reflect.TypeOf(rune3))
}

// runesAsStrings converts each rune to its own string so the code points print separately.
func runesAsStrings(runes []rune) []string {
    strs := make([]string, 0, len(runes))
    for _, r := range runes {
        strs = append(strs, string(r))
    }
    return strs
}

That's why it's called a rune (a code point), and not a grapheme cluster ;)

https://www.reddit.com/r/golang/comments/o1o5hr/fyi_a_single_go_rune_is_not_the_same_as_a_single

  1. String length (in bytes) is not always rune count
  2. Rune count is not always display width (in a monospace font)
  3. Unicode is hard