CeleritasCelery / rune

Rust VM for Emacs
GNU General Public License v3.0

Text representation #15

Open CeleritasCelery opened 1 year ago

CeleritasCelery commented 1 year ago

Emacs docs, github comment with explanation

Emacs uses an extended UTF-8 internally. It uses code points beyond the Unicode range (up to 0x3FFFFF), so a character can take up to 5 bytes instead of the 4 that standard UTF-8 allows. It also has a special "raw byte" encoding that is used for byte values 128-255.

raw byte encoding

src/character.h

character code  1st byte   byte sequence
--------------  --------   -------------
     0-7F       00..7F     0xxxxxxx
    80-7FF      C2..DF     110yyyyx 10xxxxxx
   800-FFFF     E0..EF     1110yyyy 10yxxxxx 10xxxxxx
 10000-1FFFFF   F0..F7     11110yyy 10yyxxxx 10xxxxxx 10xxxxxx
200000-3FFF7F   F8         11111000 1000yxxx 10xxxxxx 10xxxxxx 10xxxxxx
3FFF80-3FFFFF   C0..C1     1100000x 10xxxxxx (for eight-bit-char)
400000-...      invalid

invalid 1st byte    80..BF     10xxxxxx
         F9..FF    11111yyy

In each bit pattern, 'x' and 'y' each represent a single bit of the
character code payload, and at least one 'y' must be a 1 bit.
In the 5-byte sequence, the 22-bit payload cannot exceed 3FFF7F.
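To make the table concrete, here is a minimal sketch of an encoder that follows it row by row. The function name `encode_emacs_char` is hypothetical (it is not taken from Emacs, remacs, or rune), and the sketch assumes the table is the complete specification.

```rust
/// Encode a character code using the byte layout from the src/character.h table.
/// Returns `None` for codes at or above 0x400000, which the table marks invalid.
/// Hypothetical helper for illustration only.
fn encode_emacs_char(c: u32) -> Option<Vec<u8>> {
    // Continuation byte (10xxxxxx) carrying the 6 payload bits starting at `shift`.
    let cont = |shift: u32| 0x80 | ((c >> shift) & 0x3F) as u8;
    let bytes = match c {
        0..=0x7F => vec![c as u8],
        0x80..=0x7FF => vec![0xC0 | (c >> 6) as u8, cont(0)],
        0x800..=0xFFFF => vec![0xE0 | (c >> 12) as u8, cont(6), cont(0)],
        0x1_0000..=0x1F_FFFF => {
            vec![0xF0 | (c >> 18) as u8, cont(12), cont(6), cont(0)]
        }
        // 5-byte form: the lead byte is always F8.
        0x20_0000..=0x3F_FF7F => vec![0xF8, cont(18), cont(12), cont(6), cont(0)],
        // eight-bit-char: raw bytes 0x80..=0xFF use the short C0/C1 form.
        0x3F_FF80..=0x3F_FFFF => {
            let b = (c - 0x3F_FF00) as u8;
            vec![0xC0 | ((b >> 6) & 1), 0x80 | (b & 0x3F)]
        }
        _ => return None,
    };
    Some(bytes)
}
```

With this sketch, `encode_emacs_char(0x20AC)` gives `E2 82 AC` (the same as standard UTF-8), while `encode_emacs_char(0x3FFF89)` gives the two bytes `C0 89`.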

remacs source

Raw 8-bit bytes are represented by codepoints 0x3FFF80 to 0x3FFFFF. However, in the UTF-8 like encoding, where they should be represented by a 5-byte sequence starting with 0xF8, they are instead represented by a 2-byte sequence starting with 0xC0 or 0xC1. These 2-byte sequences are disallowed in UTF-8, because they would form a duplicate encoding for the 1-byte ASCII range.

Raw bytes are either plain ASCII or, if they are above the ASCII range (over 127), encoded using extended code points. These extended code points don't follow the normal rules: instead of the 5-byte form their value would call for, they occupy two bytes, using the lead bytes 0xC0 and 0xC1 that sit in the gap between the 1-byte and 2-byte ranges. For example, if I wanted to encode 137 (#o211, #x89) as a raw byte, it would be code point 0x3FFF89. Notice that the hex value of the byte is the last byte of the code point. However, I would lay it out in memory like this:

1100_0000 1000_1001
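The mapping between a raw byte, its code point, and its two-byte in-memory form can be sketched as below. All names here (`RAW_BYTE_OFFSET`, `raw_byte_to_code`, `encode_raw_byte`, `decode_raw_byte`) are hypothetical and chosen for illustration. The bit layout is the `1100000x 10xxxxxx` row of the character.h table.

```rust
/// Offset such that raw byte `b` (0x80..=0xFF) has code point RAW_BYTE_OFFSET + b.
const RAW_BYTE_OFFSET: u32 = 0x3F_FF00;

/// Code point for a byte: ASCII maps to itself, 0x80..=0xFF to 0x3FFF80..=0x3FFFFF.
fn raw_byte_to_code(b: u8) -> u32 {
    if b < 0x80 {
        b as u32
    } else {
        RAW_BYTE_OFFSET + b as u32
    }
}

/// Two-byte in-memory form of a non-ASCII raw byte: 1100000x 10xxxxxx.
/// Returns `None` for ASCII, which is stored as a single byte.
fn encode_raw_byte(b: u8) -> Option<[u8; 2]> {
    if b < 0x80 {
        return None;
    }
    // Bit 6 of the byte goes into the lead byte, the low 6 bits into the tail.
    Some([0xC0 | ((b >> 6) & 1), 0x80 | (b & 0x3F)])
}

/// Inverse of `encode_raw_byte`; rejects anything not in the C0/C1 form.
fn decode_raw_byte(bytes: [u8; 2]) -> Option<u8> {
    match bytes {
        [lead @ (0xC0 | 0xC1), tail @ 0x80..=0xBF] => {
            Some(0x80 | ((lead & 1) << 6) | (tail & 0x3F))
        }
        _ => None,
    }
}
```

For the byte 0x89 used in the example, `raw_byte_to_code` returns 0x3FFF89 and `encode_raw_byte` returns `[0xC0, 0x89]`.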

display

One tricky thing about this layout is that the same display representation can have two meanings. For example if I see

\211

It can be either code point 0x89 or code point 0x3FFF89: the former is a normal unprintable Unicode character, the latter a raw byte from Emacs's extended UTF-8. This can be confusing.
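A small sketch of the ambiguity, assuming the display rule is "print the octal value of the byte": both code points produce the same escape, so the text alone cannot tell them apart. The `EmacsChar` enum and both function names are hypothetical. They only illustrate how a tagged type could keep the two cases distinct internally.

```rust
/// Hypothetical tagged character: a regular code point or a raw byte.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EmacsChar {
    Unicode(u32),
    RawByte(u8),
}

/// Classify a character code using the ranges from the character.h table.
fn classify(code: u32) -> Option<EmacsChar> {
    match code {
        0..=0x3F_FF7F => Some(EmacsChar::Unicode(code)),
        0x3F_FF80..=0x3F_FFFF => Some(EmacsChar::RawByte((code - 0x3F_FF00) as u8)),
        _ => None,
    }
}

/// Octal escape in the style of `\211` (assumed display rule, for illustration).
fn octal_escape(c: EmacsChar) -> String {
    let shown = match c {
        EmacsChar::Unicode(code) => code,
        // A raw byte is shown by its byte value, not its full code point.
        EmacsChar::RawByte(b) => b as u32,
    };
    format!("\\{:o}", shown)
}
```

Here `classify(0x89)` and `classify(0x3FFF89)` return different variants, yet `octal_escape` renders both as `\211`.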

Solution 1 - Create custom encoding format

This is the approach that Remacs took. They basically have to reimplement all the string primitives on the new encoding. This has the disadvantage that you can't reuse existing Rust libraries for strings. Things like regex will probably be okay, because they operate on &[u8] directly, but they will fail to match raw bytes, because those have a different representation.
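Both drawbacks can be seen with the standard library alone. The sketch below uses a tiny buffer holding raw byte 0xFF in the two-byte form (`C1 BF`). `BUF`, `is_std_utf8`, and `contains_byte` are hypothetical names, and `contains_byte` is only a naive stand-in for a matcher that works on `&[u8]`, not the regex crate itself.

```rust
use std::str;

/// "a", then raw byte 0xFF in the two-byte eight-bit-char form (C1 BF), then "b".
const BUF: [u8; 4] = [b'a', 0xC1, 0xBF, b'b'];

/// Whether the bytes can be viewed as a normal Rust `&str`.
fn is_std_utf8(bytes: &[u8]) -> bool {
    str::from_utf8(bytes).is_ok()
}

/// Naive byte-level search, standing in for a byte-oriented matcher.
fn contains_byte(haystack: &[u8], needle: u8) -> bool {
    haystack.iter().any(|&b| b == needle)
}
```

`is_std_utf8(&BUF)` is false, because 0xC1 is never a valid lead byte in standard UTF-8, so `&str`-based libraries cannot take the buffer. `contains_byte(&BUF, 0xFF)` is also false: the raw byte is stored as `C1 BF`, so a search for the literal byte 0xFF does not find it.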

Solution 2 - Use bstr and assume conventional UTF-8

We still allow inserting any byte value into the buffer, but if it is invalid we just leave it as is. Strings will use the bstr approach of validating the UTF-8 as needed. This is useful because there is a community of crates that support "conventional UTF-8" instead of the "always UTF-8" guarantee that normal Rust strings follow. So long as a crate takes [u8] or bstr, we can use it.
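As a rough illustration of "validate as needed", the sketch below splits a byte buffer into valid UTF-8 runs and invalid byte runs using only `std::str::from_utf8` and `Utf8Error`. It does not use the bstr crate and is not its implementation. The `Chunk` type and `chunks` function are hypothetical names.

```rust
use std::str;

/// A run of the buffer: either valid UTF-8 text or bytes that failed validation.
#[derive(Debug, PartialEq)]
enum Chunk<'a> {
    Text(&'a str),
    Invalid(&'a [u8]),
}

/// Split `bytes` into alternating valid and invalid runs, left to right.
fn chunks(mut bytes: &[u8]) -> Vec<Chunk<'_>> {
    let mut out = Vec::new();
    while !bytes.is_empty() {
        match str::from_utf8(bytes) {
            Ok(s) => {
                out.push(Chunk::Text(s));
                break;
            }
            Err(e) => {
                let (valid, rest) = bytes.split_at(e.valid_up_to());
                if !valid.is_empty() {
                    // This prefix was just reported as valid, so unwrap cannot fail.
                    out.push(Chunk::Text(str::from_utf8(valid).unwrap()));
                }
                // `error_len` is None when the input ends in the middle of a sequence;
                // in that case the whole remainder is treated as invalid.
                let bad = e.error_len().unwrap_or(rest.len());
                out.push(Chunk::Invalid(&rest[..bad]));
                bytes = &rest[bad..];
            }
        }
    }
    out
}
```

For the input `b"ab\xFFcd"` this yields `Text("ab")`, `Invalid([0xFF])`, `Text("cd")`: the stray byte stays in the buffer untouched, and everything around it is still usable as ordinary text.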

Alan-Chen99 commented 1 year ago

What if we keep two formats of string: 1) Rust strings and 2) a vector of chars? Some operations will demand 1) (inserts) and some 2) (regex). We convert to whatever we need, in the hope that strings we run regex over won't be inserted into much and vice versa (is this true?).

(sorry wrong thread)