Open CeleritasCelery opened 1 year ago
What if we keep two formats of string:
1) rust strings
2) vector of chars
Some operations will demand 1) (inserts) and some 2) (regex). We convert to whatever we need, in the hope that strings we run regex over won't be inserted into much and vice versa (is this true?)
(sorry wrong thread)
Emacs docs
github comment with explanation
Emacs uses an extended UTF-8 encoding internally. It uses code points beyond the Unicode maximum of 0x10FFFF (up to 0x3FFFFF) and can therefore take up to 5 bytes per character instead of the normal 4 for Unicode. It also has a special "raw byte" encoding that is used for byte values 128-255.
raw byte encoding
src/character.h
remacs source
Raw bytes are either plain ASCII or, if they are above the ASCII range (over 127), encoded using extended code points beyond Unicode. These extended code points don't follow the normal rules: they occupy two bytes, in the space between the 1-byte and 2-byte ranges. For example, if I wanted to encode 137 (#o211 #x89) as a raw byte, it would be code point 0x3FFF89. Notice that the hex value is the last byte of the code point. However, I would lay it out in memory like this:
00000 0000_1001
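A minimal Rust sketch of the mapping described above, assuming it follows the BYTE8_TO_CHAR and BYTE8_STRING definitions in src/character.h as I read them. The function and constant names here are hypothetical and exist only for illustration. Under that reading, raw byte 0x89 maps to code point 0x3FFF89 and is stored as the two bytes 0xC0 0x89 (11000000 10001001), a sequence that standard UTF-8 validation rejects.

```rust
/// Offset added to a raw byte to form its extended code point
/// (assumption: mirrors BYTE8_TO_CHAR in src/character.h).
const BYTE8_OFFSET: u32 = 0x3FFF00;

/// Map a raw byte (0x80..=0xFF) to its extended code point.
/// ASCII bytes are not raw bytes, so they return None.
fn raw_byte_to_char(b: u8) -> Option<u32> {
    if b >= 0x80 {
        Some(BYTE8_OFFSET + u32::from(b))
    } else {
        None
    }
}

/// Encode a raw byte as the two-byte in-memory form
/// (assumption: mirrors BYTE8_STRING in src/character.h).
/// The lead byte is 0xC0 or 0xC1, which standard UTF-8 never uses.
fn encode_raw_byte(b: u8) -> Option<[u8; 2]> {
    if b < 0x80 {
        return None;
    }
    let lead = 0xC0 | ((b >> 6) & 0x01); // carries bit 6 of the raw byte
    let cont = 0x80 | (b & 0x3F); // carries the low six bits
    Some([lead, cont])
}

/// Decode the two-byte form back into the raw byte it stands for.
fn decode_raw_byte(bytes: [u8; 2]) -> Option<u8> {
    match bytes {
        [lead @ (0xC0 | 0xC1), cont @ 0x80..=0xBF] => {
            // Bit 7 is always set for raw bytes; bit 6 comes from the lead byte.
            Some(0x80 | ((lead & 0x01) << 6) | (cont & 0x3F))
        }
        _ => None,
    }
}

/// True if Rust's standard validator accepts the encoded form as UTF-8.
fn is_standard_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```

Calling `encode_raw_byte(0x89)` and passing the result to `is_standard_utf8` shows why this format cannot be handed to code that expects a Rust `str`: the lead bytes 0xC0 and 0xC1 are never valid in standard UTF-8.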
display
One tricky thing about this layout is that the same display representation can have two meanings. For example, if I see \211, it can be either code point 0x89 or code point 0x3FFF89, the former being a normal unprintable Unicode character, the latter a raw byte from Emacs's extended UTF-8. This can be confusing.
Solution 1 - Create custom encoding format
This is the approach that Remacs took. They basically have to reimplement all the string primitives on the new encoding. This has the disadvantage that you can't reuse existing Rust libraries for strings. Things like regex will probably be okay, because they operate on &[u8] directly, but they will fail to match raw bytes, because those have a different representation.
Solution 2 - Use bstr and assume conventional UTF-8
We still allow inserting any byte value into the buffer, but if it is invalid we just leave it as is. Strings will use the bstr approach of validating the UTF-8 as needed. This is useful because there is a community of crates that support "conventional UTF-8" instead of the "always UTF-8" that normal Rust strings follow. So long as a crate takes [u8] or bstr, we can use it.
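To make the "validate as needed" idea concrete, here is a rough sketch using only the standard library. It is not how bstr is implemented, and `ByteText` and its methods are hypothetical names made up for illustration. The point is only that the buffer stores whatever bytes it is given and checks for valid UTF-8 when a caller asks for text.

```rust
use std::str;

/// Hypothetical buffer that is conventionally, but not necessarily, UTF-8.
struct ByteText {
    bytes: Vec<u8>,
}

impl ByteText {
    fn new() -> Self {
        Self { bytes: Vec::new() }
    }

    /// Insert any bytes at `pos`, valid UTF-8 or not. Nothing is re-encoded.
    /// Panics if `pos` is greater than the current length.
    fn insert(&mut self, pos: usize, data: &[u8]) {
        let _ = self.bytes.splice(pos..pos, data.iter().copied());
    }

    /// Validate on demand: split the contents into the longest valid UTF-8
    /// prefix and the remaining bytes, which start at the first invalid byte.
    fn split_valid(&self) -> (&str, &[u8]) {
        let valid_len = match str::from_utf8(&self.bytes) {
            Ok(s) => s.len(),
            Err(e) => e.valid_up_to(),
        };
        let (good, rest) = self.bytes.split_at(valid_len);
        (str::from_utf8(good).expect("prefix was just validated"), rest)
    }

    /// Byte view for crates that accept `&[u8]`.
    fn as_bytes(&self) -> &[u8] {
        &self.bytes
    }
}
```

With this shape, inserting `b"hello"`, then the single byte 0x89, then `b" world"` leaves all twelve bytes in place. `split_valid` would hand back "hello" as a `str` plus the unvalidated remainder, and `as_bytes` exposes the whole buffer to any byte-oriented crate.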