Open jingkaimori opened 1 year ago
The C++20 standard library does not support iterating over the code points of a UTF-8 string, so developers who use UTF-8 have to use codecvt to convert the whole string from UTF-8 to a u32string. So my suggestion is to use UTF-32 for both the source code and the internal string variables.
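A minimal sketch of that whole-string conversion, written by hand rather than with codecvt. The names `next_code_point` and `decode_utf8` are hypothetical and not part of this project. Malformed input is replaced with U+FFFD, and overlong forms and surrogates are not rejected, so this only illustrates the idea and is not a validating decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decodes the code point that starts at byte offset `pos` and advances `pos`.
// Precondition: pos < s.size(). Malformed or truncated sequences yield U+FFFD
// and skip a single byte.
char32_t next_code_point(const std::string& s, std::size_t& pos) {
    const auto byte = [&](std::size_t i) { return static_cast<unsigned char>(s[i]); };
    const unsigned char lead = byte(pos);
    std::size_t len = 0;
    char32_t cp = 0;
    if (lead < 0x80)                { len = 1; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
    else { ++pos; return 0xFFFD; }                        // stray continuation byte
    if (pos + len > s.size()) { ++pos; return 0xFFFD; }   // truncated sequence
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = byte(pos + i);
        if ((c & 0xC0) != 0x80) { ++pos; return 0xFFFD; } // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
    }
    pos += len;
    return cp;
}

// Converts a whole UTF-8 string to UTF-32, one code point per element.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    for (std::size_t pos = 0; pos < s.size();) {
        out.push_back(next_code_point(s, pos));
    }
    return out;
}
```

Calling `decode_utf8` on the bytes `61 F0 9F 98 8E 7A` gives a three-element u32string whose middle element is U+1F60E.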
Sounds good. Using int32 to represent a character is convenient when handling non-ASCII characters, for example the emoji 😎.
The standard library does not provide consistent Unicode support for streams such as `cin` and `cout`. The only character types a developer can read from stdio are `char` and `wchar_t`. It's known that `wchar_t` behaves differently between Windows and Unix-like systems.
For file and terminal I/O, we should use UTF-8 explicitly. On Windows, we can use SetConsoleOutputCP to set the code page of the console.
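A small sketch of the Windows part. The wrapper name `use_utf8_console` is hypothetical, and the non-Windows branch simply assumes the terminal already uses UTF-8.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Switches the console output code page to UTF-8 where that is needed.
// Returns true on success.
bool use_utf8_console() {
#ifdef _WIN32
    // CP_UTF8 is code page 65001; SetConsoleOutputCP returns nonzero on success.
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;  // assumption: Unix-like terminals are already UTF-8
#endif
}
```

Calling this once at startup, before anything is written to stdout, would be the natural place for it.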
Unicode support in the standard library is deprecated: the `<codecvt>` facets were already deprecated in C++17, and C++20 deprecates the remaining `char`-based UTF-16 and UTF-32 `codecvt` specializations. We must use a third-party library to support Unicode.
@PikachuHy So which library should we use, icu4c or utf8proc?
I have no idea yet. What's the difference between icu4c and utf8proc?
@PikachuHy Generally, icu4c has broader Unicode support than other libraries, but takes more space.
Features | utf8proc | icu4c | iconv | nowide |
---|---|---|---|---|
Encode and decode utf-8 string | ✔ | ✔ | ✔ | ✔ |
Iterate through utf-8 codepoints | ✔ | ✔ | ❌ | ❌ |
General Categories | ✔ | ✔ | ❌ | ❌ |
Bidirectional Categories | ✔ | ✔ | ❌ | ❌ |
Other Categories | ❌ | ✔ | ❌ | ❌ |
Decomposition type | ✔ | ✔ | ❌ | ❌ |
Boundary detection (TR29) | ✔ | ✔ | ❌ | ❌ |
Unicode normalization forms (TR15) | ✔ | ✔ | ❌ | ❌ |
Bidirectional algorithm (TR9) | ❌ | ✔ | ❌ | ❌ |
Case mapping | ✔ | ✔ | ❌ | ❌ |
Codepage detection | ❌ | ✔ | ❌ | ❌ |
Codepage conversion | ❌ | ✔ | ✔ | ❌ |
Locales, Formatting, IDNA, etc. | ❌ | ✔ | ❌ | ❌ |
Transliteration | ❌ | ✔ | ❓ | ❌ |
Collation, searching and regexp (alphabetic-order based) | ❌ | ✔ | ❌ | ❌ |
IOStream(stdio) | ❌ | ✔ | ❌ | ✔ |
System command, command line arguments, environment variable | ❌ | ❌ | ❌ | ✔ |
What happened + What you expected to happen
R7RS suggests that a string may contain Unicode characters, and that built-in string procedures such as `(string-ci=?)` should handle i18n case mapping. So this interpreter should adjust the internal encoding of the `String` class.

There are several suggestions and requirements for Unicode string handling in Scheme. R7RS does not require constant complexity for `(string-set!)` and `(string-ref)`, but it does require that a string index be a code point index. Neither R7RS nor The Scheme Programming Language suggests exposing surrogate pairs the way Java does.

Support for Unicode strings varies between the standard library and spdlog. The standard library supports indexing by code point on UTF-32 strings, but indexing on UTF-8 and UTF-16 strings is by code unit (a byte index for UTF-8), not by code point. The standard library also supports case mapping for Unicode. According to the API of spdlog, that library may only accept UTF-8 encoded messages.
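To make the indexing difference concrete, here is a small sketch. The constants `kUtf8` and `kUtf32` are illustrative names, and the text is written with escapes so the example does not depend on the source file's encoding.

```cpp
#include <cassert>
#include <string>

// "a", U+1F60E (the sunglasses emoji), "z" in two encodings.
const std::string    kUtf8  = "a\xF0\x9F\x98\x8Ez";  // 1 + 4 + 1 code units
const std::u32string kUtf32 = U"a\U0001F60Ez";       // 3 code units

// Index 1 of the UTF-8 string is only the lead byte of the emoji,
// while index 1 of the UTF-32 string is the whole code point.
const unsigned char kUtf8At1  = static_cast<unsigned char>(kUtf8[1]);
const char32_t      kUtf32At1 = kUtf32[1];
```

Here `kUtf8.size()` is 6 and `kUtf32.size()` is 3, so only the UTF-32 length matches the number of code points.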
So I suggest using UTF-32 as the encoding of the strings in this project. Although UTF-32 consumes more memory, because each character occupies 4 bytes, it makes locating a code point by a given index easier than UTF-8 or UTF-16 does.
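As a rough sketch of what this could look like, the struct below stores a Scheme string as a `std::u32string`. `SchemeString` and its member names are hypothetical stand-ins, not the project's actual `String` class.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the interpreter's String class.
struct SchemeString {
    std::u32string data;  // one element per code point

    // (string-ref s k): constant time, k is a code point index.
    char32_t ref(std::size_t k) const {
        assert(k < data.size());
        return data[k];
    }

    // (string-set! s k c): constant time, replaces one code point in place.
    void set(std::size_t k, char32_t c) {
        assert(k < data.size());
        data[k] = c;
    }
};
```

With a UTF-8 backing store, `set` could change the byte length of the string and `ref` would need a scan from the start, which is the trade-off described above.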
Another choice is to use UTF-8. For mostly ASCII text such as source code, UTF-8 consumes less space than UTF-16 and UTF-32, and it can be scanned from start to end, so it is suitable for storing Scheme input code. However, the C++ standard library lacks methods to iterate over the code points of a UTF-8 string.
Reproduction way
THIS IS A FEATURE REQUEST AND PROPOSAL.