[Feature Request] Unicode support for string

jingkaimori commented 1 year ago

Search before asking

[x] I searched the issues and found no similar issues.

What happened + What you expected to happen

R7RS suggests that strings may contain Unicode characters, and that built-in string procedures such as (string-ci=?) should handle i18n case mapping. So this interpreter should change the internal encoding of its String class.

There are several suggestions and requirements for Unicode string handling in Scheme. R7RS does not require (string-set!) and (string-ref) to run in constant time, but it does require that a string index is a code point index. Neither R7RS nor The Scheme Programming Language suggests exposing surrogate pairs the way Java does.

Unicode string support differs between the C++ standard library and spdlog. The standard library indexes UTF-32 strings by code point, but indexing a UTF-8 or UTF-16 string is by code unit (bytes for UTF-8, 16-bit units for UTF-16) rather than by code point. The standard library also supports case mapping for Unicode. Judging from spdlog's API, that library may only accept UTF-8 encoded messages.

So I suggest using UTF-32 as the encoding for strings in this project. Although UTF-32 consumes more memory, because every character occupies 4 bytes, it is easier than UTF-8 or UTF-16 to locate a code point by a given index.
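To make the trade-off concrete, here is a minimal sketch comparing how many code units the string "a😎" needs in each encoding, and how a code point lookup could look on a UTF-32 buffer. The helper names (`utf8_units`, `utf16_units`, `utf32_units`, `string_ref`) are hypothetical and not taken from pscm.

```cpp
#include <cstddef>
#include <string>

// Code units needed to store "a" followed by U+1F60E in each encoding.
// The emoji takes 4 UTF-8 bytes, 2 UTF-16 units (a surrogate pair),
// and a single UTF-32 unit.
inline std::size_t utf8_units()  { return std::string("a\xF0\x9F\x98\x8E").size(); }
inline std::size_t utf16_units() { return std::u16string(u"a\U0001F60E").size(); }
inline std::size_t utf32_units() { return std::u32string(U"a\U0001F60E").size(); }

// Hypothetical (string-ref s k): with UTF-32 storage, index k is already
// a code point index, so the lookup is a plain bounds-checked array access.
inline char32_t string_ref(const std::u32string& s, std::size_t k) {
    return s.at(k);
}
```

With UTF-8 or UTF-16 storage the same lookup would have to scan from the start of the string, which R7RS permits but which makes (string-ref) linear in the index.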

~~Another choice is to use UTF-8. UTF-8 consumes less space than UTF-16 and UTF-32 and can be scanned from start to end, so it can be used to store Scheme input code.~~ The C++ standard library lacks methods to iterate over the code points of a UTF-8 string.

Reproduction way

THIS IS A FEATURE REQUEST AND PROPOSAL.

Anything else

Are you willing to submit a PR?

[x] Yes I am willing to submit a PR!

jingkaimori commented 1 year ago

The standard library in C++20 does not support iterating over a UTF-8 string; developers who use UTF-8 have to use codecvt to convert the whole string from UTF-8 to u32string. So my suggestion is to use UTF-32 for both source code and internal string values.
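For illustration, a whole-string conversion from UTF-8 to u32string can also be written by hand without codecvt. The following is only a sketch under stated assumptions: `decode_utf8` is a hypothetical helper, malformed or truncated input is replaced with U+FFFD, and overlong forms and encoded surrogates are not rejected, so it is not a full validator.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of pscm): decode UTF-8 bytes to code points.
inline std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::size_t extra = 0;  // continuation bytes expected after the lead byte
        char32_t cp = 0;
        if (b < 0x80)                { extra = 0; cp = b; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; }
        else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= in.size()) { ok = false; break; }          // truncated input
            unsigned char c = static_cast<unsigned char>(in[i]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }      // not a continuation
            cp = (cp << 6) | (c & 0x3F);                        // append 6 payload bits
            ++i;
        }
        out.push_back(ok ? cp : char32_t{0xFFFD});
    }
    return out;
}
```

After decoding once at the input boundary, the interpreter could keep every string as u32string and index it directly.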

PikachuHy commented 1 year ago

Sounds good. Using int32 to represent a character is convenient when handling non-ASCII characters, for example the emoji 😎.

jingkaimori commented 1 year ago

The standard library does not provide consistent Unicode support for streams such as cin and cout. The only character types a developer can read from stdio are char and wchar_t, and wchar_t is known to behave differently on Windows (16 bits) and Unix-like systems (typically 32 bits).

For file and terminal I/O, we should use UTF-8 explicitly. On Windows, we can use SetConsoleOutputCP to set the code page of the console.
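A rough sketch of that output path, assuming strings are held as UTF-32 internally and encoded to UTF-8 only at the I/O boundary. `enable_utf8_console` and `encode_utf8` are hypothetical names; the only platform call used is the Win32 `SetConsoleOutputCP` with `CP_UTF8`, and on other systems the sketch simply assumes the terminal already expects UTF-8.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: switch the Windows console output code page to UTF-8.
// On non-Windows systems this does nothing and reports success.
inline bool enable_utf8_console() {
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;  // assumption: the terminal already expects UTF-8
#endif
}

// Hypothetical helper: encode one code point as UTF-8.
// Assumes cp is a valid scalar value (<= 0x10FFFF and not a surrogate).
inline std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

A caller would run `enable_utf8_console()` once at startup and then write `encode_utf8(c)` for each code point to cout or a file opened in binary mode.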

jingkaimori commented 1 year ago

Unicode support in the standard library is deprecated: every way to initialize codecvt is deprecated as of C++20 (std::wstring_convert and the <codecvt> header have been deprecated since C++17). We must use a third-party library to support Unicode.

jingkaimori commented 1 year ago

@PikachuHy so which library should we use, icu4c or utf8proc?

PikachuHy commented 1 year ago

I have no idea now. What's the difference between icu4c and utf8proc?

jingkaimori commented 1 year ago

@PikachuHy Generally, icu4c covers more of Unicode than the other libraries, but takes more space.

| Feature | utf8proc | icu4c | iconv | nowide |
| --- | --- | --- | --- | --- |
| Encode and decode UTF-8 strings | ✔ | ✔ | ✔ | ✔ |
| Iterate through UTF-8 code points | ✔ | ✔ | ❌ | ❌ |
| General categories | ✔ | ✔ | ❌ | ❌ |
| Bidirectional categories | ✔ | ✔ | ❌ | ❌ |
| Other categories | ❌ | ✔ | ❌ | ❌ |
| Decomposition type | ✔ | ✔ | ❌ | ❌ |
| Boundary detection (TR29) | ✔ | ✔ | ❌ | ❌ |
| Unicode normalization forms (TR15) | ✔ | ✔ | ❌ | ❌ |
| Bidirectional algorithm (TR9) | ❌ | ✔ | ❌ | ❌ |
| Case mapping | ✔ | ✔ | ❌ | ❌ |
| Codepage detection | ❌ | ✔ | ❌ | ❌ |
| Codepage conversion | ❌ | ✔ | ✔ | ❌ |
| Locales, formatting, IDNA, etc. | ❌ | ✔ | ❌ | ❌ |
| Transliteration | ❌ | ✔ | ❓ | ❌ |
| Collation, searching and regexp (alphabetic order based) | ❌ | ✔ | ❌ | ❌ |
| IOStream (stdio) | ❌ | ✔ | ❌ | ✔ |
| System commands, command-line arguments, environment variables | ❌ | ❌ | ❌ | ✔ |

PikachuHy / pscm