Krzmbrzl opened this issue 1 year ago
In case you are currently not using UTF-8, I would like to draw your attention to http://utf8everywhere.org/, where arguments for choosing UTF-8 instead of other encodings are presented.
@Krzmbrzl yes, we use UTF-8 "internally". The decision to use `std::wstring` was indeed made long ago, when I did not necessarily understand the subtleties of Unicode processing (we've been using `std::wstring` in MPQC since the early 2010s). Our use of strings in SeQuant is primarily for character-centric encoding of data structures; viewing a string as a sequence of characters should be as easy/familiar to non-specialists as possible, and wide characters make that natural. With `char`-based encoding, asking questions like "does this `Index` contain a single label or is it subscripted with a numerical index" becomes far more involved.

Clearly you are concerned about `std::wstring` leaking out of the API ... how much of a deal-breaker is this?
> Clearly you are concerned about `std::wstring` leaking out of the API ... how much of a deal-breaker is this?
I wouldn't exactly call it a deal-breaker. From my experience, it's just a little unusual to be required to use wide strings and wide characters for everything. Especially since in SeQuant most strings are tensor and index names, which generally tend to be rather short. Thus, the required additional `L` prefix in front of every literal has a relatively high overhead in terms of characters that need to be typed in order to get things done.
But this is really just a bit of an inconvenience and not a deal-breaker.
> Our use of strings in SeQuant is primarily for character-centric encoding of data structures; viewing a string as a sequence of characters should be as easy/familiar to non-specialists as possible, and wide characters make that natural.
Yes and no. Based on my research on wide character / wide string support in C++, the size of a wide character is not the same across platforms. E.g. on Windows it is only 2 bytes wide, which means that a BMP character can be represented in a single code unit (i.e. one wide character), but any non-BMP character will still use multiple code units (a surrogate pair) per character.
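This is easy to check directly (a minimal sketch; the value depends on the platform and compiler flags):

```cpp
#include <cstddef>

// The width of wchar_t is implementation-defined: typically 2 bytes on
// Windows (UTF-16 code units) and 4 bytes on Linux/macOS (UTF-32 code units).
constexpr std::size_t wide_char_bytes = sizeof(wchar_t);
```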
> With `char`-based encoding, asking questions like "does this `Index` contain a single label or is it subscripted with a numerical index" becomes far more involved.
As long as the things you are looking for are ASCII characters, I think it should work just fine, shouldn't it? Afaik, UTF-8 guarantees that every byte that looks like an ASCII character really is an ASCII character: all bytes of a multi-byte sequence have their high bit set. So one shouldn't have to worry about accidentally matching a byte that looks like the searched-for ASCII character but is really part of a multi-byte encoding.
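For example, searching a UTF-8 encoded label for an ASCII delimiter with plain `std::string` operations is safe (a sketch; `subscript_pos` and the example labels are made up for illustration):

```cpp
#include <cstddef>
#include <string>

// Every byte of a multi-byte UTF-8 sequence is >= 0x80, so find() on an
// ASCII character can never match in the middle of a multi-byte character.
std::size_t subscript_pos(const std::string& utf8_label) {
  return utf8_label.find('_');
}
```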
> yes, we use UTF-8 "internally".
Side note: a quick test (https://godbolt.org/z/TbnGb3Gx6) seems to indicate that as soon as one is using wide strings / wide characters, one is no longer using UTF-8, but rather UTF-16 or UTF-32 (see above).
See also https://www.boost.org/doc/libs/1_82_0/libs/locale/doc/html/recommendations_and_myths.html
Overall, I tend to think that in order to deal with all facets of Unicode properly (on all platforms), use of a dedicated Unicode library makes things easier. That'd be ICU or Boost.Locale. It would require doing string processing that depends on how many bytes a character is encoded as via library functions instead of functions from the C++ standard library, but at least you can be pretty certain that those functions will always work.
@Krzmbrzl thanks for making me revisit the issue.
Yes, unfortunately wide strings everywhere are unusual/uncomfortable. Despite the drawbacks, I still think using `std::wstring` as the main vehicle makes it easier for novices to think in characters and never have to learn about code units. However, most of the character-centric code resides in `Index`, so perhaps we can transition to `std::string` by allowing its API to support both narrow and wide chars.

Indeed (and I was not aware of this), the default wide character encoding is platform- and compiler-option dependent: https://godbolt.org/z/eY1TocTPf

In my opinion, ASCII chars are not enough: Greek characters at a minimum are needed, and many others (subscripts) are also very useful for making the code resemble math as far as possible. Going beyond the BMP, though, is not as relevant.
> never have to learn about code units
Until things break, because someone used a non-BMP character :) I would argue that using an actual Unicode library for actual string processing (e.g. searching in strings, using regexes on them, etc.), as opposed to only passing strings along, comparing them for identity, or hashing them, gives the same benefit of thinking in characters while being safe that this is indeed justified (and won't cause (subtle) bugs with some characters). But ultimately, it's your decision of course.
> However, most of the character-centric code resides in `Index`, so perhaps we can transition to `std::string` by allowing its API to support both narrow and wide chars.
Index labels and tensor names are the two instances in which I have mostly encountered the wstring API so far. So yeah, providing overloads for these APIs to also accept regular strings would mitigate the effect somewhat.
However, things like `to_latex` also return a `wstring`, which has to be fed to `wcout` instead of `cout`. So I am wondering whether consistency might be better than convenience for some parts of the API. 'Cause right now, one only has to keep in mind that SeQuant works with `wstring`s, whereas if parts of the API worked with regular strings, one would have to remember which do and which don't.

Then again, the rule could simply be: public API that takes a string as input accepts either, but all strings you get as output will be `wstring`s :thinking:
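That rule could be sketched roughly like this (purely hypothetical code, not SeQuant's actual API; note that the narrow overload below is only correct for ASCII input, since a real UTF-8-to-wide conversion would need a Unicode library):

```cpp
#include <string>
#include <string_view>
#include <utility>

// Hypothetical sketch: a class whose constructors accept both narrow and
// wide labels, but which stores (and returns) only wide strings.
class Index {
 public:
  explicit Index(std::wstring label) : label_(std::move(label)) {}

  // Narrow-string convenience overload. NOTE: this char-by-char widening is
  // only valid for ASCII; real UTF-8 input needs proper conversion (e.g. ICU).
  explicit Index(std::string_view label) : label_(label.begin(), label.end()) {}

  // Output is always wide, per the proposed rule.
  const std::wstring& label() const { return label_; }

 private:
  std::wstring label_;
};
```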
I would highly recommend sticking to a single string encoding in your code. UTF-8 everywhere is probably the easiest way to go. Converting between encodings gets very complicated, to say the least. The standard C++ library string conversion implementations are a bit buggy from what I have read (though I can't find the reference anymore).
Complications arise in that there are multiple ways some characters can be encoded. This also causes problems for string comparison as two equivalent characters can have different byte patterns. Using wide characters does not avoid this problem either.
To understand this better, here is a brief explanation of Unicode character encoding. A string is a sequence of code units (bytes, in UTF-8); one or more code units encode a code point; and one or more code points form a grapheme cluster, i.e. what a user perceives as a single character. The most common cases where you will see grapheme clusters with more than one code point are characters with accents or other marks, and emoji. This hierarchy makes it difficult to find character boundaries.
The other half of the difficulty is that a single character can have multiple encodings. The order of the code points in a character can differ (typically for characters with multiple accents or marks), or the number of code points can differ (typically a letter with a single accent or mark can be encoded with 1 or 2 code points). Which encoding is used depends on specific implementations, the OS, and other factors.
Similarly, conversion between encodings can be implemented in multiple ways due to the fact that some characters have multiple valid encodings.
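As a concrete illustration of the comparison problem: "é" can be encoded either as the precomposed code point U+00E9 or as "e" followed by the combining acute accent U+0301, and a plain byte-wise `std::string` comparison treats the two spellings as different (a minimal sketch, with the UTF-8 bytes spelled out explicitly):

```cpp
#include <string>

// "é" precomposed: U+00E9, encoded in UTF-8 as 0xC3 0xA9
const std::string precomposed = "\xC3\xA9";
// "é" decomposed: 'e' followed by U+0301 (combining acute): 0x65 0xCC 0x81
const std::string decomposed = "e\xCC\x81";
// Byte-wise comparison sees two different strings even though both render
// as "é"; making them compare equal requires Unicode normalization (e.g. ICU).
```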
Using wide characters can simplify iterating over single characters, assuming all the characters you want to represent are composed of a single code unit. But it does not address the issue of grapheme clusters that consist of multiple code points, nor issues around conversion.
If you want to be very robust, I recommend using ICU character break iterator to split string on grapheme clusters. https://unicode-org.github.io/icu/userguide/boundaryanalysis/ Then use ICU character comparison and/or normalization.
If you want to assume everything is a single code point (assuming you are not using accents or marks on characters), then simply splitting on code points may be sufficient. You can write a relatively short function with ICU to do this.
```cpp
#include <string_view>

#include <unicode/uiter.h>  // UCharIterator, uiter_setUTF8, uiter_next32

// Use ICU to iterate over each code point in a UTF-8 string, and call a
// callback for each one.
template <typename callbackT>
void foreach_char(std::string_view str, callbackT callback) {
  UCharIterator iter;
  uiter_setUTF8(&iter, str.data(), static_cast<int32_t>(str.size()));
  UChar32 c;
  // uiter_next32 returns U_SENTINEL (-1) once the end is reached
  while ((c = uiter_next32(&iter)) >= 0) {
    callback(c);
  }
}
```
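If pulling in ICU is not an option, a standard-library-only sketch that walks the code points of well-formed UTF-8 might look like this (illustrative only, no validation or error handling; `code_points` is a made-up name):

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Decode the code points of a well-formed UTF-8 string. No validation is
// done; production code should use ICU or another Unicode library instead.
std::vector<std::uint32_t> code_points(std::string_view s) {
  std::vector<std::uint32_t> out;
  for (std::size_t i = 0; i < s.size();) {
    const unsigned char b = static_cast<unsigned char>(s[i]);
    // Sequence length from the UTF-8 lead byte pattern (0xxxxxxx, 110xxxxx,
    // 1110xxxx, 11110xxx), and the matching payload mask for that lead byte.
    const std::size_t len =
        b < 0x80 ? 1 : (b >> 5) == 0x6 ? 2 : (b >> 4) == 0xE ? 3 : 4;
    static constexpr unsigned char mask[] = {0, 0x7F, 0x1F, 0x0F, 0x07};
    std::uint32_t cp = b & mask[len];
    for (std::size_t j = 1; j < len; ++j)  // fold in the continuation bytes
      cp = (cp << 6) | (static_cast<unsigned char>(s[i + j]) & 0x3F);
    out.push_back(cp);
    i += len;
  }
  return out;
}
```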
Another, seemingly lightweight option for an external UTF-8 string library is https://github.com/sheredom/utf8.h
From e.g. https://github.com/ValeevGroup/SeQuant/blob/1dd6257ba9e5b2e9aa2b71eb38768f8a8a1699c7/examples/srcc/srcc.cpp#L97-L98 I tend to infer that SeQuant wants to use UTF-8. However, in the code you chose to use wide strings (`std::wstring`), which would instead indicate that you really want to use UTF-16 or UTF-32. This opens up two questions for me: