ValeevGroup / SeQuant

SeQuant: Symbolic Algebra of Tensors over Operators and Scalars
GNU General Public License v3.0

Internal encoding and string type #96

Open Krzmbrzl opened 1 year ago

Krzmbrzl commented 1 year ago

From e.g. https://github.com/ValeevGroup/SeQuant/blob/1dd6257ba9e5b2e9aa2b71eb38768f8a8a1699c7/examples/srcc/srcc.cpp#L97-L98 I would infer that SeQuant wants to use UTF-8. However, in the code you chose to use wide strings (std::wstring), which would instead indicate that you really want to use UTF-16 or UTF-32.

This opens up two questions for me:

  1. What encoding is SeQuant using internally? UTF-8 or UTF-16/UTF-32?
  2. If you're indeed using UTF-8: why are you using wide strings to represent it? Normally, one would use regular strings for UTF-8 (which would also eliminate the need to prefix all string literals with L).
Krzmbrzl commented 1 year ago

In case you are currently not using UTF-8, I would like to draw your attention to http://utf8everywhere.org/, where arguments for choosing UTF-8 instead of other encodings are presented.

evaleev commented 1 year ago

@Krzmbrzl yes, we use UTF-8 "internally". The decision to use std::wstring was indeed made long ago, when I did not necessarily understand the subtleties of Unicode processing (we've been using std::wstring in MPQC since the early 2010s). Our use of strings in SeQuant is primarily for character-centric encoding of data structures; viewing the string as a sequence of characters should be as easy/familiar to non-specialists as possible; wide characters make it natural. With char-based encoding, asking questions like "does this Index contain a single label, or is it subscripted with a numerical index?" becomes far more involved.

Clearly you are concerned about std::wstring leaking out of the API ... how much of a deal-breaker is this?

Krzmbrzl commented 1 year ago

Clearly you are concerned about std::wstring leaking out of the API ... how much of a deal-breaker is this?

I wouldn't exactly call it a deal-breaker. From my experience, it's just a little unusual to be required to use wide strings and wide characters for everything. Especially since most strings in SeQuant are tensor and index names, which tend to be rather short, the additional L prefix required in front of every literal adds a relatively high overhead in terms of characters that need to be typed to get things done. But this is really just a bit of an inconvenience and not a deal-breaker.

Our use of strings in SeQuant is primarily for character-centric encoding of data structures; viewing the string as a sequence of characters should be as easy/familiar to non-specialists as possible; wide characters make it natural.

Yes and no. Based on my research on wide character / wide string support in C++, the size of a wide character is not the same across platforms. E.g. on Windows it is only 2 bytes wide, which means that BMP characters can be represented in a single code unit (i.e. one wide character), but any non-BMP character will still require multiple code units (a surrogate pair) per character.
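
For reference, a minimal check of this platform difference (the exact sizes depend on the compiler and its options):

#include <iostream>

int main() {
    // Typically 2 on Windows (UTF-16 code units, surrogate pairs for non-BMP
    // characters) and 4 on Linux/macOS (UTF-32 code units).
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';
}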

With char-based encoding, asking questions like "does this Index contain a single label, or is it subscripted with a numerical index?" becomes far more involved.

As long as the things you are looking for are ASCII characters, I think it should work just fine, shouldn't it? Afaik, UTF-8 guarantees that any code unit (byte) in the ASCII range really is that ASCII character and never part of a multi-byte sequence. So one shouldn't have to worry about accidentally matching a byte that looks like the searched-for ASCII character but is really part of a longer encoding.
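
A minimal illustration of this property (the label is just an example; byte escapes are used so the snippet does not depend on the source encoding):

#include <cassert>
#include <string>

int main() {
    // "\xCE\xB1" is the UTF-8 encoding of U+03B1 (GREEK SMALL LETTER ALPHA),
    // so this label is alpha followed by an ASCII subscript marker and digit.
    std::string label = "\xCE\xB1_1";

    // In UTF-8, bytes below 0x80 never occur inside a multi-byte sequence,
    // so searching for the ASCII '_' with plain std::string machinery is safe.
    auto pos = label.find('_');
    assert(pos == 2);                      // right after the two-byte alpha
    assert(label.substr(pos + 1) == "1");
}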

yes, we use UTF-8 "internally".

Side-note: A quick test (https://godbolt.org/z/TbnGb3Gx6) seems to indicate that as soon as one uses wide strings / wide characters, one is no longer using UTF-8, but rather UTF-16 or UTF-32 (see above).

See also https://www.boost.org/doc/libs/1_82_0/libs/locale/doc/html/recommendations_and_myths.html


Overall, I tend to think that in order to deal properly with all facets of Unicode (on all platforms), using a dedicated Unicode library makes things easier. That'd be ICU or Boost.Locale. Any string processing that depends on how many bytes a character occupies would then have to go through library functions instead of the C++ standard library, but at least you can be pretty certain that those functions will always work.

evaleev commented 1 year ago

@Krzmbrzl thanks for making me revisit the issue.

Yes, unfortunately wide strings everywhere are unusual/uncomfortable. Despite the drawbacks, I still think using std::wstring as the main vehicle makes it easier for novices to think in characters and never have to learn about code units. However, most of the character-centric code resides in Index, so perhaps we can transition to std::string by allowing its API to support both narrow and wide chars.

Indeed (and I was not aware of this), the default wide character encoding is platform- and compiler-option dependent: https://godbolt.org/z/eY1TocTPf

In my opinion ASCII chars are not enough: Greek characters at a minimum are needed, and many others (subscripts) are also very useful for making the code resemble math as closely as possible. Going beyond the BMP, though, is not as relevant.

Krzmbrzl commented 1 year ago

never have to learn about code units

Until things break, because someone used a non-BMP character :) I would argue that using an actual Unicode library for the actual string processing (e.g. searching in strings, running regexes on them, etc.), as opposed to merely passing strings along or comparing them for identity (or hashing them), gives the same benefit of thinking in characters while ensuring that this is indeed justified (and won't cause (subtle) bugs with some characters). But ultimately, it's your decision of course.

However, most of character-centric code resides in Index, so perhaps we can transition to std::string by allowing its API to support both narrow and wide chars.

Index labels and tensor names are the two instances in which I have mostly encountered the wstring API so far. So yeah, providing overloads for these APIs to also accept regular strings would mitigate the effect somewhat. However, things like to_latex also return a wstring, which has to be fed to wcout instead of cout. So I am wondering whether consistency might be better than convenience for some parts of the API, 'cause right now one only has to keep in mind that SeQuant works with wstrings, whereas if parts of the API worked with regular strings, one would have to remember which do and which don't. Then again, the rule could simply be: public API that takes a string as input accepts either, but all strings you get as output will be wstrings :thinking:
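
A minimal sketch of the overload pattern discussed here (Index and to_wide are illustrative names, not SeQuant's actual API; the narrow overload only handles ASCII input for brevity):

#include <cassert>
#include <string>
#include <string_view>
#include <utility>

namespace sketch {

// Minimal narrow->wide conversion that assumes plain ASCII input;
// a real implementation would decode UTF-8 properly.
inline std::wstring to_wide(std::string_view utf8) {
    std::wstring result;
    result.reserve(utf8.size());
    for (char c : utf8) {
        assert(static_cast<unsigned char>(c) < 0x80 && "ASCII-only sketch");
        result.push_back(static_cast<wchar_t>(c));
    }
    return result;
}

class Index {
public:
    explicit Index(std::wstring label) : label_(std::move(label)) {}
    // Narrow-string convenience overload: converts and forwards.
    explicit Index(std::string_view label) : Index(to_wide(label)) {}

    const std::wstring& label() const { return label_; }

private:
    std::wstring label_;
};

}  // namespace sketch

int main() {
    sketch::Index i1(L"i_1");  // wide literal, as today
    sketch::Index i2("i_1");   // narrow literal, no L prefix needed
    assert(i1.label() == i2.label());
}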

justusc commented 1 year ago

I would highly recommend sticking to a single string encoding in your code. UTF-8 encoding everywhere is probably the easiest way to go. Converting between encodings gets very complicated, to say the least. The standard C++ library string conversion implementations are a bit buggy from what I have read (though I can't find the reference anymore).

Complications arise because there are multiple ways some characters can be encoded. This also causes problems for string comparison, as two equivalent characters can have different byte patterns. Using wide characters does not avoid this problem either.

To understand this better, here is a brief explanation of Unicode character encoding.

  1. What you think of as a single character is actually a grapheme cluster.
  2. A grapheme cluster can be composed of 1 or more code points.
  3. A code point is an integer value that can be encoded in multiple ways (UTF-8, UTF-16, UTF-32, etc.)
  4. A code point is encoded with 1 or more code units. UTF-8 code units are 1 byte, UTF-16 code units are 2 bytes, and UTF-32 code units are 4 bytes. There are many other encodings but these are common.

The most common cases where you will see grapheme clusters with more than 1 code point are characters with accents or other marks, and emoji. This hierarchy makes it difficult to find character boundaries.

The other half of the difficulty is that a single character can have multiple valid encodings: the order of the code points in a character can differ (typically for characters with multiple accents or marks), or the character can be encoded with a different number of code points (typically a letter with a single accent or mark can be encoded with either 1 or 2 code points). Which encoding is used depends on the specific implementation, OS, and other factors.
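
A small demonstration of the comparison problem (byte escapes are used so the example does not depend on the source encoding):

#include <cassert>
#include <string>

int main() {
    // Both strings render as "é", but they are encoded differently:
    //   precomposed: U+00E9                            -> UTF-8 bytes C3 A9
    //   decomposed:  U+0065 'e' + U+0301 (comb. acute) -> UTF-8 bytes 65 CC 81
    std::string precomposed = "\xC3\xA9";
    std::string decomposed = "e\xCC\x81";

    // Byte-wise comparison treats them as different strings, which is why
    // normalization (e.g. via ICU) is needed before comparing such input.
    assert(precomposed != decomposed);
    assert(precomposed.size() == 2 && decomposed.size() == 3);
}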

Similarly, conversion between encodings can be implemented in multiple ways, because some characters have multiple valid encodings.

Using wide characters can simplify iterating over single characters, assuming all the characters you want to represent are composed of a single code unit. But it does not address the issue of grapheme clusters that consist of multiple code points, nor issues around conversion.

If you want to be very robust, I recommend using ICU's character break iterator to split strings on grapheme clusters: https://unicode-org.github.io/icu/userguide/boundaryanalysis/ Then use ICU character comparison and/or normalization.
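
A minimal sketch of such grapheme-cluster splitting with ICU's BreakIterator (the sample label is illustrative):

#include <unicode/brkiter.h>
#include <unicode/locid.h>
#include <unicode/unistr.h>

#include <iostream>
#include <memory>
#include <string>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> it(
        icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(), status));
    if (U_FAILURE(status)) return 1;

    // "t" + U+0303 (combining tilde) + U+2081 (subscript one): 2 grapheme
    // clusters ("t̃" and "₁") even though there are 3 code points.
    icu::UnicodeString text = icu::UnicodeString::fromUTF8("t\xCC\x83\xE2\x82\x81");
    it->setText(text);

    for (int32_t start = it->first(), end = it->next();
         end != icu::BreakIterator::DONE; start = end, end = it->next()) {
        icu::UnicodeString cluster;
        text.extractBetween(start, end, cluster);
        std::string utf8;
        cluster.toUTF8String(utf8);
        std::cout << '[' << utf8 << "]\n";  // prints [t̃] then [₁]
    }
    return 0;
}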

If you want to assume everything is a single code point (assuming you are not using accents or marks on characters), then simply splitting on code points may be sufficient. You can write a relatively short function with ICU to do this.

#include <string_view>
#include <unicode/uiter.h>  // UCharIterator, uiter_setUTF8, uiter_next32

// Use ICU to iterate over each code point in a UTF-8 string and call a callback for each one.
template <typename callbackT>
void foreach_char(std::string_view str, callbackT callback) {
    UCharIterator iter;
    uiter_setUTF8(&iter, str.data(), static_cast<int32_t>(str.size()));
    UChar32 c;
    // uiter_next32 returns U_SENTINEL (-1) once the end of the string is reached.
    while ((c = uiter_next32(&iter)) >= 0) {
        callback(c);
    }
}
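
For illustration, a possible use of the function above (the Greek label is just an example):

#include <iostream>

int main() {
    int n = 0;
    foreach_char("\xCE\xB1_1", [&](UChar32) { ++n; });
    std::cout << n << '\n';  // prints 3: alpha, '_', '1'
}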
Krzmbrzl commented 1 year ago

Another, seemingly lightweight option for an external UTF-8 string library is https://github.com/sheredom/utf8.h