CoatiSoftware / Sourcetrail

Sourcetrail - free and open-source interactive source explorer
https://www.sourcetrail.com/
GNU General Public License v3.0
14.89k stars 1.4k forks source link

Discussion: standards for string, string_view usage in the app? #982

Open LouisStAmour opened 4 years ago

LouisStAmour commented 4 years ago

Partly prompted by https://abseil.io/tips/1 but also by http://utf8everywhere.org/ I'd like to spend a small bit of time reviewing the codebase for string usage, so that if #971 merges, we can consider adopting string_view and QStringView in more places.

So right now, we've a tendency to try and reduce the usage of QString and other Qt classes from parameters of utility wrappers, or any other parameters passed through the app.

I'm not here to suggest changing this, I can see why you'd want to reduce the usage of Qt classes from library code both to reduce dependencies on Qt but also to partition out GUI-related code from non-GUI code.

However, between the two, string_view offers a compelling reason to go through the app and change string method parameters in bulk to use comparable string views, while http://utf8everywhere.org/ makes the compelling argument that UTF-16 has as many problems as UTF-8 without most developers testing for such bugs, and UTF-8 is the standard expected everywhere but Windows and Qt APIs at this point. Well, not quite. .NET and Java also use UTF-16 internally, as apparently does JavaScript. (Sigh.)

If we choose UTF-8, the argument that std::string now means UTF-8 is compelling and we'd only have to worry about exceptions when dealing with third-party APIs.

If we choose UTF-16 we then have two problems. First, wstring is only a good match for UTF-16 in Windows, so there's an extra penalty to convert from wstring to QString on Linux. Internally QString is stored as unsigned short but treated as char16_t. See https://bugreports.qt.io/browse/QTBUG-67970?focusedCommentId=401306&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-401306 for relevant conversation. And second, we'll still need to talk over network APIs, write to files, and we'll pay a conversion cost there to UTF-8.

Finally, user codebases can themselves have different encodings, I imagine we have tests for these, but most source code I've ever seen is in UTF-8 format or another 8-bit format. If you don't use UTF-8, you're going to have problems: https://what.thedailywtf.com/topic/26243/git-hates-utf-16

Arguments have been made that if UTF-16 is properly implemented such that it handles code points correctly is identical to UTF-8, see https://softwareengineering.stackexchange.com/a/102206 and and https://softwareengineering.stackexchange.com/a/102224 and https://unicode.org/faq/utf_bom.html#utf16-11

Qt has multiple times in the last year discussed moving to UTF-8, some of the conversation can be found on their mailing list from January 2019. There they mention being unable to implement using char8_t, and that led me to find this post: https://dbj.org/c-char8_t-is-broken/ Note that later down the post has an error because it says JS uses UTF-8, which isn't strictly true. It's actually much more complicated: see https://mathiasbynens.be/notes/javascript-encoding and the comment in that page by Henri Sivonen.

Further discussion of the UTF-8 Everywhere article can be found at https://news.ycombinator.com/item?id=22867503 but what I haven't found yet is what benefits there are to using char8_t over std::string, though I suppose it's sort of an "opaque type" to use a u8string instead of a regular string to hold the same data. Like https://github.com/joboccara/NamedType or the slightly more odd https://brevzin.github.io/c++/2019/12/02/named-arguments/ (see also: https://abseil.io/tips/172) and https://accu.org/index.php/journals/2683 -- all of which are slightly different hacks around the same issues with C++. https://wiki.hsr.ch/PeterSommerlad/files/2019_Sommerlad_TypeSystem.pdf has a few other alternatives. Where am I going with this last part? Well, in trying to replace strings in the app, there's an argument that we could focus on getting the right structs or using some kind of opaque typing method as we go.

While we're at it, we should look to see where we can use https://abseil.io/tips/64 in strings too.

egraether commented 4 years ago

Our current convention is:

This has worked well for us since we adopted it. We also didn't notice any major performance impacts after switching from std::string to std::wstring. As far as I understand it, performance is only impacted when there are actually 'wide characters' in the data.

If you can find places where using std::string_view leads to measurable performance improvements, then feel free to open pull requests.