chromium / subspace

A concept-centered standard library for C++20, enabling safer and more reliable products and a more modern feel for C++ code. Also home of Subdoc, the code-documentation generator.
https://suslib.cc
Apache License 2.0

Strings and Chars #144

Open danakj opened 1 year ago

danakj commented 1 year ago

CStrings, OSStrings, Strings, string constants. Char (aka unicode codepoints).

There are a lot of possible ways to take this, design-space-wise.

What's important is that C++ strings full of not-utf8 have somewhere to go that won't crash. So probably String isn't utf8. Maybe UString or something.

Also, string constants should not need relocations. If we can't achieve that with a type, then they should stay as chars, but I think we can. Does string_view do it now? Via constexpr? consteval?

danakj commented 1 year ago

If you ever take the address of a constexpr variable, it introduces relocations at startup.

Thus using std::string_view for constexpr string literals, and then passing it as a const&, would always be bad.

We saw in https://danakj.github.io/2023/06/05/not-generating-constructors.html that it can be good to allow passing view types as references. To support that, we should have a separate StringLit type that's used for string literals. The StringView type (and String type) should be constructed from it receiving it by value, not by const&, so as to avoid causing relocations.
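A minimal sketch of what such a split might look like, assuming the premise above that passing by value avoids taking the literal's address through a const&. The names StringLit and StringView come from this comment, but every member shown here is hypothetical and not the Subspace API. A consteval constructor (C++20) could additionally force compile-time construction; constexpr is used here to keep the sketch simple.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical literal type: built from a char array such as "hello".
class StringLit {
 public:
  // N counts the trailing NUL, so the visible length is N - 1.
  template <std::size_t N>
  constexpr StringLit(const char (&lit)[N]) : data_(lit), len_(N - 1) {}

  constexpr const char* data() const { return data_; }
  constexpr std::size_t size() const { return len_; }

 private:
  const char* data_;
  std::size_t len_;
};

// Hypothetical view type. It receives StringLit by value, never by const&.
class StringView {
 public:
  constexpr StringView(StringLit lit) : view_(lit.data(), lit.size()) {}

  constexpr std::size_t size() const { return view_.size(); }
  constexpr std::string_view as_std() const { return view_; }

 private:
  std::string_view view_;
};

// The whole chain can be evaluated at compile time.
static_assert(StringView(StringLit("hello")).size() == 5, "length excludes NUL");
```

Note that a bare "hello" cannot convert straight to StringView in this sketch, since that would need two user-defined conversions; the caller names StringLit explicitly or StringView grows its own array constructor.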

danakj commented 1 year ago

Wrote more here: https://sunny.garden/@blinkygal/110940469036696888

Did you know there's no good way to receive a string parameter in C++ when you are working with C libs/apis that doesn't create string copies and heap allocations for no reason at all?

C libs need a char* with a NUL terminator.

If you receive a char*, then your caller can't give you a string subset, they have to copy into a new buffer and put a 0 on the end. There's a chance they can do this on the stack at least but if they use std::string to do it (probably they should) it means a heap allocation (let's ignore the SSO thing).

If you receive a string_view (the new modern thing!) you can receive any substring without a problem. But string_view erases the knowledge of whether the string was NUL terminated, and checking for it can be an OOB read and crash or do bad things. So you are obligated to always copy the string_view into another buffer and add a NUL at the end before passing it off to the C api.
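The forced copy described above can be sketched like this. The function name is hypothetical, and std::strlen stands in for whatever C API needs the NUL-terminated char*.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical wrapper around a C API that requires a NUL-terminated char*.
inline std::size_t call_c_api(std::string_view s) {
  // s.data()[s.size()] may lie outside the underlying buffer, so it cannot be
  // inspected to see whether a terminator is already there.
  // Copy unconditionally: this may heap-allocate, and std::string appends the NUL.
  std::string owned(s);
  assert(owned.c_str()[owned.size()] == '\0');
  return std::strlen(owned.c_str());  // stand-in for the real C call
}
```

Even when the caller's string_view happened to cover a whole std::string, the callee has no way to know that, so the copy happens every time.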

If you receive a const std::string&, and the caller had a std::string, you're now zero-copy. But if the caller had a string literal (like "hello") or a string_view, then a std::string will be constructed again and do a heap alloc/memcpy into it.

Ideally you'd be able to keep track of whether there's a NUL at the end, but string_view's size omits it (to match string, which omits it but is always NUL terminated). If you could tell, then you could do the copy only when needed.
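A rough sketch of the "keep track of it" idea, under the assumption that a view type may carry one extra bool. MaybeTerminatedStr and call_c_api_tracked are hypothetical names for illustration only, and std::strlen again stands in for the C API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical view that remembers whether a NUL follows its last character.
class MaybeTerminatedStr {
 public:
  // std::string guarantees data()[size()] == '\0'.
  MaybeTerminatedStr(const std::string& s) : view_(s), terminated_(true) {}

  // A char array such as a string literal. The last element is inside the
  // array, so checking it is always in bounds.
  template <std::size_t N>
  MaybeTerminatedStr(const char (&lit)[N])
      : view_(lit, lit[N - 1] == '\0' ? N - 1 : N),
        terminated_(lit[N - 1] == '\0') {}

  // A string_view carries no guarantee, so assume the terminator is missing.
  MaybeTerminatedStr(std::string_view s) : view_(s), terminated_(false) {}

  bool is_terminated() const { return terminated_; }
  std::string_view view() const { return view_; }

 private:
  std::string_view view_;
  bool terminated_;
};

// Copies only when the terminator is not known to be present.
inline std::size_t call_c_api_tracked(MaybeTerminatedStr s) {
  if (s.is_terminated()) {
    return std::strlen(s.view().data());  // zero-copy path
  }
  std::string owned(s.view());  // fallback: allocate and add the NUL
  return std::strlen(owned.c_str());
}
```

One parameter type then covers the std::string, literal, and string_view callers, and only the last of those pays for a copy.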

One option is to write 3 overloads for every function all through your stack. :blobcatreach:

danakj commented 1 year ago

Some current pros/cons and thoughts around string options.

Cons to making string types:

  1. Everything passes around std::string so making things convert and copy is bad
  2. Rust strings require UTF8, then it has OsStr which doesn't and CString which has a null term. C++ just puts them all in std::string
    • mcc made a nice comment on this on fedi, but you can't just put your std::strings into a Rust String, nor could you into a C++ version then, without panic risks.
  3. I don't want to rewrite a formatting library and fmtlib (what subspace interacts with) works all with std::strings (rightfully). Just a good example of the first point.

Pros to making string types

  1. The nice thing about them is the rich API, and that it interacts with slice/vec and iterators
  2. Having to use an ostringstream to build a string dynamically is quite awful. Parsing into strings is worse.

This is actually a big pain point of interop with Rust that people don't get to feel yet cuz we're not past the UB. But adding painful interop with C++ would be very strongly bad.

I don't have a good feeling that a non-NUL-terminated UTF-8 string is the right choice for any C++ code tho at this point. So I have been just using std::string so far, and I don't have a good feeling about what to do either way atm.

For https://github.com/chromium/subspace/issues/326 I am currently pondering over

danakj commented 1 year ago

@evmar pointed me to https://ziglang.org/documentation/master/#Sentinel-Terminated-Slices

Zig has a SentinelTerminatedSlice (with concise syntax) that can be constructed out of any other slice by choosing a range that has the terminator at the end. It could then report its size without the terminator but you could ensure it was there in the type system for C apis.

This idea can't be the "concept that matches all three" idea, as it needs to guarantee the null terminator. The other ideas I had so far track if there's a terminator.

Presumably such a type would convert to a regular Slice, and drop the terminator. Vec could grow .as_terminated_slice<Terminator>(). A string type could have .as_terminated_str() or whatever. But if we don't have our own string type or want to play with std::string I think it has to be a ctor on SentinelTerminatedSlice instead of methods on collections. :/
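To make the shape concrete, here is a loose C++ sketch of a sentinel-terminated string view. TerminatedStr and all of its members are hypothetical, not an existing Subspace or Zig API. As the paragraph above suggests for the std::string case, construction lives on the view type rather than on the collection.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical view whose type guarantees c_str()[size()] == '\0'.
class TerminatedStr {
 public:
  // std::string always stores a terminator after its contents.
  static TerminatedStr from(const std::string& s) {
    return TerminatedStr(s.data(), s.size());
  }

  // Choose [begin, end) inside a larger buffer. This succeeds only if the
  // element at `end` is inside the buffer and is the terminator.
  static std::optional<TerminatedStr> from_range(std::string_view buf,
                                                 std::size_t begin,
                                                 std::size_t end) {
    if (begin > end || end >= buf.size() || buf[end] != '\0') {
      return std::nullopt;
    }
    return TerminatedStr(buf.data() + begin, end - begin);
  }

  // A suffix still ends at the same NUL, so it needs no copy.
  TerminatedStr suffix(std::size_t pos) const {
    assert(pos <= len_);
    return TerminatedStr(data_ + pos, len_ - pos);
  }

  const char* c_str() const { return data_; }
  std::size_t size() const { return len_; }  // excludes the terminator

  // Converting to a plain view drops the guarantee from the type.
  std::string_view as_view() const { return {data_, len_}; }

 private:
  TerminatedStr(const char* d, std::size_t n) : data_(d), len_(n) {}

  const char* data_;
  std::size_t len_;
};
```

Only suffixes and ranges that already end at a NUL can be taken without copying, which is the restriction discussed next.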

Thinking about how it would fit into a large codebase. Functions that deal with C apis would receive SentinelTerminatedStr I suppose instead of:

However to be a substring, it would require that substring to have a NUL at the end of it still. If it was not there, then the caller has to construct a std::string or whatnot still. In this case there's no winning though, even if you receive a non-NUL-terminated substring then you have to malloc/memcpy out of it. So this lets the caller see the cost which is a strong plus and aligned with the principles for no hidden costs especially hidden mallocs.

This is nicer than the concept idea in that it has less codegen and thus less compile times, and can work on virtual or dylib ABI methods. Either way the parameter type has to be written as a different/new thing so it's not worse in that regard either.

danakj commented 1 year ago

I can imagine something like a static SliceWithTerminator::from(Slice) -> SliceWithTerminatorBuf, or maybe static SliceWithTerminatorBuf::from(Slice) -> SliceWithTerminatorBuf, and the same from span, string_view, etc. That gives a concise way to say that you need to add the terminator: do a malloc/memcpy, pass it along, and then free it, instead of writing std::string(unterminated).c_str(). This would also allow sus::into(unterminated) to produce the right thing.
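As a sketch of that owning-buffer idea, specialized to chars and std::string_view: TerminatedStrBuf is a hypothetical name, and nothing here is the actual Subspace design. The allocation is explicit at the call site and freed when the buffer goes out of scope.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string_view>
#include <utility>

// Hypothetical owning buffer: copies an unterminated range and appends a NUL.
class TerminatedStrBuf {
 public:
  static TerminatedStrBuf from(std::string_view unterminated) {
    const std::size_t n = unterminated.size();
    auto buf = std::make_unique<char[]>(n + 1);  // the visible allocation
    if (n != 0) {
      std::memcpy(buf.get(), unterminated.data(), n);
    }
    buf[n] = '\0';
    return TerminatedStrBuf(std::move(buf), n);
  }

  const char* c_str() const { return buf_.get(); }
  std::size_t size() const { return len_; }  // excludes the terminator

 private:
  TerminatedStrBuf(std::unique_ptr<char[]> b, std::size_t n)
      : buf_(std::move(b)), len_(n) {}

  std::unique_ptr<char[]> buf_;  // released when the buffer is destroyed
  std::size_t len_;
};
```

A caller holding a non-terminated substring would write something like TerminatedStrBuf::from(sub).c_str() at the call site, which keeps the malloc/memcpy in plain sight.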