Open danakj opened 1 year ago
If you ever take the address of a constexpr variable, it introduces relocations at startup. Thus using std::string_view for constexpr string literals, and then passing it as a const&, would always be bad.
We saw in https://danakj.github.io/2023/06/05/not-generating-constructors.html that it can be good to allow passing view types as references. To support that, we should have a separate StringLit type that's used for string literals. The StringView type (and String type) should be constructed from it, receiving it by value, not by const&, so as to avoid causing relocations.
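A minimal sketch of what that split could look like, assuming C++17. The names StringLit and StringView come from this issue, but every member, the constructor shape, and kGreeting are hypothetical and not the subspace API. Whether relocations actually go away depends on the toolchain and on how the constant is used; the sketch only shows the by-value shape.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical literal-only type: constructible from a char array, which in
// practice means a string literal. Assumes the array ends in a NUL.
class StringLit {
 public:
  template <std::size_t N>
  constexpr StringLit(const char (&lit)[N]) : data_(lit), size_(N - 1) {
    assert(lit[N - 1] == '\0');  // Rejects char arrays that are not terminated.
  }

  constexpr const char* data() const { return data_; }
  constexpr std::size_t size() const { return size_; }  // Excludes the NUL.

 private:
  const char* data_;
  std::size_t size_;
};

// Hypothetical view type. It receives the StringLit by value (a pointer and a
// length), so constructing it never needs the address of the constant itself.
class StringView {
 public:
  constexpr StringView(StringLit lit) : data_(lit.data()), size_(lit.size()) {}

  constexpr const char* data() const { return data_; }
  constexpr std::size_t size() const { return size_; }

 private:
  const char* data_;
  std::size_t size_;
};

// A string constant declared with the literal type.
constexpr StringLit kGreeting = "hello";
```

The view types themselves could still be passed by reference, as the linked post suggests; only the literal type is meant to always travel by value.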
Wrote more here: https://sunny.garden/@blinkygal/110940469036696888
Did you know that when you are working with C libs/apis, there's no good way to receive a string parameter in C++ that doesn't create string copies and heap allocations for no reason at all?
C libs need a char* with a NUL terminator.
If you receive a char*, then your caller can't give you a string subset, they have to copy into a new buffer and put a 0 on the end. There's a chance they can do this on the stack at least but if they use std::string to do it (probably they should) it means a heap allocation (let's ignore the SSO thing).
If you receive a string_view (the new modern thing!) you can receive any substring without a problem. But string_view erases the knowledge of whether the string was NUL terminated, and checking for it can be an OOB read and crash or do bad things. So you are obligated to always copy the string_view into another buffer and add a NUL at the end before passing it off to the C api.
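To make the forced copy concrete, here is a small sketch assuming C++17. c_api_length is a stand-in for any C function that wants a NUL-terminated char*, and call_with_view is a hypothetical wrapper, not code from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Stand-in for a C API that requires a NUL-terminated string.
inline std::size_t c_api_length(const char* s) { return std::strlen(s); }

// Hypothetical wrapper that receives a string_view.
inline std::size_t call_with_view(std::string_view sv) {
  // An embedded NUL would make the C API see a shorter string.
  assert(sv.find('\0') == std::string_view::npos);
  // sv.data()[sv.size()] may be past the end of the underlying buffer, so it
  // cannot be read to check for a terminator. The only safe option is to copy.
  std::string owned(sv);  // Copies, and may heap-allocate.
  return c_api_length(owned.c_str());
}
```

The copy happens even when the caller's data was already NUL-terminated, because the view has no way to say so.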
If you receive a const std::string&, and the caller had a std::string, you're now zero-copy. But if the caller had a string literal (like "hello") or a string_view, then a std::string will be constructed again and do a heap alloc/memcpy into it.
Ideally you'd be able to keep track of whether there's a NUL at the end, but string_view's size omits it (to match string, which omits it but is always NUL terminated). If you could tell, then you could do the copy only when needed.
One option is to write 3 overloads for every function all through your stack. :blobcatreach:
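One way to picture the "keep track of the terminator" idea without three overloads on every function is a view that carries a flag. This is only a sketch under assumptions: MaybeTerminatedView, its from() factories and call_c_api are all hypothetical names, and c_api_length again stands in for a real C API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical view that remembers whether chars.data()[chars.size()] is a
// readable NUL. It does not own the data, so the source must outlive it.
struct MaybeTerminatedView {
  std::string_view chars;
  bool terminated;

  // std::string always keeps a NUL after its contents.
  static MaybeTerminatedView from(const std::string& s) { return {s, true}; }

  // A string literal is a char array that ends in a NUL.
  template <std::size_t N>
  static MaybeTerminatedView from(const char (&lit)[N]) {
    return {std::string_view(lit, N - 1), true};
  }

  // An arbitrary view carries no guarantee.
  static MaybeTerminatedView from(std::string_view sv) { return {sv, false}; }
};

// Stand-in for a C API that requires a NUL-terminated string.
inline std::size_t c_api_length(const char* s) { return std::strlen(s); }

// Calls the C API and copies only when the terminator is not known to exist.
// `copied` reports which path was taken, purely for illustration.
inline std::size_t call_c_api(MaybeTerminatedView v, bool& copied) {
  if (v.terminated) {
    assert(v.chars.data() != nullptr);
    copied = false;
    return c_api_length(v.chars.data());
  }
  copied = true;
  std::string owned(v.chars);  // The copy and possible heap allocation.
  return c_api_length(owned.c_str());
}
```

The three overloads still exist, but only once, on the view type. The cost is a runtime branch at each C API boundary.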
Some current pros/cons and thoughts around string options.
Cons to making string types:
Pros to making string types
This is actually a big pain point of interop with Rust that people don't get to feel yet cuz we're not past the UB. But adding painful interop with C++ would be very strongly bad.
I don't have a good feeling that a non-NUL-terminated UTF-8 string is the right choice for any C++ code tho at this point. So I have been just using std::string so far, and I don't have a good feeling about what to do either way atm.
For https://github.com/chromium/subspace/issues/326 I am currently pondering over if constexpr a branch that avoids any mallocs or branches around mallocs.
@evmar pointed me to https://ziglang.org/documentation/master/#Sentinel-Terminated-Slices
Zig has a SentinelTerminatedSlice (with concise syntax) that can be constructed out of any other slice by choosing a range that has the terminator at the end. It could then report its size without the terminator but you could ensure it was there in the type system for C apis.
This idea can't be the "concept that matches all three" idea, as it needs to guarantee the null terminator. The other ideas I had so far track if there's a terminator.
Presumably such a type would convert to a regular Slice, and drop the terminator. Vec could grow .as_terminated_slice<Terminator>(). A string type could have .as_terminated_str() or whatever. But if we don't have our own string type or want to play with std::string I think it has to be a ctor on SentinelTerminatedSlice instead of methods on collections. :/
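A rough C++17 sketch of such a type, to show the shape rather than a real design. SentinelTerminatedSlice is the name used above; from_range, from_string, as_view and c_length are hypothetical, and std::basic_string_view stands in for a regular Slice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <optional>
#include <string>
#include <string_view>

template <class T, T Terminator>
class SentinelTerminatedSlice {
 public:
  // Selects [begin, end) out of a larger buffer. Succeeds only if the element
  // at index `end` exists in the buffer and equals Terminator.
  static std::optional<SentinelTerminatedSlice> from_range(
      const T* buf, std::size_t buf_len, std::size_t begin, std::size_t end) {
    if (begin > end || end >= buf_len || buf[end] != Terminator) {
      return std::nullopt;
    }
    return SentinelTerminatedSlice(buf + begin, end - begin);
  }

  // The "ctor on the slice" route for std::string, which guarantees that
  // data()[size()] is a zero value.
  static SentinelTerminatedSlice from_string(const std::basic_string<T>& s) {
    static_assert(Terminator == T(), "std::basic_string only has a 0 terminator");
    return SentinelTerminatedSlice(s.data(), s.size());
  }

  const T* data() const { return data_; }
  std::size_t size() const { return size_; }  // Excludes the terminator.

  // Converting to an ordinary view drops the terminator guarantee.
  std::basic_string_view<T> as_view() const { return {data_, size_}; }

 private:
  SentinelTerminatedSlice(const T* data, std::size_t size)
      : data_(data), size_(size) {
    assert(data_[size_] == Terminator);
  }

  const T* data_;
  std::size_t size_;
};

using SentinelTerminatedStr = SentinelTerminatedSlice<char, '\0'>;

// A function that wraps a C API can demand the terminator in its signature.
inline std::size_t c_length(SentinelTerminatedStr s) {
  return std::strlen(s.data());
}
```

A caller holding a substring that does not end at a NUL gets std::nullopt from from_range and has to make a terminated copy itself, which keeps the malloc visible at the call site.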
Thinking about how it would fit into a large codebase. Functions that deal with C apis would receive SentinelTerminatedStr I suppose instead of:
However to be a substring, it would require that substring to have a NUL at the end of it still. If it was not there, then the caller has to construct a std::string or whatnot still. In this case there's no winning though, even if you receive a non-NUL-terminated substring then you have to malloc/memcpy out of it. So this lets the caller see the cost which is a strong plus and aligned with the principles for no hidden costs especially hidden mallocs.
This is nicer than the concept idea in that it has less codegen and thus less compile times, and can work on virtual or dylib ABI methods. Either way the parameter type has to be written as a different/new thing so it's not worse in that regard either.
I can imagine something like a static SliceWithTerminator::from(Slice) -> SliceWithTerminatorBuf or maybe static SliceWithTerminatorBuf::from(Slice) -> SliceWithTerminatorBuf and the same from span and string_view etc, for a concise way to say that you need to add the terminator, do a malloc/memcpy pass it along and then free it, instead of writing std::string(unterminated).c_str(). This would also allow sus::into(unterminated) to produce the right thing.
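For illustration only, the owning "Buf" half could look something like this in C++17. StrWithTerminatorBuf is a hypothetical char-only stand-in for SliceWithTerminatorBuf, it takes a std::string_view where the real thing would take a Slice, and the sus::into integration is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>
#include <vector>

// Hypothetical owning buffer: copies an unterminated range and appends a NUL.
// The allocation is spelled out by the type name at the call site.
class StrWithTerminatorBuf {
 public:
  static StrWithTerminatorBuf from(std::string_view unterminated) {
    StrWithTerminatorBuf out;
    out.buf_.reserve(unterminated.size() + 1);  // The one allocation.
    out.buf_.assign(unterminated.begin(), unterminated.end());
    out.buf_.push_back('\0');
    return out;
  }

  // Valid for as long as this object is alive and has not been moved from.
  const char* c_str() const {
    assert(!buf_.empty());
    return buf_.data();
  }

  // Excludes the terminator.
  std::size_t size() const {
    assert(!buf_.empty());
    return buf_.size() - 1;
  }

 private:
  StrWithTerminatorBuf() = default;
  std::vector<char> buf_;
};

// Stand-in for a C API that requires a NUL-terminated string.
inline std::size_t c_api_length(const char* s) { return std::strlen(s); }
```

Usage would be along the lines of c_api_length(StrWithTerminatorBuf::from(unterminated).c_str()), with the temporary buffer freed at the end of the full expression.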
CStrings, OSStrings, Strings, string constants. Char (aka unicode codepoints).
There's a lot of possible ways to take this design space-wise.
What's important is that C++ strings full of not-utf8 have somewhere to go that won't crash. So probably String isn't utf8. Maybe UString or something.
Also string constants should not need relocation, so if we can't do that with a type then we can't and they should stay as chars but I think we can. string_view does it now? constexpr? consteval?