felixguendling / cista

Cista is a simple, high-performance, zero-copy C++ serialization & reflection library.
https://cista.rocks
MIT License

Support storing trailing \0 byte at the end of string #187

Closed khng300 closed 8 months ago

khng300 commented 1 year ago

Hi, not sure if I'm missing anything, but I recently discovered that cista::generic_string does not store the \0 byte at the end of a long string (or of a string that just hits the short_length_limit). As a workaround, I have drafted my own string type for the purpose.

Is there any plan to work this out? Or do we need to propose a new type?

(Not really related, just a side topic: what about supporting \0 within the content of a short string?)

felixguendling commented 1 year ago

Correct. Currently, cista::string and cista::string_view both behave like std::string_view in that they don't store the terminating \0 the way C-style strings do. The reason is that this terminating \0 is usually not something you want serialized into a compact binary buffer. A terminating \0 is not necessary if you know the exact length (which is the case in cista::string). The only reason you might want the terminating \0 would be compatibility with library code written in C. In all other cases, you do not want the overhead of storing/transmitting obviously redundant information (size + \0 terminator).

It is, however, not that hard to trick cista::string into storing your extra \0. One way would be to call the constructor that takes a char const* and a length. There, you can set the length to the length of the string including the terminating \0. You might want to create a wrapper around cista::string that uses this trick in a few more places. But I don't think it's necessary to create a completely new type for this purpose.

https://github.com/felixguendling/cista/blob/0a7a784a3e5b40bea8f5c696d957be195bfd510c/include/cista/containers/string.h#L341
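The wrapper idea described above could look roughly like the following sketch. To keep it self-contained, `sized_string` is a hypothetical stand-in for a length-carrying string such as cista::string (it is not cista's actual type), and `c_compatible_string` is an assumed name for the wrapper. The only point it illustrates is passing the length including the terminating \0 to a (char const*, length) constructor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a length-carrying string that, like the
// behavior described in this thread, stores exactly `len` bytes and
// never appends a terminator on its own. NOT cista's real type.
struct sized_string {
  sized_string(char const* s, std::size_t len) : bytes_(s, s + len) {}
  char const* data() const { return bytes_.data(); }
  std::size_t size() const { return bytes_.size(); }
  std::vector<char> bytes_;
};

// Hypothetical wrapper applying the trick: the stored length includes
// the trailing '\0', so the raw data can be handed to C APIs.
struct c_compatible_string {
  explicit c_compatible_string(char const* s)
      : str_(s, std::strlen(s) + 1) {}  // +1 keeps the terminator

  // Stored bytes end with '\0', so this is a valid C string.
  char const* c_str() const { return str_.data(); }

  // Logical length, excluding the stored terminator.
  std::size_t size() const { return str_.size() - 1; }

  sized_string str_;
};
```

The cost is one extra stored byte per string, which is exactly the redundancy mentioned above, so this only makes sense where C compatibility is actually needed.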

ChemistAion commented 1 year ago

Consider an idea of automatic "null-terminator with size" for small-string optimization (by Andrei Alexandrescu): https://youtu.be/kPR8h4-qZdk?t=410

With a little bit of "mixing", the same idea could also be applied to non-small strings (adding an extra 4/8 bytes at the end for size/null-terminator, as above).

This will help to use cista::string.data()/.begin() directly for const char* inputs, since now we have to go through conversion to std::string_view/std::string.c_str().
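For readers who have not seen the talk, here is a minimal sketch of the small-string part of that idea, as I understand it: the last byte of the inline buffer stores the remaining capacity, so when the string is full that byte is 0 and doubles as the null terminator. The class name `small_string`, the 16-byte layout, and the `assign` interface are assumptions for illustration only; this is not cista code and it leaves out the heap case entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical 16-byte inline string: up to 15 characters, plus one
// final byte holding "remaining capacity" (kMaxSize - size).
// When the string is full, remaining capacity is 0, which is also '\0'.
class small_string {
public:
  static constexpr std::size_t kMaxSize = 15;

  small_string() {
    buf_[0] = '\0';
    set_size(0);
  }

  // Returns false (leaving the string unchanged) if `s` does not fit
  // inline; a real type would switch to heap storage here.
  bool assign(char const* s) {
    std::size_t const len = std::strlen(s);
    if (len > kMaxSize) {
      return false;
    }
    std::memcpy(buf_, s, len);
    buf_[len] = '\0';  // for len == kMaxSize this is the size byte itself
    set_size(len);
    return true;
  }

  std::size_t size() const {
    return kMaxSize - static_cast<unsigned char>(buf_[kMaxSize]);
  }

  // Always null-terminated, with no byte spent on the terminator alone.
  char const* c_str() const { return buf_; }

private:
  void set_size(std::size_t n) {
    buf_[kMaxSize] = static_cast<char>(kMaxSize - n);
  }

  char buf_[kMaxSize + 1];
};
```

With this layout, `c_str()` can be passed directly to functions expecting `const char*`, and the full 15 characters remain usable inline.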

felixguendling commented 1 year ago

That technique makes sense. Currently, the cista::string does not have a capacity (only size). The idea is that for serialization, the capacity and size would always be the same, so there's no point in having an extra field. If you use the data structure as a replacement for std::string that's a different story.

Overall I think it makes sense to write a new generic class, that can work as a vector and a string with the "small-vector" or "small-string" optimization. Making this generic has the advantage that cista::vector would not need to allocate memory in case the data fits into its fields. Another advantage would be to be able to change CharT and have a cista::wstring. Doing Andrei's optimization would also be nice.
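To make the proposal a bit more concrete, a rough sketch of such a generic class might look like this. Everything here is hypothetical (the name `small_buffer`, the interface, and the use of std::vector as the heap fallback); it is not an existing or planned cista API. For simplicity it keeps the inline array and the heap vector side by side, whereas a real implementation would overlay them to save space and would use cista's own pointer types so it stays serializable.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical sketch: up to N elements live inline (no allocation);
// larger contents spill over to heap storage.
template <typename T, std::size_t N>
class small_buffer {
  static_assert(std::is_trivially_copyable<T>::value,
                "sketch only supports trivially copyable T");
  static_assert(std::is_default_constructible<T>::value,
                "sketch value-initializes the inline array");

public:
  void push_back(T const& v) {
    if (on_heap_) {
      heap_.push_back(v);
    } else if (size_ < N) {
      inline_[size_++] = v;  // fits inline, no allocation
    } else {
      // Inline storage is full: copy it to the heap, then append.
      heap_.assign(inline_.begin(), inline_.end());
      heap_.push_back(v);
      on_heap_ = true;
    }
  }

  std::size_t size() const { return on_heap_ ? heap_.size() : size_; }
  T const* data() const { return on_heap_ ? heap_.data() : inline_.data(); }
  T const& operator[](std::size_t i) const { return data()[i]; }
  bool is_inline() const { return !on_heap_; }

private:
  std::array<T, N> inline_{};
  std::size_t size_{0};
  std::vector<T> heap_;
  bool on_heap_{false};
};

// The same template covers string-like and vector-like uses,
// including a different CharT (all aliases are hypothetical names).
using inline_string = small_buffer<char, 16>;
using inline_wstring = small_buffer<wchar_t, 16>;
using inline_ints = small_buffer<int, 4>;
```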

However, currently I am busy with another project. So don't expect this to happen very soon (this also applies to the other issues you opened which would probably also benefit from this change).

ChemistAion commented 1 year ago

Thank you for your analysis, I appreciate your insights. In the meantime, I will try to propose a not-so-generic solution in the next few days, focusing specifically on non-heap strings.

ChemistAion commented 7 months ago

@khng300 Wow, superb work! I will be conducting comprehensive tests on my end throughout the weekend.