felixguendling / cista

Cista is a simple, high-performance, zero-copy C++ serialization & reflection library.
https://cista.rocks
MIT License

Discussion: Can STL containers actually be supported? #160

Closed AdelKS closed 1 year ago

AdelKS commented 1 year ago

Hello!

Cista is a very nice serialization library (the structural hashing and hashsums are definitely a must-have that others lack). One thing that may be limiting adoption is that one needs to switch to cista's homemade containers, which is probably a no-go for many projects.

I just thought that maybe most, if not all, STL containers can actually be supported :thinking:

Let's take std::vector<T> as an example. The basic idea is to do the following for serialization:

std::vector<T> -> cista::vector<T> -> serialized

Then, for deserialization, the opposite:

serialized -> cista::vector<T> -> std::vector<T>

Why would this work? To define the entire state of a std::vector<T>, we only need its data, and cista::vector<T> contains it all. This basic idea involves an extra copy for both serialization and deserialization, but for std::vector<T> we can avoid the copy at least for serialization, since we can serialize a std::vector directly as if it were a cista::vector in terms of binary layout.

Then, given that the current T* deserialize(std::vector<uint8_t> serialized) function returns a pointer T*, which can point to an address within the data held by serialized, an extra function, T deserialized(std::vector<uint8_t> serialized), can be added to handle STL containers, since a copy needs to happen anyway.
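To make the copying round trip concrete, here is a minimal, self-contained sketch. It does not use cista: `serialize_vec`, `deserialize_vec`, and the toy wire format (an 8-byte element count followed by the raw element bytes) are assumptions made purely for illustration, and the sketch only handles trivially copyable element types. The point it shows is that the deserializer returns a `std::vector<T>` by value, so the data is copied out of the buffer.

```cpp
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <type_traits>
#include <vector>

// Hypothetical helpers for illustration only; NOT part of cista's API.
// Toy format: 8-byte element count, then the raw bytes of the elements.
template <typename T>
std::vector<std::uint8_t> serialize_vec(std::vector<T> const& v) {
  static_assert(std::is_trivially_copyable_v<T>,
                "sketch handles flat element types only");
  std::uint64_t const n = v.size();
  std::vector<std::uint8_t> buf(sizeof(n) + n * sizeof(T));
  std::memcpy(buf.data(), &n, sizeof(n));
  if (n != 0) {
    std::memcpy(buf.data() + sizeof(n), v.data(), n * sizeof(T));
  }
  return buf;
}

// Returns the container by value: the elements are copied out of `buf`,
// so the result stays valid after `buf` is destroyed.
template <typename T>
std::vector<T> deserialize_vec(std::vector<std::uint8_t> const& buf) {
  static_assert(std::is_trivially_copyable_v<T>,
                "sketch handles flat element types only");
  static_assert(std::is_default_constructible_v<T>,
                "sketch pre-sizes the output vector");
  std::uint64_t n = 0;
  if (buf.size() < sizeof(n)) {
    throw std::runtime_error("buffer too small for element count");
  }
  std::memcpy(&n, buf.data(), sizeof(n));
  // Division avoids overflow when checking that at least n elements follow.
  if ((buf.size() - sizeof(n)) / sizeof(T) < n) {
    throw std::runtime_error("buffer too small for elements");
  }
  std::vector<T> out(n);
  if (n != 0) {
    std::memcpy(out.data(), buf.data() + sizeof(n), n * sizeof(T));
  }
  return out;
}
```

A call such as `deserialize_vec<int>(serialize_vec(v))` yields a vector equal to `v`. Element types that own memory (for example `std::string`) would need per-element handling, which this sketch deliberately leaves out.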

It may be a worthwhile tradeoff for some projects. I do not need this in my use case, just wanted to bring it up.

What do you think :thinking:

Adel

felixguendling commented 1 year ago

Hey Adel! Thank you for your kind words and your idea.

The intermediate step to convert to cista::vector<T> is not required. Here is an example of how std::vector<T> can be serialized and deserialized without this extra conversion.

This is also what other libraries do that require copying data. cereal is one example.

I think one of the biggest advantages of cista's serialization mechanism is that you can use data without deserializing at all, i.e. reinterpret_cast<T const*>(memory_mapped_ptr) enables you to use serialized data directly in many cases (no copies, no extra steps).
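A minimal sketch of this zero-copy idea, under stated assumptions: `Record`, `to_bytes`, and `view_record` are hypothetical names, a `std::vector<std::uint8_t>` stands in for a memory-mapped region, and the type is a flat, trivially copyable struct without pointers. It is not cista code. It only shows that the returned pointer refers into the buffer itself, so nothing is copied on the read side.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical flat record: no pointers, so its stored bytes are
// directly meaningful when read back on the same platform.
struct Record {
  std::uint32_t id;
  float score;
};
static_assert(std::is_trivially_copyable_v<Record>,
              "zero-copy viewing assumes a flat type");

// Write side: store the object representation in a byte buffer
// (the buffer stands in for a file that is later memory-mapped).
std::vector<std::uint8_t> to_bytes(Record const& r) {
  std::vector<std::uint8_t> buf(sizeof(Record));
  std::memcpy(buf.data(), &r, sizeof(Record));
  return buf;
}

// Read side: no copy, just a typed view into the existing bytes.
// Returns nullptr if the region is too small or misaligned.
Record const* view_record(std::uint8_t const* p, std::size_t size) {
  if (size < sizeof(Record)) {
    return nullptr;
  }
  if (reinterpret_cast<std::uintptr_t>(p) % alignof(Record) != 0) {
    return nullptr;
  }
  return reinterpret_cast<Record const*>(p);
}
```

The view is only valid while the underlying buffer is alive and unmodified, and it assumes the reader and writer agree on layout, alignment, and endianness. The C++ object-lifetime rules make this pattern delicate in strictly conforming code, which is one reason to leave it to a library in practice.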

If you start to add all the copying steps to the deserialization step, you trade off developer convenience vs. deserialization performance.

Since in my use cases deserialization performance is key to being able to develop with large datasets as well as very helpful in production, this is not something I am focused on.

I'm not against adding this to cista. However, it would definitely come with a big warning sign that this is not very high performance and that approaches like Cap'n Proto, FlatBuffers, etc. will probably perform better (because those are zero-copy and enable memory-mapped usage).

AdelKS commented 1 year ago

> The intermediate step to convert to cista::vector<T> is not required. Here is an example of how std::vector<T> can be serialized and deserialized without this extra conversion.

Oh, it's nice to see that std::vector is actually "unofficially" supported without conversion!

> I think one of the biggest advantages of cista's serialization mechanism is that you can use data without deserializing at all, i.e. reinterpret_cast<T const*>(memory_mapped_ptr) enables you to use serialized data directly in many cases (no copies, no extra steps).

I agree. Now imagine cista could also compete with cereal; that would be great! Users could then decide to trade convenience for speed if they wish, with a small change (at least as long as they don't use something that has a different interface in cista).

> Since in my use cases deserialization performance is key to being able to develop with large datasets as well as very helpful in production, this is not something I am focused on.

Very understandable.

> I'm not against adding this to cista. However, it would definitely come with a big warning sign that this is not very high performance and that approaches like Cap'n Proto, FlatBuffers, etc. will probably perform better (because those are zero-copy and enable memory-mapped usage).

Yeah, absolutely, but it would still be on par with cereal and Boost.Serialization in terms of speed, wouldn't it? For me, the header generation approach is definitely a blocker.

I can reach out if I ever need support for another STL container and see whether I can open a PR for it.

felixguendling commented 1 year ago

Yes, deserializing std:: data structures with cista should yield performance in the ballpark of Boost.Serialization or cereal.

Serialization of std::map<K, V>, etc. should work with the same principles as those used in the std::vector<T> example. Feel free to open a PR if you want to add those functionalities to cista.
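As a rough sketch of what "the same principles" could look like for std::map<K, V>, the following self-contained example writes an element count followed by the key/value pairs and rebuilds the map on the way back. It does not use cista: `serialize_map`, `deserialize_map`, and the toy format are assumptions for illustration, and only trivially copyable keys and values are handled.

```cpp
#include <cstdint>
#include <cstring>
#include <map>
#include <stdexcept>
#include <type_traits>
#include <vector>

// Hypothetical helpers for illustration only; NOT part of cista's API.
// Toy format: 8-byte entry count, then (key bytes, value bytes) per entry.
template <typename K, typename V>
std::vector<std::uint8_t> serialize_map(std::map<K, V> const& m) {
  static_assert(std::is_trivially_copyable_v<K> &&
                    std::is_trivially_copyable_v<V>,
                "sketch handles flat key/value types only");
  std::uint64_t const n = m.size();
  std::vector<std::uint8_t> buf(sizeof(n) + n * (sizeof(K) + sizeof(V)));
  std::uint8_t* out = buf.data();
  std::memcpy(out, &n, sizeof(n));
  out += sizeof(n);
  for (auto const& [k, v] : m) {  // iteration is in ascending key order
    std::memcpy(out, &k, sizeof(K));
    out += sizeof(K);
    std::memcpy(out, &v, sizeof(V));
    out += sizeof(V);
  }
  return buf;
}

// A node-based container cannot be viewed in place, so it is rebuilt
// entry by entry; this is the copying step discussed in the thread.
template <typename K, typename V>
std::map<K, V> deserialize_map(std::vector<std::uint8_t> const& buf) {
  static_assert(std::is_trivially_copyable_v<K> &&
                    std::is_trivially_copyable_v<V>,
                "sketch handles flat key/value types only");
  std::uint64_t n = 0;
  if (buf.size() < sizeof(n)) {
    throw std::runtime_error("buffer too small for entry count");
  }
  std::memcpy(&n, buf.data(), sizeof(n));
  if ((buf.size() - sizeof(n)) / (sizeof(K) + sizeof(V)) < n) {
    throw std::runtime_error("buffer too small for entries");
  }
  std::uint8_t const* in = buf.data() + sizeof(n);
  std::map<K, V> m;
  for (std::uint64_t i = 0; i < n; ++i) {
    K k{};
    V v{};
    std::memcpy(&k, in, sizeof(K));
    in += sizeof(K);
    std::memcpy(&v, in, sizeof(V));
    in += sizeof(V);
    // Entries were written in sorted order, so hinting at end() keeps
    // each insertion cheap.
    m.emplace_hint(m.end(), k, v);
  }
  return m;
}
```

Unlike the std::vector<T> case, the write side cannot copy one contiguous block here, because the map's nodes are scattered in memory; both directions walk the entries one by one.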