USCiLab / cereal

A C++11 library for serialization
BSD 3-Clause "New" or "Revised" License

VERY slow deserialization of large objects #413

Open bloodcarter opened 7 years ago

bloodcarter commented 7 years ago

I'm archiving this data:

    std::vector<std::shared_ptr<Lemma>> lemmas;
    std::map<std::string, std::vector<std::shared_ptr<Form>>> forms;

The lemmas vector and the forms map each hold ~100,000 entries. The problem is that deserializing them from the portable binary archive takes 30 seconds on my Core i5!

        std::ifstream is("dict.cereal", std::ios::binary);
        cereal::PortableBinaryInputArchive iarchive(is); // Create an input archive
        iarchive(lemmas, forms);

Is that normal or what?

AzothAmmo commented 7 years ago

What is the structure of Lemma and Form? If you can't post actual code, can you just describe what they are serializing (and also the sizeof)? Are you using polymorphism?

I'll try and see if I can reproduce this. Our binary serialization should be very fast.
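
For context, "using polymorphism" with cereal means the shared_ptrs hold base-class pointers and every derived type is registered with the library; that registration writes per-object type information into the archive and adds lookup work on every load. A minimal sketch of what that looks like (Lemma and InflectedLemma here are placeholders, not the reporter's real types):

    #include <cereal/archives/portable_binary.hpp>
    #include <cereal/types/base_class.hpp>
    #include <cereal/types/memory.hpp>      // shared_ptr support
    #include <cereal/types/polymorphic.hpp> // only needed for base-class pointers
    #include <cereal/types/string.hpp>

    #include <string>

    struct Lemma
    {
        std::string text;
        virtual ~Lemma() {}

        template <class Archive>
        void serialize( Archive & ar ) { ar( text ); }
    };

    struct InflectedLemma : Lemma
    {
        std::string suffix;

        template <class Archive>
        void serialize( Archive & ar )
        {
            ar( cereal::base_class<Lemma>( this ), suffix );
        }
    };

    // Each derived type stored through a std::shared_ptr<Lemma> must be registered.
    CEREAL_REGISTER_TYPE( InflectedLemma )

If the types are not polymorphic, none of this applies and each object costs only its plain members.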

temehi commented 7 years ago

I am also experiencing a similar problem. I have the following data to serialize:

    std::unordered_map <uint64_t, std::bitset<60> > my_map;

my_map contains about 8 billion elements, and the saved binary file is around 33 GB on disk. When I deserialize it using

    std::ifstream istrm("map.cereal", std::ios::binary);
    cereal::BinaryInputArchive iarchive(istrm);
    iarchive(my_map);

it takes about 2500 seconds. Isn't that a bit slow?

erichkeane commented 7 years ago

I would say that depends. Loading that much data into memory is going to be time consuming either way, and having to send that to swap is going to be quite time consuming.

Additionally, with that much data the unordered_map is going to be rehashing near-constantly as it grows. For that many elements keyed by a uint64_t, you are likely better off choosing a different data structure, depending on your distribution of keys (one possibility is sketched below).
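
One such alternative, as a sketch only (it assumes the map is built once and then only read): a sorted vector of key/value pairs queried with binary search, which avoids per-node allocation and rehashing entirely.

    #include <algorithm>
    #include <bitset>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Flat, sorted storage: one contiguous block instead of one node per element.
    typedef std::pair<uint64_t, std::bitset<60> > Entry;
    typedef std::vector<Entry> FlatMap;

    // Lookup by binary search; assumes the vector is kept sorted by key.
    inline const std::bitset<60> * find( const FlatMap & m, uint64_t key )
    {
        auto it = std::lower_bound( m.begin(), m.end(), key,
            []( const Entry & e, uint64_t k ) { return e.first < k; } );
        return ( it != m.end() && it->first == key ) ? &it->second : nullptr;
    }

cereal can serialize a std::vector of std::pair directly via <cereal/types/vector.hpp>, <cereal/types/utility.hpp>, and <cereal/types/bitset.hpp>, and loading it back is a single sequential pass with no rehashing.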

temehi commented 7 years ago

Thanks for your reply

... having to send that to swap is going to be quite time consuming.

There is no need to send anything to swap; for my particular problem I have enough memory, so that is not an issue.

One way to avoid the re-indexing/rehashing is to call reserve(size_type count) on the unordered_map object before loading. If I do that, the loading time goes down to ~1000 seconds.
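
A minimal sketch of that workaround, with the element count hard-coded just to mirror the figures in this thread (in practice it has to be known or stored separately). It relies on the observation that, with common standard-library implementations, clear() does not shrink the bucket array, so buckets reserved before the load survive it:

    #include <cereal/archives/binary.hpp>
    #include <cereal/types/bitset.hpp>
    #include <cereal/types/unordered_map.hpp>

    #include <bitset>
    #include <cstdint>
    #include <fstream>
    #include <unordered_map>

    int main()
    {
        std::unordered_map<uint64_t, std::bitset<60> > my_map;

        // Pre-size the bucket array so the inserts performed while loading
        // never trigger a rehash of billions of elements.
        my_map.reserve( 8000000000ULL );

        std::ifstream istrm( "map.cereal", std::ios::binary );
        cereal::BinaryInputArchive iarchive( istrm );
        iarchive( my_map );
    }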

erichkeane commented 7 years ago

Well, ifstream seems to do an additional copy internally as well, so you're copying the data at least twice. Perhaps consider using something like boost::iostreams::mapped_file (a sketch follows below); that'll probably save you another few hundred seconds.

Additionally, are you compiling with optimizations on? The cereal code is pretty template-heavy, so it benefits greatly from higher optimization levels, particularly from settings like -march=native (if that is acceptable).
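
A sketch of the memory-mapping suggestion, assuming Boost.Iostreams is available (cereal itself does not ship a memory-mapped archive): map the file and wrap the mapping in a stream that the archive can read from.

    #include <cereal/archives/binary.hpp>
    #include <cereal/types/bitset.hpp>
    #include <cereal/types/unordered_map.hpp>

    #include <boost/iostreams/device/mapped_file.hpp>
    #include <boost/iostreams/stream.hpp>

    #include <bitset>
    #include <cstdint>
    #include <unordered_map>

    int main()
    {
        std::unordered_map<uint64_t, std::bitset<60> > my_map;

        // Map the archive into memory and expose it as a std::istream, so the
        // archive reads the mapped pages directly instead of going through
        // ifstream's buffered copies.
        boost::iostreams::mapped_file_source file( "map.cereal" );
        boost::iostreams::stream<boost::iostreams::mapped_file_source> is( file );

        cereal::BinaryInputArchive iarchive( is );
        iarchive( my_map );
    }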

AzothAmmo commented 7 years ago

We can definitely add a call to reserve for unordered_map loads.

Rinkss commented 6 years ago

Serialization and deserialization of map<int, vector> is very slow. I am passing this as an object to the archives. It's taking around 3 seconds to deserialize a 113 MB file.

Rinkss commented 6 years ago

A similar problem arises when I try using map<int, string>.

Batodalaev commented 3 months ago

Can we add map.reserve(size); here: https://github.com/USCiLab/cereal/blob/master/include/cereal/types/concepts/pair_associative_container.hpp#L56 ?

Update: also set.reserve(size); here: https://github.com/USCiLab/cereal/blob/master/include/cereal/types/set.hpp#L58
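
Note that std::map and std::set themselves have no reserve() (only the unordered containers do), so the call would have to be conditional on the container actually providing one. One C++11 way to do that is a detection overload along these lines (a sketch only; the helper name and placement are made up, not cereal's actual code):

    #include <cstddef>

    namespace cereal_detail
    {
        // Calls c.reserve(n) when the container provides reserve()
        // (unordered_map, unordered_set, ...) ...
        template <class C>
        auto try_reserve( C & c, std::size_t n, int )
            -> decltype( c.reserve( n ), void() )
        {
            c.reserve( n );
        }

        // ... and does nothing for containers that don't (map, set, multimap, ...).
        template <class C>
        void try_reserve( C &, std::size_t, long ) {}
    }

    // In the linked load functions, right after the element count has been read:
    //     cereal_detail::try_reserve( map, static_cast<std::size_t>( size ), 0 );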