Closed · dadib closed this 6 years ago
It doesn't make sense for SparseIntVector to have ByteString indices and values. That's an implementation detail of the serialization/deserialization.
Not sure I agree with this. The ByteString says nothing about how to serialize it. The process of ser/deser for it is in fact "do nothing with these bytes". The byte string is useless outside of the context of the SparseIntVector class. It's more like a custom data structure.
1) It's ugly
yep
2) otherwise you'll have to "deserialize" (asInts) every time you need to use the vector.
Yes, but this is not exposed outside the class and the performance penalty is negligible since no deserialization or data copying is happening. It is being accessed just as an int[] would be.
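To make the "accessed just as an int[] would be" point concrete, here is a minimal sketch of the idea (hypothetical names, not the actual blayze code; it assumes the ByteStrings hold 4-byte big-endian ints, which is the default byte order for the ByteBuffer views below):

```kotlin
import com.google.protobuf.ByteString

// Hypothetical sketch: a sparse vector whose indices and values stay
// inside the protobuf ByteStrings and are read in place, so nothing
// is copied or boxed when the vector is used.
class SparseIntVector private constructor(
    indices: ByteString,
    values: ByteString
) {
    // asReadOnlyByteBuffer() wraps the existing bytes without copying;
    // asIntBuffer() reinterprets them as 4-byte big-endian ints.
    private val idx = indices.asReadOnlyByteBuffer().asIntBuffer()
    private val vals = values.asReadOnlyByteBuffer().asIntBuffer()

    val size: Int get() = idx.limit()

    fun index(i: Int): Int = idx.get(i)  // absolute read straight from the bytes
    fun value(i: Int): Int = vals.get(i)
}
```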
3) doing the deserialization from ByteString to int[] is already a huge memory improvement over the bytes -> List<Int> -> int[] path. Do we really need the extra memory gains?
Whether it is needed of course depends on your machine and model. The main benefit here is I think that the memory use characteristics become much better. You won't create a model that you then need twice the memory to load.
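For rough scale (standard 64-bit JVM figures with compressed oops; exact numbers vary by JVM and settings): a boxed Integer takes about 16 bytes plus a 4-byte reference in the list's backing array, so roughly 20 bytes per element, versus 4 bytes per element in an int[]. That makes the List<Int> -> int[] step alone roughly a 5x saving per stored parameter, before counting any duplication between the protobuf and blayze copies.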
I agree with the first point though: it is uglier. But the custom data structure was already introduced with the bytes field, so it is just as unsafe either way; this is purely a question of code legibility. The reason I did it like this is that I think trading off reduced local legibility in the SparseIntVector class for eliminating large memory spikes/gc is worth it.
Can you maybe summarize the numbers in a table or make a graph or something? It's quite hard to see what's changed.
It is in a tabular format, but it looks like GitHub strips redundant whitespace. I'll attach the files.
I meant that the only reason we use a ByteString and not an int[] is because of serialization concerns. It's pretty surprising to the average user to see a custom data structure for an int[], and to understand why it's there you have to understand the serialization/deserialization concern. I realize it's a private constructor, but still.
The s/op numbers look almost identical across all benchmarks, so this should not be a concern.
So, we agree it's uglier, and that it's probably more memory efficient, and that the tradeoff is ugliness/code legibility vs. memory bytes. So I think knowing how many memory bytes we save is an important part of the decision. Do you have any idea? Can we reason about it? Can we measure it?
I ran two benchmarks, and it seems reading from the ByteString directly is not as memory efficient as I expected. There seems to be little difference between using the ByteString directly and copying to int[].
The graphs below show the memory usage for our production models, in a benchmark where two of them are loaded a few seconds apart, followed by a gc.
ByteString model
copy to int[] model
I'm not sure exactly why the difference isn't bigger, but since reading from the ByteString directly isn't adding much, I'll update the PR with a version that just copies to int[].
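For reference, the copy-on-load version can be as small as this (a sketch with a made-up function name, assuming the same 4-byte big-endian int encoding):

```kotlin
import com.google.protobuf.ByteString

// Hypothetical sketch: decode a ByteString once at load time into a
// plain IntArray, so the live model holds only unboxed primitives.
fun decodeInts(bytes: ByteString): IntArray {
    val buf = bytes.asReadOnlyByteBuffer().asIntBuffer()
    val out = IntArray(buf.remaining())
    buf.get(out)  // single bulk copy out of the serialized bytes
    return out
}
```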
Oh, yeah. Good catch!
A problem when deserializing models is that parameters need to be kept in memory twice: once in the blayze classes and once in the generated protobuf classes. The generated protobuf classes take much more memory, since all the int parameters get boxed.
This changes the SparseIntVector to use bytes to encode the indices and values. It then accesses values out of those fields directly, so no boxed values are stored, and there is no duplication since the model just uses the protobuf-generated class.
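As an illustration of the encoding side (again a sketch, with encodeInts as a made-up name; 4 bytes per int, big-endian):

```kotlin
import com.google.protobuf.ByteString
import java.nio.ByteBuffer

// Hypothetical sketch: pack an IntArray into a ByteString for the
// serialized model, 4 bytes per int.
fun encodeInts(ints: IntArray): ByteString {
    val buf = ByteBuffer.allocate(ints.size * 4)
    buf.asIntBuffer().put(ints)      // writes into buf's backing array
    return ByteString.copyFrom(buf)  // copies position..limit, i.e. all the bytes
}
```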
The downside is that the serialized format is not as portable.
There is also some slowdown associated with the change in JMH benchmarks:
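To illustrate the kind of comparison those benchmarks make (a hypothetical harness, not the actual benchmark code; names and sizes are made up):

```kotlin
import com.google.protobuf.ByteString
import java.nio.ByteBuffer
import org.openjdk.jmh.annotations.Benchmark
import org.openjdk.jmh.annotations.Scope
import org.openjdk.jmh.annotations.State

// Hypothetical JMH harness: summing values through the ByteString-backed
// view versus summing a plain IntArray.
@State(Scope.Benchmark)
open class SparseIntVectorBench {
    private val ints = IntArray(100_000) { it }
    private val bytes: ByteString = ByteString.copyFrom(
        ByteBuffer.allocate(ints.size * 4).also { it.asIntBuffer().put(ints) }
    )
    private val view = bytes.asReadOnlyByteBuffer().asIntBuffer()

    @Benchmark
    fun sumViaByteString(): Long {
        var sum = 0L
        for (i in 0 until view.limit()) sum += view.get(i)
        return sum
    }

    @Benchmark
    fun sumViaIntArray(): Long {
        var sum = 0L
        for (v in ints) sum += v
        return sum
    }
}
```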