attic-labs / noms

The versioned, forkable, syncable database
Apache License 2.0
7.44k stars 267 forks source link

Change encoding to not use varints so aggressively #3674

Open arv opened 6 years ago

arv commented 6 years ago

We use varints in a few "internal" places in the encoding for Value.

One reason we did this was because we thought varints would compress better.

Problems with using var ints

Suggestion:

Use uint64

It might actually be fine to use uint32 or even uint16 since prolly trees are log of the number of elements in the total collection but with snappy those extra zeros should be cheap enough that we should not care.

arv commented 6 years ago

@rafael-atticlabs WDYT?

cmasone-attic commented 6 years ago

This actually cropped up in a profile @rafael-atticlabs and I were looking at yesterday, too. Your point about compression is well-taken, so...I'm probably in favor of this overall? One question...in your work to back Values directly with encoded chunk data, is that data compressed or not? I assume not. I think maybe we'd want to use something smaller than uint64 to reduce the memory hit while the Values are resident in memory

arv commented 6 years ago

A long time ago I was thinking to actually have multiple number encodings. We would use a different noms kind for numbers based on the actual value. We would pick the smallest one that we need...

I'm not convinced which is better though; always uintX or kind+num.

ghost commented 6 years ago

I'm not totally sold on ditching varints yet. At best uint32/64 will compress to 3 bytes. Most common values for our quantities compress to 1 byte -- and there are alot of them -- especially number of fields in this struct. I'd like to consider all of our options of reducing the decoding/encoding expense here. For example, I think we might be able to "fast forward" over varints with less expense than fully decoding them.