attic-labs / noms

The versioned, forkable, syncable database
Apache License 2.0

Compute hash as we decode #1615

Open arv opened 8 years ago

arv commented 8 years ago

With the binary encoding we can compute the hash of sub values as we read the data.

When we start reading a value we start a new hasher. As we read the data we feed the bytes both into the current hashers and into the decoder. When a value is complete we already have its hash and can set it on the object once and for all.

@rafael-atticlabs

arv commented 8 years ago

@cmasone-attic @kalman

aboodman commented 8 years ago

It will make reading slower, but it has the advantage that you don't need to reserialize to get the hash.

How does buzhash factor into this?

On Tue, May 24, 2016 at 10:56 AM, Erik Arvidsson notifications@github.com wrote:

@cmasone-attic https://github.com/cmasone-attic @kalman https://github.com/kalman


cmasone-attic commented 8 years ago

You're right that my PR relates to this...does it obviate the need for this? Are there cases that this would cover where we can't just pull the hash from the Chunk being decoded?

arv commented 8 years ago

It would add the hash for sub values too. The question is how often we need the hash for inlined values?

cmasone-attic commented 8 years ago

ah, my eyes are opened

ghost commented 8 years ago

I think we should prefer the other direction and not only lazily compute hashes, but lazily decode values. I.e. https://github.com/attic-labs/noms/issues/2270

There are places where caching the hash of a value will matter a lot, such as sorting non-scalar values for insertion into collections, but I think those should be handled specially.

@arv, can we close this?

arv commented 8 years ago

@rafael-atticlabs I don't see how lazily decoding values changes this? I can see that if we hold on to the chunk instead of creating the value and discarding it, computing the hash will be a lot cheaper since we do not need to encode the chunk again.