attic-labs / noms

The versioned, forkable, syncable database
Apache License 2.0

Avoid creating gigantic chunks with Lists of very long repeated values #3765

Closed · ghost closed this 6 years ago

ghost commented 6 years ago

Towards #3743

Includes a set of changes aimed at better handling of Lists that contain long subsequences of repeated values (e.g. 1.1 billion "yellow" strings).

It turns out a few things were happening here:

1) The sloppy lookup table only stored 16-bit offsets, so once the input byte stream crossed 64KB it stopped finding matches. Fix: the lookup table now stores 32-bit offsets (sketched after this list).

2) "yellow" repeated encodes as patterns of 8 bytes, this (by chance) drove the rolling hash to zero from which it never recovers. Fix: make the hash window a prime value (67 bytes).

3) There is already a maxCopy in sloppy, but it effectively leaves the rolling hash seeing long repeated 3-byte sequences (just repeats of the copy instruction). This reduces entropy further, and (again just by chance) the input sequence never chunks. Fix: add a maxChunkSize of 16MB (also in the sketch below).

4) Bump the noms version
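
For point 1, here is a minimal sketch of the kind of offset table involved. The names and layout are hypothetical, not sloppy's actual implementation; the point is just that 16-bit offsets silently truncate every position past 64KB, so candidates beyond that point can never match, while 32-bit offsets are good to 4GB of input:

```go
package sketch

const tableSize = 1 << 14

// matchTable maps a hash of a 4-byte window to the offset where that
// window was last seen. With uint16 offsets, every position past 64KB
// is truncated and lookups stop matching; uint32 fixes that.
type matchTable struct {
	offsets [tableSize]uint32
	seen    [tableSize]bool
}

// hash4 reduces a 4-byte window to a table index
// (Knuth-style multiplicative hash, illustrative only).
func hash4(b []byte) uint32 {
	u := uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
	return (u * 2654435761) >> (32 - 14)
}

// candidate returns the previous offset recorded for the window starting
// at pos, then records pos as the latest occurrence.
func (t *matchTable) candidate(buf []byte, pos int) (prev int, ok bool) {
	h := hash4(buf[pos : pos+4])
	prev, ok = int(t.offsets[h]), t.seen[h]
	t.offsets[h] = uint32(pos)
	t.seen[h] = true
	return prev, ok
}
```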

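For points 2 and 3, a toy chunker showing where the two parameters plug in. The hash here is a plain Rabin-Karp polynomial standing in for noms's real rolling hash (which is where the zero-collapse actually happened), and the 12-bit boundary mask is illustrative, not noms's tuning:

```go
package sketch

const (
	windowSize   = 67        // prime: an 8-byte pattern can't tile the window exactly
	maxChunkSize = 1 << 24   // 16MB: force a boundary even if the hash never fires
	boundaryMask = 1<<12 - 1 // ~4KB average chunks (illustrative)
)

// chunk returns the sizes of the chunks found in buf, using a rolling
// hash over the last windowSize bytes plus the hard cap from point 3.
func chunk(buf []byte) []int {
	const prime = 16777619
	pow := uint32(1) // prime^(windowSize-1), used to remove the outgoing byte
	for i := 1; i < windowSize; i++ {
		pow *= prime
	}

	var sizes []int
	var h uint32
	start := 0
	for i := 0; i < len(buf); i++ {
		if i-start >= windowSize {
			h -= uint32(buf[i-windowSize]) * pow // slide: drop the outgoing byte
		}
		h = h*prime + uint32(buf[i]) // slide: admit the incoming byte

		n := i - start + 1
		if (n >= windowSize && h&boundaryMask == boundaryMask) || n >= maxChunkSize {
			sizes = append(sizes, n)
			start, h = i+1, 0 // begin a fresh chunk with a reset hash
		}
	}
	if start < len(buf) {
		sizes = append(sizes, len(buf)-start) // trailing partial chunk
	}
	return sizes
}
```

On input like 1.1 billion "yellow"s, a rolling hash can get stuck in a state where the boundary test never fires; the maxChunkSize branch is what guarantees the stream still breaks up, at worst into 16MB chunks.
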
Note that this patch leaves in place the problem that if you inserted another "yellow" at the start of the 1.1 billion yellows, the boundary shift would cascade all the way to the end. I think the right solution to that has to be something like the repeat kind. For now, this will let us successfully import insanely large columns that contain a single repeated value.