borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

Extensible forward- and backward-compatible HashIndex #2757

Open enkore opened 7 years ago

enkore commented 7 years ago

Currently:

| 32 byte Key | ValueA | Value B | Value C |
| 32 byte Key | ValueA | Value B | Value C |
| 32 byte Key | ValueA | Value B | Value C |
| 32 byte Key | ValueA | Value B | Value C |
| 32 byte Key | ValueA | Value B | Value C |
| 32 byte Key | ValueA | Value B | Value C |

Problems:

(1) Can't add fields (e.g. csize to Repository) without breaking compatibility.
(2) The load factor has a large associated cost (key+value length is 40-44 bytes).
(3) Resizing moves a lot of data around and requires a lot of extra memory.
(4) Iteration moves over uninteresting data; e.g. summarize_chunks doesn't care about the key, yet wastes 70 % of its memory accesses on them.

Instead:

index array, length = num_buckets, indexed by hash % num_buckets.
indices = uint32_t, thus low load factors are basically free.

| index |
| index |
| index |
| index |
| index |
| index |

separate, compact arrays for each column/field

| key |
| key |
| key |
| key |
| key |

| ValueA |
| ValueA |
| ValueA |
| ValueA |
| ValueA |

...

All key/value arrays are coindexed by the index from the index array

The on-disk format has to record the number of value arrays and their geometry (length is implicit = num_entries; width must be noted).

Thus, an older version can load an index with unknown columns with no issues. When saving, it would discard the extra columns (or preserve them... but then a bit mask is needed to tell which columns were treated properly, e.g. updated).

This could technically be tacked onto an existing HashIndex, although the advantages would be fewer.
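The split layout described above can be sketched in a few lines of Python. This is a minimal illustration, not Borg's actual C implementation; all names (`SplitHashIndex`, `value_a`, the `EMPTY` sentinel) are made up for the example:

```python
# Illustrative sketch of the proposed split layout (not Borg's actual code).
# The bucket array holds small uint32 indices; keys and each value column
# live in separate, densely packed arrays co-indexed by those indices.

EMPTY = 0xFFFFFFFF  # sentinel for "no entry" (an illustrative choice)

class SplitHashIndex:
    def __init__(self, num_buckets=16):
        self.buckets = [EMPTY] * num_buckets  # only 4 bytes per bucket
        self.keys = []                        # 32-byte keys, packed
        self.value_a = []                     # one compact array per column
        self.value_b = []

    def _bucket(self, key):
        # hash % num_buckets; here the first 4 key bytes serve as the hash
        return int.from_bytes(key[:4], 'little') % len(self.buckets)

    def insert(self, key, a, b):
        i = self._bucket(key)
        while self.buckets[i] != EMPTY:       # linear probing
            idx = self.buckets[i]
            if self.keys[idx] == key:         # update in place
                self.value_a[idx], self.value_b[idx] = a, b
                return
            i = (i + 1) % len(self.buckets)
        self.buckets[i] = len(self.keys)      # append to the dense arrays
        self.keys.append(key)
        self.value_a.append(a)
        self.value_b.append(b)

    def lookup(self, key):
        i = self._bucket(key)
        while self.buckets[i] != EMPTY:
            idx = self.buckets[i]
            if self.keys[idx] == key:
                return self.value_a[idx], self.value_b[idx]
            i = (i + 1) % len(self.buckets)
        return None
```

Note how iterating over `value_a` alone never touches key bytes (problem 4 above), and a reader that doesn't know `value_b` could simply skip that array on load.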

enkore commented 7 years ago

(Technically this is not breaking, since HashIndices are technically not essential)

ThomasWaldmann commented 7 years ago

About "removes in-band signalling for D/E":

Guess you'd need either another kind of in-band signalling (reserving some index values) or an extra column for it (extra memory needed).

About using uint32 as index:

Could be an issue for very large repos (> 2^32 chunks). This means quite large indexes (>160 GB repo index + >172 GB chunks + files cache), but that is not unthinkably large.

The overall data set size for such a repo doesn't need to be extremely large: as a small file always gives 1 chunk, you "only" need 2^32 little files.

Using (u)int64 indices fixes this, for a price.

2.0 milestone: good. We should first finish the robin hood stuff.

enkore commented 7 years ago

The index values are private to the hash table, i.e. that's not in-band signalling.

Could be an issue for very large repos (> 2^32 chunks).

To be fair, the current hash index stops growing at about 2^31 entries and then simply stops resizing the table until it reaches a load factor of 100 %, at which point inserts finally fail.

But I agree, the disk format should be 64-bit safe. Still, I think it'd be worthwhile to switch to 64-bit indices only when necessary, so that in practice everyone profits from 32-bit indices.
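The "64-bit safe on disk, 32-bit in practice" idea amounts to picking the stored index width from the entry count. A hedged sketch, with all names invented for this example:

```python
# Illustrative: choose the on-disk bucket width from the entry count, so
# the format is 64-bit safe while typical repos keep paying 4 bytes/bucket.
import struct

NUM_RESERVED = 2  # e.g. code points reserved for empty/deleted sentinels

def index_width(num_entries):
    # switch to 8-byte indices only once 4-byte ones (minus the
    # reserved code points) could no longer address every entry
    return 8 if num_entries > 2**32 - NUM_RESERVED else 4

def pack_bucket(value, width):
    # little-endian uint32 or uint64, matching the chosen width
    return struct.pack('<Q' if width == 8 else '<I', value)
```

The width would be recorded in the on-disk header alongside the column geometry, so readers know how to decode the bucket array.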

enkore commented 7 years ago

This design would in theory allow externalising columns: e.g. in the cache you could have a single key array shared by the chunks cache and all of chunks.archive.d (this requires extra work for consistency), which would largely nullify the remaining size issues of chunks.archive.d. With adequate abstractions you could even tap the files cache into that, mostly nullifying its size issues as well.

Again, with good enough abstractions you could even bridge the repository index and the client cache for local repositories so that you only need one key array for both, both in memory and on the disk.

If we had all of these, Borg's memory usage would be reduced by a huge amount.

This is a powerful concept. (But quite something to implement as well, even on a clean slate.)

ThomasWaldmann commented 7 years ago

It's internally in-band signalling: an index entry can be either an index or signal deleted/empty. :-P I agree, as it is internal, it's not a big problem; it just needs some reserved values. Considering 32/64-bit flexibility, the reserved indices should maybe be at the beginning, like 0 = empty, 1 = deleted.

enkore commented 7 years ago

Technically no, because that's the purpose of the index array: it tells you where to find something and whether it is in the structure in the first place.

The difference compared to some other data structures is in the overall encoding. In a hash table you need an explicit encoding for "not here", unlike e.g. trees and tries.

The encoding you bring up has a lot of advantages here and is clearly preferable to others. Especially zero = empty is an important choice here.
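The "reserved values at the beginning" encoding is small enough to show directly. A minimal sketch, assuming the 0/1 sentinels proposed above (the function names are made up here):

```python
# Illustrative reserved-value encoding for the bucket array: 0 = empty,
# 1 = deleted, and real entry indices are stored shifted by 2, so the
# key/value arrays stay densely indexed from 0. The same encoding works
# unchanged for 32-bit and 64-bit bucket widths.
EMPTY, DELETED, NUM_RESERVED = 0, 1, 2

def encode(entry_idx):
    # entry index -> bucket value
    return entry_idx + NUM_RESERVED

def decode(bucket):
    # bucket value -> entry index; caller must first check that
    # bucket is not EMPTY or DELETED
    return bucket - NUM_RESERVED
```

Putting the sentinels at the low end means a freshly zeroed bucket array is already "all empty", which is part of why zero = empty is such a convenient choice.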

ThomasWaldmann commented 7 years ago

Yes, would remove some redundancies. Files cache only shrinks when replacing the chunks id list with a chunks index list.

ThomasWaldmann commented 7 years ago

Considering memory errors like bitflips and that we might need more than 2 reserved values, we could reserve a few more (256?) and use a bit pattern for "deleted" like 0x10101010 (or alternatively: parity / ecc instead of pattern magic).

We want to avoid accidental flip from deleted -> empty by all means as this could make multiple chunks disappear from hashindex by making them unreachable.

enkore commented 7 years ago

I kinda doubt the efficiency of that approach, since (1) the index array is small, and (2) the code space is fully utilised (unless one goes for a really strange encoding and reduces the number of index bits by a lot), so going from 1 to 3 is equally likely as going from, say, 0x100... to 0x00... Now, if you make it a pattern, e.g. 0x81010..., then you're still hinging on that one bit. And if you make it an area of values, then it'd be better to use an actual ECC instead of inventing a bad one in situ...

[(1) outweighs most others imho]

ThomasWaldmann commented 7 years ago

It is efficient in the sense that a single bit flip (or a few) cannot cause the deleted -> free transition (nor free -> deleted). It would transition to "invalid" in both cases, and one could notice. The alternating bit pattern also makes 8-bit all-one or all-zero flips harmless (the latter would need free pattern = ~deleted pattern).

Note: this doesn't fully solve bad-memory issues; for that, one would need ECC for all the bits...

... which could be just another "ECC" or "check" column. In the most trivial/fast way: just xor all the entry's bytes and write that to the check column.

enkore commented 7 years ago

Another thing: Co-hashing is viable with this approach. That essentially eliminates wrong key lookups for probing chains.

index array, length = num_buckets, indexed by hash % num_buckets.
indices = 2×uint32_t, thus low load factors are basically free.

| index | co-hash |
| index | co-hash |

Since the hash is the first 32 bits of the 32-byte IDs, the co-hash could just take the last 32 bits, or bits 32 through 64. By comparing the co-hash before comparing the actual keys, virtually no negative key look-ups happen (how likely is a collision of both h % N and a separate 32-bit portion unrelated to h?), so for any lookup at most one key has to be compared. A negative lookup is handled virtually entirely by the index array.
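The co-hash probe described above can be sketched as follows. This is an illustrative Python version under the stated assumptions (hash = first 32 bits of the ID, co-hash = bits 32..64); the `EMPTY` sentinel and parameter names are invented for the example:

```python
# Illustrative co-hash probe: each bucket stores the entry index plus 32
# extra hash bits taken from the key. Comparing those 32 bits first means
# the larger, cache-cold key array is touched almost only on true hits.
EMPTY = 0xFFFFFFFF  # sentinel chosen for this sketch

def lookup(buckets, cohashes, keys, key):
    h = int.from_bytes(key[:4], 'little')    # primary hash: first 32 bits
    co = int.from_bytes(key[4:8], 'little')  # co-hash: bits 32..64
    i = h % len(buckets)
    while buckets[i] != EMPTY:               # linear probing
        # cheap 32-bit compare filters nearly all probe-chain mismatches
        if cohashes[i] == co and keys[buckets[i]] == key:
            return buckets[i]                # full key compared at most once
        i = (i + 1) % len(buckets)
    return None
```

A negative lookup usually terminates on the `buckets`/`cohashes` arrays alone, which is the "virtually no negative key look-ups" property claimed above.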

enkore commented 7 years ago

Given the broad swathe of advantages, this may be something for 1.3, but probably only for the cache (where the advantages make it devastatingly superior to the current solution even if it were slower, although I'd expect it to be faster due to the far better lookup described above).

ThomasWaldmann commented 1 week ago

I've recently played with a pure-Cython hash table; maybe we can have something like the above soon.