dpc / rdedup

Data deduplication engine, supporting optional compression and public key encryption.

Indexes and random access. #74

Open dpc opened 7 years ago

dpc commented 7 years ago

By writing index files one could have random access to stored data.

There is an unfortunate naming collision here. I'll call the current index files chunk-indexes, and the new one the byte-index.

Initial idea: for every chunk-index there could be a corresponding byte-index consisting of 128-bit values, one for each chunk in the chunk-index, describing the total size of the data from the beginning up to and including that chunk.

Using a binary search on the byte-index, one could quickly (O(log n)) identify the chunk storing a given portion of the data. This could e.g. be used to expose stored data as block devices or to mount it as a FUSE filesystem.
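A minimal sketch of that lookup, assuming the byte-index is simply a sorted array of cumulative (inclusive) end offsets; the function name and types here are illustrative, not rdedup's actual API:

```rust
/// Given cumulative end offsets (inclusive) per chunk, return the index of
/// the chunk containing byte `offset`, or None if the offset is past the end.
fn chunk_for_offset(byte_index: &[u128], offset: u128) -> Option<usize> {
    // partition_point returns the first entry whose cumulative end exceeds
    // `offset` -- that is the chunk containing the byte.
    let i = byte_index.partition_point(|&end| end <= offset);
    if i < byte_index.len() {
        Some(i)
    } else {
        None
    }
}

fn main() {
    // Three chunks of sizes 100, 50, 200 bytes -> cumulative ends 100, 150, 350.
    let idx = [100u128, 150, 350];
    assert_eq!(chunk_for_offset(&idx, 0), Some(0));
    assert_eq!(chunk_for_offset(&idx, 99), Some(0));
    assert_eq!(chunk_for_offset(&idx, 100), Some(1));
    assert_eq!(chunk_for_offset(&idx, 349), Some(2));
    assert_eq!(chunk_for_offset(&idx, 350), None);
}
```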

phillipCouto commented 7 years ago

Wouldn't it be ideal if the chunk-index and byte-index were combined?

So the format could be: | start position | chunk sum |

This way reads are optimized, since they only need to work with one set of files.
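A sketch of what one combined record might look like: fixed-width entries holding both fields, so entry i sits at a predictable byte offset and the index itself supports random access without parsing. Field names and the digest width are illustrative assumptions, not rdedup's actual on-disk format:

```rust
/// One hypothetical combined index record: | start position | chunk sum |
#[repr(C)]
struct IndexEntry {
    start: u64,           // byte offset where this chunk's data begins
    chunk_sum: [u8; 32],  // chunk digest (assuming a 256-bit hash)
}

fn main() {
    // With fixed-width records, entry i starts at i * RECORD_SIZE bytes,
    // so seeking to a given record needs no scan of preceding entries.
    const RECORD_SIZE: u64 = 8 + 32;
    let entry_index = 3u64;
    assert_eq!(entry_index * RECORD_SIZE, 120);
}
```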

Ralith commented 6 years ago

Discussion on gitter concluded that the offsets stored should be from the beginning of the leaves beneath the current chunk of the index, rather than from the beginning of the data stream. In other words, every chunk of the byte index starts counting from zero, whether it's the root chunk or an internal chunk deep in the tree. This allows incremental updates to only touch O(log n) chunks of the byte index.
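The subtree-relative scheme can be sketched as a recursive descent: at each level, binary-search the child ranges, then subtract the chosen child's start so the offset stays relative to that subtree. The types and names below are illustrative, not rdedup's actual structures:

```rust
/// A hypothetical index tree node. `ends` holds each child's cumulative end
/// offset, counted from the start of *this* subtree (not the whole stream).
enum IndexNode {
    Internal { ends: Vec<u64>, children: Vec<IndexNode> },
    Leaf { chunk_id: u64 },
}

/// Returns (chunk_id, offset within that chunk) for a byte offset given
/// relative to this subtree, or None if the offset is out of range.
fn lookup(node: &IndexNode, offset: u64) -> Option<(u64, u64)> {
    match node {
        IndexNode::Leaf { chunk_id } => Some((*chunk_id, offset)),
        IndexNode::Internal { ends, children } => {
            let i = ends.partition_point(|&end| end <= offset);
            if i >= children.len() {
                return None;
            }
            // Start of child i's range, relative to this subtree.
            let start = if i == 0 { 0 } else { ends[i - 1] };
            lookup(&children[i], offset - start)
        }
    }
}

fn main() {
    // Left subtree covers chunks 0 (100 B) and 1 (50 B); right subtree
    // covers chunk 2 (200 B). All `ends` are subtree-relative.
    let left = IndexNode::Internal {
        ends: vec![100, 150],
        children: vec![IndexNode::Leaf { chunk_id: 0 }, IndexNode::Leaf { chunk_id: 1 }],
    };
    let right = IndexNode::Internal {
        ends: vec![200],
        children: vec![IndexNode::Leaf { chunk_id: 2 }],
    };
    let root = IndexNode::Internal { ends: vec![150, 350], children: vec![left, right] };
    assert_eq!(lookup(&root, 120), Some((1, 20)));
    assert_eq!(lookup(&root, 160), Some((2, 10)));
    assert_eq!(lookup(&root, 350), None);
}
```

Because every subtree counts from zero, growing or shrinking one leaf only invalidates the `ends` entries along its root path, which is what makes incremental updates O(log n).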

Separate vs. unified indexes were also discussed. An advantage to having separate indexes is that a complete linear read, as used to do a full restore, will require less bandwidth in all cases because the unneeded offset data won't be fetched, and random reads will require less bandwidth on backends that support partial chunk reads (such as local disk) because only hashes that must be traversed to reach the bytes of interest will be read.

aidanhs commented 4 years ago

Three thoughts I want to note down related to this issue (but not to each other):

  1. when considering index design, it's well worth reading how bup does indexes as they've taken git and carefully considered how to extend it to their use-case

  2. I think random access makes sense in two scenarios - 1. you have a disk image you want to mount, or 2. you want to expose a bunch of files in a backup as a filesystem (per #161). 1 is fine, but 2 presents a problem if you don't have an appropriate container format for your backups (tar is very much not a good container format). bup deals with this by having first-class handling of files on a filesystem - it may be worth considering this route (see also #5), or at minimum finding a container format we want to recommend

  3. finally and on a bit of a tangent - I think it's worth reconsidering the on-disk format, as currently this is generating ~1 million files for a relatively modest backup of ~100GB. This is not ideal as I don't expect filesystems to be optimised for rdedup's use-case (if nothing else, we don't need the metadata the filesystem is obliged to store) and it becomes massively worse if you want smaller chunk sizes (which I do, because bup's 8k gives me 20% savings on this backup). Tools also often struggle with millions of files (just running stat on them all takes a long time). By comparison, bup creates about 50 files for the same size backup. I'll raise an issue with more notes when I have a more concrete proposal in mind (see also #131)

dpc commented 4 years ago

@aidanhs

  1. So far rdedup has been content independent. Everything is just one big blob, and it's up to the user how to interpret it. I could imagine someone just dumping a filesystem image into rdedup, etc.

  2. There were some plans for packing chunks. Also, the chunk size is configurable, at the expense of deduplication efficiency when small pieces of data change.

geek-merlin commented 3 years ago

Note: One could also mitigate the huge-directory issue via prefix dirs (blobs/ac/ab/acab0815.blob), like git does. (EDIT: this had been done long ago.)
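The fan-out itself is just path construction from the leading hex digits of the digest. A sketch following the example path above (the exact layout rdedup settled on may differ):

```rust
/// Shard a blob into nested two-hex-digit prefix directories, git-style,
/// so no single directory ends up holding millions of entries.
fn blob_path(hex_digest: &str) -> String {
    // Assumes `hex_digest` is a lowercase hex string at least 4 chars long.
    format!("blobs/{}/{}/{}.blob", &hex_digest[..2], &hex_digest[2..4], hex_digest)
}

fn main() {
    assert_eq!(blob_path("acab0815"), "blobs/ac/ab/acab0815.blob");
}
```

With two levels of 256-way fan-out, a million blobs average out to roughly 15 files per leaf directory.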