Open bnewbold opened 6 years ago
we decided to disable rabin fingerprinting by default earlier this year, around when we switched to blake hashes. the reason was it limited import speed by a factor of around 4X. now we use fixed size 64k chunks by default.
yes
that's the scenario for rabin, right? with fixed size chunks we only dedupe if the chunk content is the same, nothing smarter than that
Thanks for the reply!
This seems to be other projects' experience as well (eg, IPFS). Hopefully somebody (a researcher?) will come up with a faster robust chunking scheme some day.
I'm confused how this works with hyperdrive, which specifies only a single (blocks, offset) pointer for a given version of a file, implying that all file content must be stored in a single contiguous chunk in the data register. If two ~10 MB files each shared the first ~8 MB of data (with ~2 MB of unique data), how would the second file be deduplicated against the first (for the fixed chunks that are the same)? This also applies to file changes. I can see how appending to the most recent file just adds chunks, and the Stat entry can be extended easily for each version, but if an unrelated file was added (with new chunks) before the append, the appended file's chunks would now be fragmented, and I don't see how hyperdrive would handle the deduplication.
For now, de-duplication (adding locally or downloading) doesn't happen in the official dat client or hyper* libraries.
Three thoughts about de-dupe:
If we had complete hashes of entire files as part of Stat metadata, we could index that and check if the whole file already exists before adding/downloading. This isn't the current spec.
As-is, it would probably work well enough to look up the file size and first chunk hash in the complete data history (tree). If those matched, it would be worth spending the computational resources to hash the new file and the (contiguous chunks of) old file; if they are an exact match, just point backwards in the data register.
Finally, many of these de-dupe optimizations are interesting both in a single (large) dat archive or when used against a huge repository of thousands of archives: doing download or disk-storage de-dupe between archives, potentially taking advantage of a novel storage backend. One simple back-end design would be a content-addressed key/value (hash/data) lookup of, eg, many TB of data.
Seems like it should be an option. It sounds like you can still do it though by using the modules.
here's a code comment on a way to implement this in hyperdrive https://github.com/mafintosh/hyperdrive/blob/master/index.js#L553
I'm quite interested in deduplication. I couldn't find software which does both deduplication of data in transit (incremental transfers, with variable size chunks) and deduplication of data at rest (with a chosen scope). It seems like a sweet design spot, especially with a p2p approach.
Are you aware of https://github.com/ronomon/deduplication ?
Fast multi-threaded content-dependent chunking deduplication for Buffers in C++
It achieves 1 GB+/sec throughput. If I'm not mistaken, Rabin-Karp hashes at throughputs on the order of 100 MB/s. It seems like that could help deal with the import speed issues which led to removing content fingerprinting?
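For context, a toy Gear-style content-defined chunker (ronomon/deduplication uses a similar rolling-hash approach; this sketch is illustrative only, not its actual code or API). Boundaries are chosen where the rolling hash's low bits are zero, so they depend on content rather than position, and an inserted byte only disturbs nearby chunks:

```javascript
// Deterministic pseudo-random table: one 32-bit value per byte value.
function makeGearTable (seed = 0x9e3779b9) {
  const table = new Uint32Array(256)
  let x = seed
  for (let i = 0; i < 256; i++) {
    x ^= x << 13; x >>>= 0
    x ^= x >>> 17
    x ^= x << 5; x >>>= 0
    table[i] = x
  }
  return table
}

const GEAR = makeGearTable()

// Cut where the low `bits` of the rolling hash are zero, giving average
// chunks around 2^bits bytes, bounded by min/max. The hash needs no
// explicit window: left-shifting ages old bytes out after 32 steps.
function cdcChunks (buf, { bits = 13, min = 2048, max = 65536 } = {}) {
  const mask = (1 << bits) - 1
  const chunks = []
  let start = 0
  let h = 0
  for (let i = 0; i < buf.length; i++) {
    h = ((h << 1) + GEAR[buf[i]]) >>> 0
    const len = i + 1 - start
    if ((len >= min && (h & mask) === 0) || len >= max) {
      chunks.push(buf.slice(start, i + 1))
      start = i + 1
      h = 0
    }
  }
  if (start < buf.length) chunks.push(buf.slice(start))
  return chunks
}
```

Gear-style chunking is fast because the per-byte work is one shift, one add, and one table lookup, which is part of how such libraries reach the throughput quoted above.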
If I understand correctly there are a number of aspects to consider to implement deduplication in dat:
- dat-store
> here's a code comment on a way to implement this in hyperdrive https://github.com/mafintosh/hyperdrive/blob/master/index.js#L553
Here's a permalink to that line on master today: https://github.com/mafintosh/hyperdrive/blob/b7121f5ecc97596722af4b65ecea74b8f2158774/index.js#L404
@mafintosh Could hypertrie's customisable rolling hash function be used for deduplication and/or incremental transfers?
@andrewosh Could the corestore approach to have the content feed be a dependent of the metadata feed be used to add content aware chunking? Seems like namespaces could be a good way to define deduplication scopes too...
people are very busy i guess - just saw your comment. you could try again here: https://gitter.im/datproject/discussions
Yes, we are active & busy with other things. @jmatsushita to get the ball rolling on something like this it would be a good idea to join a comm-comm meeting for finding allies and gaining perspective.
Three questions about the current/stable dat client and ecosystem library behavior:
Questions 1+2 are just curiosity about current behavior; I don't see anything in the spec that would prevent clients from implementing these optimizations in the future. Question 3 comes after working on an implementation; maybe I need to go back and re-read the spec.