Open bnewbold opened 6 years ago
we decided to disable rabin fingerprinting by default earlier this year, around when we switched to blake hashes. the reason was it limited import speed by a factor of around 4X. now we use fixed size 64k chunks by default.
yes
that's the scenario for rabin, right? with fixed size chunks we only dedupe if the chunk content is the same, nothing smarter than that
Thanks for the reply!
This seems to be other projects' experience as well (eg, IPFS). Hopefully somebody (a researcher?) will come up with a faster robust chunking scheme some day.
I'm confused how this works with hyperdrive, which specifies only a single (blocks, offset) pointer for a given version of a file, implying that all file content must be stored in a single contiguous chunk in the data register. If two ~10 MB files each shared the first ~8 MB of data (with ~2 MB of unique data), how would the second file be deduplicated against the first (for the fixed chunks that are the same)? This also applies to file changes. I can see how appending to the most recent file just adds chunks, and the Stat entry can be extended easily for each version, but if an unrelated file was added (with new chunks) before the append, the appended file's chunks would now be fragmented, and I don't see how hyperdrive would handle the deduplication.
For now, de-duplication (adding locally or downloading) doesn't happen in the official dat client or hyper* libraries.
Three thoughts about de-dupe:
If we had complete hashes of entire files as part of Stat metadata, we could index that and check if the whole file already exists before adding/downloading. This isn't the current spec.
As-is, it would probably work well enough to look up the file size and first chunk hash in the complete data history (tree). If those matched, it would be worth spending the computational resources to hash the new file and the (contiguous chunks of) old file; if they are an exact match, just point backwards in the data register.
Finally, many of these de-dupe optimizations are interesting both in a single (large) dat archive or when used against a huge repository of thousands of archives: doing download or disk-storage de-dupe between archives, potentially taking advantage of a novel storage backend. One simple back-end design would be a content-addressed key/value (hash/data) lookup of, eg, many TB of data.
Seems like it should be an option. It sounds like you can still do it though by using the modules.
here's a code comment on a way to implement this in hyperdrive https://github.com/mafintosh/hyperdrive/blob/master/index.js#L553
I'm quite interested in deduplication. I couldn't find software which does both deduplication of data in transit (incremental transfers, with variable size chunks) and deduplication of data at rest (with a chosen scope). It seems like a sweet design spot, especially with a p2p approach.
Are you aware of https://github.com/ronomon/deduplication ?
Fast multi-threaded content-dependent chunking deduplication for Buffers in C++
It achieves 1 GB+/sec throughput. If I'm not mistaken, Rabin-Karp hashes at throughputs on the order of 100 MB/s. It seems like that could help deal with the import speed issues which led to removing content fingerprinting?
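For context, a toy Gear-style content-defined chunker (ronomon/deduplication uses a similar rolling-hash approach; this sketch is illustrative only, not its actual code or API). Boundaries are chosen where the rolling hash's low bits are zero, so they depend on content rather than position, and an inserted byte only disturbs nearby chunks:

```javascript
// Deterministic pseudo-random table: one 32-bit value per byte value.
function makeGearTable (seed = 0x9e3779b9) {
  const table = new Uint32Array(256)
  let x = seed
  for (let i = 0; i < 256; i++) {
    x ^= x << 13; x >>>= 0
    x ^= x >>> 17
    x ^= x << 5; x >>>= 0
    table[i] = x
  }
  return table
}

const GEAR = makeGearTable()

// Cut where the low `bits` of the rolling hash are zero, giving average
// chunks around 2^bits bytes, bounded by min/max. The hash needs no
// explicit window: left-shifting ages old bytes out after 32 steps.
function cdcChunks (buf, { bits = 13, min = 2048, max = 65536 } = {}) {
  const mask = (1 << bits) - 1
  const chunks = []
  let start = 0
  let h = 0
  for (let i = 0; i < buf.length; i++) {
    h = ((h << 1) + GEAR[buf[i]]) >>> 0
    const len = i + 1 - start
    if ((len >= min && (h & mask) === 0) || len >= max) {
      chunks.push(buf.slice(start, i + 1))
      start = i + 1
      h = 0
    }
  }
  if (start < buf.length) chunks.push(buf.slice(start))
  return chunks
}
```

Gear-style chunking is fast because the per-byte work is one shift, one add, and one table lookup, which is part of how such libraries reach the throughput quoted above.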
If I understand correctly there are a number of aspects to consider to implement deduplication in dat:
- dat-store
> here's a code comment on a way to implement this in hyperdrive https://github.com/mafintosh/hyperdrive/blob/master/index.js#L553
Here's a permalink to that line on master today: https://github.com/mafintosh/hyperdrive/blob/b7121f5ecc97596722af4b65ecea74b8f2158774/index.js#L404
@mafintosh Could hypertrie's customisable rolling hash function be used for deduplication and/or incremental transfers?
@andrewosh Could the corestore approach to have the content feed be a dependent of the metadata feed be used to add content aware chunking? Seems like namespaces could be a good way to define deduplication scopes too...
people are very busy i guess - just saw your comment. you could try again here: https://gitter.im/datproject/discussions
Yes, we are active & busy with other things. @jmatsushita to get the ball rolling on something like this it would be a good idea to join a comm-comm meeting for finding allies and gaining perspective.
Three questions about the current/stable dat client and ecosystem library behavior:
Questions 1+2 are just curiosity about current behavior; I don't see anything in the spec that would prevent clients from implementing these optimizations in the future. Question 3 comes after working on an implementation; maybe I need to go back and re-read the spec.